NOTE
Communicated by Heiga Zen
How to Pretend That Correlated Variables Are Independent by Using Difference Observations

Christopher K. I. Williams
[email protected]
School of Informatics, University of Edinburgh, Edinburgh EH1 2QL, U.K.
In many areas of data modeling, observations at different locations (e.g., time frames or pixel locations) are augmented by differences of nearby observations (e.g., δ features in speech recognition, Gabor jets in image analysis). These augmented observations are then often modeled as being independent. How can this make sense? We provide two interpretations, showing (1) that the likelihood of data generated from an autoregressive process can be computed in terms of “independent” augmented observations and (2) that the augmented observations can be given a coherent treatment in terms of the products of experts model (Hinton, 1999).
1 Introduction

In automatic speech recognition, it is often the case that hidden Markov models (HMMs) are used on observation vectors that are augmented by difference observations (so-called δ features; see Furui, 1986). Under the HMM, each observation vector is modeled as being conditionally independent given the hidden state. How can this make sense, as close-by differences are clearly not independent? A similar difficulty arises in image analysis tasks such as texture segmentation (see, e.g., Dunn & Higgins, 1995). Here derivative features obtained from Gabor filters or wavelet analysis, for example, are modeled as being independent at different locations, despite the fact that these features will have been computed sharing some pixels in common.

In this article, we present two solutions to this problem. In section 2, we show that if the data are generated from a vector autoregressive (AR) model, then the likelihood can be expressed in terms of "independent" difference observations. In section 3, we show that the local models at each location can be combined using a product of experts model (Hinton, 1999) to provide a well-defined joint model for the data, and that this can be related to AR models. Section 4 discusses how these interpretations are affected if the local models are conditional on a hidden state variable, as is the case for HMMs.

Neural Computation 17, 1–6 (2005)
© 2004 Massachusetts Institute of Technology
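Before the formal treatment below, the following minimal sketch (added for illustration, not part of the original note; real speech front ends typically compute δ features by regression over several neighboring frames rather than by a single difference) shows the kind of augmented observation vector being discussed, using the wraparound boundary convention adopted in section 2:

```python
import numpy as np

def add_delta_features(X):
    """Augment each frame X[t] with the difference X[t] - X[t-1].

    X is a (T, D) array of static feature vectors.  The index t - 1 is
    taken mod T (wraparound), mirroring the periodic boundary
    conditions used for the AR model in section 2.
    """
    delta = X - np.roll(X, 1, axis=0)          # row t holds X[t] - X[t-1]
    return np.concatenate([X, delta], axis=1)  # (T, 2D) augmented frames

X = np.arange(12, dtype=float).reshape(4, 3)   # toy static features
Y = add_delta_features(X)                      # rows play the role of Y_t
```

Modeling the rows of Y as independent is exactly the practice this note sets out to justify.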
2 An AR Model

Consider a temporal vector autoregressive model,
$$X_t = \sum_{i=1}^{p} A_i X_{t-i} + N_t, \qquad (2.1)$$
where the $A_i$'s are square matrices and $N_t$ is independent and identically distributed gaussian noise $\sim N(0, \Sigma_N)$. $X_t$ and $N_t$ have dimension $D$ for all $t$. To avoid complicated end effects, we will use periodic (wraparound) boundary conditions, so that the subscript $t - i$ should be read $\mathrm{mod}(t - i, N)$. Thus, there are $N$ random variables $X_0, \ldots, X_{N-1}$, which collectively we denote as $\mathbf{X}$, and similarly for $\mathbf{N}$. Then $\mathbf{X}$ and $\mathbf{N}$ are related by $\mathbf{N} = T\mathbf{X}$ for an appropriate matrix $T$. Thus,
$$P(\mathbf{X}) \propto \prod_{t=0}^{N-1} \exp\left(-\frac{1}{2} N_t^T \Sigma_N^{-1} N_t\right) \qquad (2.2)$$
$$= \prod_{t=0}^{N-1} \exp\left(-\frac{1}{2} \Big(\sum_{i=0}^{p} A_i X_{t-i}\Big)^T \Sigma_N^{-1} \Big(\sum_{i=0}^{p} A_i X_{t-i}\Big)\right), \qquad (2.3)$$
where we have set $A_0 = -I$ so that $N_t = -\sum_{i=0}^{p} A_i X_{t-i}$.

Now let $Y_t^0, \ldots, Y_t^p$ be linearly independent linear combinations of $X_t, \ldots, X_{t-p}$. For example, we could choose $Y_t^0 = X_t$, $Y_t^1 = X_t - X_{t-1}$, and so on. As the $Y_t^i$'s are simple linear combinations of $X_t, \ldots, X_{t-p}$, we have
$$\sum_{i=0}^{p} A_i X_{t-i} = \sum_{i=0}^{p} B_i Y_t^i, \qquad (2.4)$$
for some set of matrices $B_i$. We can now write
$$P(\mathbf{X}) \propto \prod_{t=0}^{N-1} \exp\left(-\frac{1}{2} \Big(\sum_{i=0}^{p} B_i Y_t^i\Big)^T \Sigma_N^{-1} \Big(\sum_{i=0}^{p} B_i Y_t^i\Big)\right), \qquad (2.5)$$
showing that the likelihood of the underlying $\mathbf{X}$ process can be expressed in terms of a product of terms involving the difference observations up to order $p$ at each time. Stacking $Y_t^0, Y_t^1, \ldots, Y_t^p$ as the vector $Y_t$, we have
$$P(\mathbf{X}) \propto \prod_{t=0}^{N-1} \exp\left(-\frac{1}{2} Y_t^T M Y_t\right), \qquad (2.6)$$
where the $(i, j)$ block of the matrix $M$ (between $Y_t^i$ and $Y_t^j$) has the form $B_i^T \Sigma_N^{-1} B_j$. Equation 2.6 almost looks like a product of independent gaussians,
but note that $M$ is singular (it has rank $D$, as it arises from $N_t$), so the correct normalization factor of the gaussian cannot be obtained from it.

As a simple example, consider the scalar AR(1) process $X_t = \alpha X_{t-1} + N_t$ and set $Y_t^0 = X_t$, $Y_t^1 = X_t - X_{t-1}$. Thus,
$$X_t - \alpha X_{t-1} = (1 - \alpha)X_t + \alpha(X_t - X_{t-1}) \qquad (2.7)$$
$$= (1 - \alpha)Y_t^0 + \alpha Y_t^1. \qquad (2.8)$$
To obtain the likelihood for the sequence $\mathbf{X}$, the matrix $M$ will have the form
$$M = \frac{1}{\sigma_n^2}\begin{pmatrix} (1-\alpha)^2 & \alpha(1-\alpha) \\ \alpha(1-\alpha) & \alpha^2 \end{pmatrix}, \qquad (2.9)$$
where $\sigma_n^2 = \mathrm{var}(N_t)$. As expected, $M$ has rank 1 (it is an outer product). Interestingly, the matrix $M$ is not equal to the inverse covariance of the $Y_t$'s derived from the distribution for $\mathbf{X}$. To show this, we first use the result that for the scalar AR(1) process on the circle, the covariance $C[j] = \langle X_t X_{t-j}\rangle$ is given by
$$C[j] = \frac{\sigma_n^2\,(\alpha^{|j|} + \alpha^{|N-j|})}{(1-\alpha^2)(1-\alpha^N)}. \qquad (2.10)$$
Thus,
$$\mathrm{cov}(Y_t) = \begin{pmatrix} \langle Y_t^0 Y_t^0\rangle & \langle Y_t^0 Y_t^1\rangle \\ \langle Y_t^1 Y_t^0\rangle & \langle Y_t^1 Y_t^1\rangle \end{pmatrix} = \begin{pmatrix} C[0] & C[0]-C[1] \\ C[0]-C[1] & 2(C[0]-C[1]) \end{pmatrix}. \qquad (2.11)$$
Inversion of $\mathrm{cov}(Y_t)$ shows that it is not equal to $M$ as given in equation 2.9. Notice that the joint distribution of $Y_0, \ldots, Y_{N-1}$ is singular.

If we take an AR process on the $\mathbf{X}$ variables, then one can choose linear combinations of the $X_t$'s that are truly independent by carrying out an eigenanalysis. (For the periodic boundary conditions described above and time-invariant coefficients, the eigenbasis would be the Fourier basis.) However, if we allow ourselves an overcomplete basis set, then we have shown that the likelihood of $\mathbf{X}$ under the AR process can readily be computed using "independent" densities at each location. Although we have given the derivation above using gaussian noise, the conclusion concerning expressing the likelihood of the $\mathbf{X}$ sequence in terms of a product of terms involving the $Y_t$'s is in fact independent of the form of the noise driving the AR process.

It is also possible to extend the AR model described above beyond the temporal one-dimensional chain. For example, Abend, Harley, and Kanal (1965) describe Markov mesh models in two dimensions. A simple example of such a model is a "third-order" Markov mesh, where $X_{i,j}$ depends autoregressively on $X_{i,j-1}$, $X_{i-1,j-1}$, and $X_{i-1,j}$. The same construction in terms of $Y$ variables can be used in this case.
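The scalar example above can be checked numerically. This sketch (added here for illustration; it is not part of the original note) verifies that the sum of the "independent" quadratic forms $Y_t^T M Y_t$, with $M$ as in equation 2.9, equals the AR(1) quadratic form $\sum_t (X_t - \alpha X_{t-1})^2/\sigma_n^2$ for an arbitrary sequence under wraparound boundary conditions:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, sigma2 = 0.7, 0.5
N = 8
X = rng.normal(size=N)  # any sequence will do; it need not come from the AR model

# Direct AR(1) quadratic form (X[t-1] wraps around at t = 0).
q_ar = sum((X[t] - alpha * X[t - 1]) ** 2 for t in range(N)) / sigma2

# The same quantity via the augmented observations Y_t = (X_t, X_t - X_{t-1})^T
# and the rank-1 matrix M of equation 2.9.
M = np.array([[(1 - alpha) ** 2, alpha * (1 - alpha)],
              [alpha * (1 - alpha), alpha ** 2]]) / sigma2
q_y = 0.0
for t in range(N):
    y = np.array([X[t], X[t] - X[t - 1]])
    q_y += y @ M @ y
```

Since q_ar and q_y agree, the per-frame factors $\exp(-\frac{1}{2}Y_t^T M Y_t)$ multiply to the AR likelihood up to normalization, as in equation 2.6.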
3 Product of Experts Interpretation

At an individual location, we have a model $P_t(Y_t)$ for the augmented vector $Y_t$. To define a joint distribution on $\mathbf{X}$, we set
$$P(\mathbf{X}) = \frac{1}{Z}\prod_t P_t(Y_t), \qquad (3.1)$$
where $Z$ is a normalization constant (known in statistical physics as the partition function). This is the product of experts construction (Hinton, 1999). One can also think of this as a Markov random field construction, where $P(\mathbf{X}) \propto \exp(-E(\mathbf{X}))$ and $E(\mathbf{X}) = -\sum_t \log P_t(Y_t)$. If each $P_t(Y_t)$ is gaussian, then $P(\mathbf{X})$ will also be gaussian, and $Z = (2\pi)^{N/2}|C|^{1/2}$, where $C$ is the covariance matrix of $\mathbf{X}$.

Again we consider a simple example relating to a scalar AR(1) process, so $Y_t = (X_t, X_t - X_{t-1})^T$. Let
$$P_t(Y_t) \propto \exp\left(-\frac{1}{2}\{a_0 X_t^2 + a_1 (X_t - X_{t-1})^2\}\right), \qquad (3.2)$$
with $a_0, a_1 > 0$. Then we obtain the joint distribution,
$$P(\mathbf{X}) \propto \exp\left(-\frac{1}{2}\Big\{a_0 \sum_t X_t^2 + a_1 \sum_t (X_t - X_{t-1})^2\Big\}\right). \qquad (3.3)$$
$C^{-1}$, the inverse covariance matrix of $\mathbf{X}$, is circulant, with entries $a_0 + 2a_1$ on the diagonal and $-a_1$ in the bands above and below the diagonal and in the northeast and southwest corners. For the AR(1) process $X_t = \alpha X_{t-1} + N_t$ with $N_t \sim N(0, \beta^{-1})$, we obtain corresponding entries of $\beta(1 + \alpha^2)$ on the diagonal and $-\beta\alpha$ off the diagonal. The overall scale of $a_0$ and $a_1$ has the same effect as $\beta$ in setting the variance of the process, but $r \stackrel{\mathrm{def}}{=} a_0/a_1 = (1-\alpha)^2/\alpha$, so for any given $\alpha$ value, there is a corresponding value of $r$.¹

For the gaussian case with expert $t$ involving interactions between $X_t$ and $X_{t-p}$, we obtain a quadratic form with the same pattern of banding as in the inverse covariance matrix of an AR($p$) process, but as above, for some choices of parameters there may not be a corresponding AR process.

Again this construction can be extended to two (or more) dimensions. For example, in 2D we might consider the variable $X_{i,j}$ and the differences to its four neighbors to the north, south, east, and west to obtain a five-dimensional $Y$ vector. Equation 3.1, with each expert being gaussian, then defines a gaussian Markov random field over the lattice of $\mathbf{X}$ variables.

¹ Interestingly, for $r \in (-4, 0)$, there are no corresponding values of $\alpha$. Note that $\alpha = 0 \Rightarrow a_1 = 0$.
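The matching between the product of experts parameters and the AR(1) process can likewise be verified numerically. In this sketch (again not part of the original note), we set $a_1 = \beta\alpha$ and $a_0 = \beta(1-\alpha)^2$, the values implied by equating the diagonal and off-diagonal entries of the two circulant inverse covariances described above:

```python
import numpy as np

N, alpha, beta = 6, 0.4, 2.0

# AR(1) precision on the circle: beta * (1 + alpha^2) on the diagonal,
# -beta * alpha on the circulant bands (including the corner entries).
P_ar = np.zeros((N, N))
for t in range(N):
    P_ar[t, t] = beta * (1 + alpha ** 2)
    P_ar[t, (t + 1) % N] = -beta * alpha
    P_ar[t, (t - 1) % N] = -beta * alpha

# Product-of-experts precision: a0 + 2*a1 on the diagonal, -a1 off it.
a1 = beta * alpha
a0 = beta * (1 - alpha) ** 2        # so r = a0 / a1 = (1 - alpha)^2 / alpha
P_poe = np.zeros((N, N))
for t in range(N):
    P_poe[t, t] = a0 + 2 * a1
    P_poe[t, (t + 1) % N] = -a1
    P_poe[t, (t - 1) % N] = -a1
```

The two matrices coincide entry by entry, confirming the identification $a_0 + 2a_1 = \beta(1 + \alpha^2)$.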
4 Incorporating Hidden State

In speech recognition using HMMs, the $Y_t$'s are modeled as conditionally independent given the discrete hidden variable $s_t$. We now consider how this affects the interpretations given above.

For interpretation 1, we consider a switching AR($p$) process or AR-HMM (see, e.g., Woodland, 1992), so that $X_t$ depends on $X_{t-1}, \ldots, X_{t-p}$ and also $s_t$. For example, using gaussian noise and setting $s_t = k$, we have $X_t \sim N(\sum_{i=1}^{p} A_i^k X_{t-i}, \Sigma_k)$. Notice that the AR model parameters now depend on the switching variable. However, we can still write the prediction $\sum_{i=1}^{p} A_i^k X_{t-i}$ as a linear combination of the $Y_t^i$'s, so the likelihood can be written in the form of "independent" contributions from the $Y_t$'s. Note that the usual forward and backward HMM recursions can be carried out for the AR-HMM.

For interpretation 2, we have the individual component densities $P_t(Y_t \mid s_t)$ and the joint distribution
$$P(\mathbf{X} \mid \mathbf{s}) = \frac{1}{Z(\mathbf{s})}\prod_t P_t(Y_t \mid s_t), \qquad (4.1)$$
where $\mathbf{s} = (s_0, \ldots, s_{N-1})$. Notice that the normalization constant in general depends on $\mathbf{s}$, and thus when given $\mathbf{X}$, the computation of $P(\mathbf{X} \mid \mathbf{s})$ depends not only on the component densities but also on $Z(\mathbf{s})$. However, if $P_t(Y_t \mid s_t)$ is gaussian and has the same covariance structure but different means depending on $s_t$ for all $t$, then $Z$ would turn out to be independent of $\mathbf{s}$.

While writing this article, I became aware of the work of Tokuda, Zen, and Kitamura (2003), who correctly derive the product of gaussian experts construction conditional on $\mathbf{s}$ and note the general dependence of $Z(\mathbf{s})$ on $\mathbf{s}$. They also observe that use of the Viterbi algorithm to find the state sequence $\mathbf{s}$ that maximizes $P(\mathbf{s})\prod_t P_t(Y_t \mid s_t)$ (which is easily done with standard dynamic programming techniques) will not, in general, yield the sequence that maximizes $P(\mathbf{s} \mid \mathbf{X})$, because of the $Z(\mathbf{s})$ term.

Most practical HMM-based speech recognition systems use mixtures of gaussians to model the $Y_t$'s at each frame. The product of experts interpretation readily handles this situation. For an AR model interpretation, the use of a mixture distribution for the $Y_t$'s already suggests a switching AR process with the switching variable hidden.

5 Discussion

We have described both conditionally specified models (AR processes) and simultaneously specified models (products of experts) to define the joint density² $P(\mathbf{X})$ and relate it to the augmented feature vectors $\{Y_t\}$.
² This terminology is derived from Cressie (1993, sec. 6.3).
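As a concrete illustration of the remark in section 4 that the usual forward recursion carries over to the AR-HMM, here is a sketch for a scalar switching AR(1). All parameter values, and the choice to score the first frame against a zero-mean gaussian, are illustrative assumptions, not taken from the article:

```python
import numpy as np

def gauss_pdf(x, mean, var):
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

def ar_hmm_forward(X, pi, A, ar_coef, noise_var):
    """Forward recursion: fwd[t, k] = p(x_0..x_t, s_t = k).

    The emission for state k at time t is N(x_t; ar_coef[k] * x_{t-1},
    noise_var[k]); the first frame is scored against N(0, noise_var[k]).
    """
    T, K = len(X), len(pi)
    fwd = np.zeros((T, K))
    fwd[0] = pi * gauss_pdf(X[0], 0.0, noise_var)
    for t in range(1, T):
        emit = gauss_pdf(X[t], ar_coef * X[t - 1], noise_var)
        fwd[t] = (fwd[t - 1] @ A) * emit
    return fwd

pi = np.array([0.5, 0.5])                  # initial state probabilities
A = np.array([[0.9, 0.1], [0.2, 0.8]])     # transition matrix
ar_coef = np.array([0.8, -0.5])            # state-dependent AR coefficients
noise_var = np.array([0.5, 1.0])           # state-dependent noise variances
X = np.array([0.2, 0.3, -0.1, 0.4])
fwd = ar_hmm_forward(X, pi, A, ar_coef, noise_var)
likelihood = fwd[-1].sum()                 # p(x_0..x_{T-1}) under the AR-HMM
```

The only change from the standard HMM forward pass is that the emission density at time t conditions on the previous observation, exactly as the AR-HMM likelihood factorizes.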
While this article describes a theoretical framework for understanding why using difference observations makes sense, it would be interesting to examine empirically the question of how well AR and products of experts models characterize the dependencies between time frames or pixel locations.

Acknowledgments

This note was inspired by questions raised by Joe Frankel's Ph.D. thesis. Thanks to John Bridle and Joe Frankel for helpful conversations and comments on earlier drafts, to Joe Frankel for drawing my attention to Tokuda et al. (2003), and to the anonymous referees for their comments, which helped to improve the article.

References

Abend, K., Harley, T. J., & Kanal, L. N. (1965). Classification of binary random patterns. IEEE Transactions on Information Theory, 11(4), 538–544.
Cressie, N. A. C. (1993). Statistics for spatial data. New York: Wiley.
Dunn, D., & Higgins, W. E. (1995). Optimal Gabor filters for texture segmentation. IEEE Transactions on Image Processing, 4(7), 947–964.
Furui, S. (1986). Speaker-independent isolated word recognition using dynamic features of speech spectrum. IEEE Transactions on Acoustics, Speech and Signal Processing, 34, 52–59.
Hinton, G. E. (1999). Products of experts. In Proceedings of the Ninth International Conference on Artificial Neural Networks (Vol. 1, pp. 1–6). London: IEE.
Tokuda, K., Zen, H., & Kitamura, T. (2003). Trajectory modelling based on HMMs with the explicit relationship between static and dynamic features. In Proceedings of the 8th European Conference on Speech Communication and Technology (Eurospeech 2003). N.P.: International Speech Communication Association.
Woodland, P. C. (1992). Hidden Markov models using vector linear prediction and discriminative output distributions. In Proceedings of 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing (Vol. I, pp. 509–512). Piscataway, NJ: IEEE.

Received February 3, 2004; accepted May 28, 2004.
NOTE
Communicated by Thorsten Joachims
Convergence of the IRWLS Procedure to the Support Vector Machine Solution

Fernando Pérez-Cruz
[email protected]
Gatsby Computational Neuroscience Unit, 17 Queen Square, London WC1N 3AR, U.K., and Department of Signal Theory and Communications, University Carlos III in Madrid, Avda Universidad 30, 28911 Leganés (Madrid), Spain
Carlos Bousoño-Calzón
[email protected]

Antonio Artés-Rodríguez
[email protected]
Department of Signal Theory and Communications, University Carlos III in Madrid, Avda Universidad 30, 28911 Leganés (Madrid), Spain
An iterative reweighted least squares (IRWLS) procedure recently proposed is shown to converge to the support vector machine solution. The convergence to a stationary point is ensured by modifying the original IRWLS procedure.

1 Introduction

Support vector machines (SVMs) are state-of-the-art tools for linear and nonlinear input-output knowledge discovery (Vapnik, 1998; Schölkopf & Smola, 2001). The SVM relies on the minimization of a quadratic problem, which is frequently solved using quadratic programming (QP) (Burges, 1998). The iterative reweighted least squares (IRWLS) procedure for solving the SVM for classification was introduced in Pérez-Cruz, Navia-Vázquez, Rojo-Álvarez, and Artés-Rodríguez (1999) and Pérez-Cruz, Navia-Vázquez, Alarcón-Diana, and Artés-Rodríguez (2001), and it was used in Pérez-Cruz, Alarcón-Diana, Navia-Vázquez, and Artés-Rodríguez (2000) to construct the fastest SVM solver of the time. It solves a sequence of weighted least-squares problems that, unlike other least-squares procedures such as Lagrangian SVMs (Mangasarian & Musicant, 2000) or least squares SVMs (Suykens & Vandewalle, 1999; Van Gestel et al., 2004), leads to the true SVM solution, as we will show here. However, to prove its convergence to the SVM solution, the IRWLS procedure has to be modified with respect to the formulation that appears in Pérez-Cruz et al. (1999, 2001). The IRWLS has also been proposed for solving regression problems (Pérez-Cruz, Navia-Vázquez, Alarcón-Diana, & Artés-Rodríguez, 2000). Although we will deal only with the IRWLS for classification, the extension of this proof to regression is straightforward.

Neural Computation 17, 7–18 (2005)
© 2004 Massachusetts Institute of Technology

The article is organized as follows. We prove the convergence of the IRWLS procedure to the SVM solution in section 2 and summarize the algorithmic implementation of it in section 3. We conclude with some comments in section 4.

2 Proof of Convergence of the IRWLS Algorithm to the SVC Solution

The support vector classifier (SVC) seeks to compute the dependency between a set of patterns $x_i \in \mathbb{R}^d$ ($i = 1, \ldots, n$) and their corresponding labels $y_i \in \{\pm 1\}$, given a transformation $\phi(\cdot)$ to a feature space ($\phi: \mathbb{R}^d \rightarrow \mathbb{R}^H$, $d \le H$). The SVC solves
$$\min_{w, \xi_i, b}\ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\xi_i$$
subject to
$$y_i(\phi^T(x_i)w + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0,$$
where $w$ and $b$ define the linear classifier in the feature space (nonlinear in the input space, unless $\phi(x) = x$) and $C$ is the penalty applied over training errors. This problem is equivalent to the following unconstrained problem, in which we need to minimize
$$L_P(w, b) = \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n} L(u_i) \qquad (2.1)$$
with respect to $w$ and $b$, where $u_i = 1 - y_i(\phi^T(x_i)w + b)$ and $L(u) = \max(u, 0)$. To prove the convergence of the algorithm, we need $L_P(w, b)$ to be both continuous and differentiable; therefore, we replace $L(u)$ by a smooth approximation,
$$L(u) = \begin{cases} 0, & u < 0 \\ Ku^2/2, & 0 \le u < 1/K \\ u - 1/(2K), & u \ge 1/K. \end{cases}$$
The gradient of $L_P(w, b)$ vanishes at its minimum $(w^*, b^*)$:
$$\nabla L_P = \begin{bmatrix} w - C\sum_{i=1}^{n} \frac{dL(u)}{du}\Big|_{u_i} \phi(x_i) y_i \\ -C\sum_{i=1}^{n} \frac{dL(u)}{du}\Big|_{u_i} y_i \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}. \qquad (2.2)$$
At each iteration $k$, the IRWLS procedure replaces $L(u_i)$ in equation 2.1 by a weighted quadratic approximation constructed at the current solution $(w^k, b^k)$,
$$L_P'(w, b) = \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\left[L(u_i^k) + \frac{dL(u)}{du}\Big|_{u_i^k}\frac{u_i^2 - (u_i^k)^2}{2u_i^k}\right] = \frac{1}{2}\|w\|^2 + \frac{1}{2}\sum_{i=1}^{n} a_i u_i^2 + \text{constant}, \qquad (2.3)$$
where $a_i = \frac{C}{u_i^k}\frac{dL(u)}{du}\big|_{u_i^k}$. Its minimizer $(w^s, b^s)$ is found by solving a least-squares problem, and the next solution is obtained as $z^{k+1} = z^k + \eta^k p^k$, where $z^k = [(w^k)^T\ b^k]^T$, $p^k = [(w^s - w^k)^T\ (b^s - b^k)]^T$ is the descending direction, and $\eta^k \in (0, 1]$ is the step size. To prove convergence to a minimum, we show that this procedure fulfills the Wolfe conditions,
$$L_P(z^k + \eta^k p^k) \le L_P(z^k) + c_1 \eta^k \nabla L_P(z^k)^T p^k, \qquad (2.4)$$
$$\nabla L_P(z^k + \eta^k p^k)^T p^k \ge c_2 \nabla L_P(z^k)^T p^k, \qquad (2.5)$$
with $0 < c_1 < c_2 < 1$.

We first show that the functional strictly decreases, $L_P(z^k) > L_P(z^k + \eta^k p^k) = L_P(z^{k+1})$. We know that $L_P'(w^k, b^k) = L_P(w^k, b^k)$ and, $w^s$ and $b^s$ being the minimum of equation 2.3, $L_P'(w^k, b^k) \ge L_P'(w^s, b^s)$; equality will hold only if $w^s = w^k$ and $b^s = b^k$, due to the fact that we are solving a least-squares problem. Consequently, $L_P'(w^k, b^k) \ge L_P'(w^{k+1}, b^{k+1})$ $\forall \eta^k \in (0, 1]$, because $(w^{k+1}, b^{k+1})$ are a convex combination of $(w^k, b^k)$ and $(w^s, b^s)$ and $L_P'(w, b)$ is a convex functional, and equality will hold only if $w^{k+1} = w^k = w^s$ and $b^{k+1} = b^k = b^s$.

Now we will set $\eta^k$ to enforce that $L_P'(w^{k+1}, b^{k+1}) \ge L_P(w^{k+1}, b^{k+1})$, to guarantee that
$$L_P(w^k, b^k) = L_P'(w^k, b^k) > L_P'(w^{k+1}, b^{k+1}) \ge L_P(w^{k+1}, b^{k+1}).$$
To show that $L_P'(w^{k+1}, b^{k+1}) \ge L_P(w^{k+1}, b^{k+1})$, it is sufficient to prove that
$$L(u_i^k) + \frac{dL(u)}{du}\Big|_{u_i^k}\frac{(u_i^{k+1})^2 - (u_i^k)^2}{2u_i^k} \ge L(u_i^{k+1}) \qquad \forall i = 1, \ldots, n.$$
For $u_i^k \ge 0$, $L(u_i^k) + \frac{dL(u)}{du}\big|_{u_i^k}\frac{u_i^2 - (u_i^k)^2}{2u_i^k}$ is tangent to $L(u_i)$ at $u_i = u_i^k$, its minimum is attained at $u_i = 0$, and its minimal value is greater than or equal to zero. Therefore, in this case, $L(u_i^k) + \frac{dL(u)}{du}\big|_{u_i^k}\frac{u_i^2 - (u_i^k)^2}{2u_i^k} \ge L(u_i)$ for any $u_i \in \mathbb{R}$. We show an example for $u_i^k = 1$ in Figure 1.

For $u_i^k < 0$, we need to ensure that $L(u_i^{k+1}) \le 0$, which can be obtained only for $u_i^{k+1} \le 0$. As $(w^{k+1}, b^{k+1})$ are a convex combination of $(w^k, b^k)$ and $(w^s, b^s)$, $u_i^{k+1}$ can be greater than zero only if $u_i^s > 0$. For the samples whose $u_i^k < 0$ and $u_i^s > 0$, we will need to set $\eta_i^k \le \frac{u_i^k}{u_i^k - u_i^s}$ to ensure that $u_i^{k+1} \le 0$,
Figure 1: The dash-dotted line represents the actual SVM loss function $L(u)$. The dashed line represents the approximation to the loss function used in equation 2.3 when $u_i^k = 1$, and the solid line represents this approximation when $u_i^k < 0$.
and it can be easily checked that $0 < \eta_i^k < 1$. Then if we set
$$\eta^k = \min_{i \in S}\ \frac{u_i^k}{u_i^k - u_i^s}, \qquad (2.6)$$
where $S = \{i \mid u_i^k < 0\ \&\ u_i^s > 0\}$, we will ensure that $L_P'(w^{k+1}, b^{k+1}) \ge L_P(w^{k+1}, b^{k+1})$. In the case $S = \emptyset$, we set $w^{k+1} = w^s$ and $b^{k+1} = b^s$ (i.e., $\eta^k = 1$), which proves that $L_P(z^k + \eta^k p^k) < L_P(z^k)$.

Now we can set $c_1 \in (0, c_1^*]$ to fulfill equation 2.4, where
$$c_1^* = \frac{L_P(z^k + \eta^k p^k) - L_P(z^k)}{\eta^k \nabla L_P(z^k)^T p^k}$$
is greater than zero because $\nabla L_P(z^k)^T p^k < 0$; otherwise, $p^k$ would not be a descending direction.

Before proving the second Wolfe condition for the IRWLS, let us rewrite $L_P(w, b)$ as follows:
$$\hat{L}_P(w, b) = \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\left\{L(u_i^k) + \frac{dL(u)}{du}\Big|_{u_i^k} y_i\left[\phi^T(x_i)(w^k - w) + (b^k - b)\right]\right\},$$
and let us define
$$\tilde{L}_P(w, b) = \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\left\{L(u_i^{k+1}) + \frac{dL(u)}{du}\Big|_{u_i^{k+1}} y_i\left[\phi^T(x_i)(w^{k+1} - w) + (b^{k+1} - b)\right]\right\},$$
which is equivalent to $\hat{L}_P(w, b)$ but defined over the actual solution instead. $L_P(w, b)$ being convex, it can be readily seen that $L_P(w, b) \ge \hat{L}_P(w, b)$ and $L_P(w, b) \ge \tilde{L}_P(w, b)$ $\forall w \in \mathbb{R}^H$ and $\forall b \in \mathbb{R}$. As $(w^{k+1}, b^{k+1})$ are a convex combination of $(w^k, b^k)$ and $(w^s, b^s)$, we can rewrite $p^k = [(w^s - w^k)^T\ (b^s - b^k)]^T = [(w^{k+1} - w^k)^T\ (b^{k+1} - b^k)]^T/\eta^k$, leading in the left-hand side of equation 2.5 to
$$\begin{aligned}
\eta^k(\nabla L_P(z^{k+1})^T p^k) &= \left[(w^{k+1})^T - C\sum_{i=1}^{n}\frac{dL(u)}{du}\Big|_{u_i^{k+1}}\phi^T(x_i)y_i,\ \ -C\sum_{i=1}^{n}\frac{dL(u)}{du}\Big|_{u_i^{k+1}} y_i\right]\begin{bmatrix} w^{k+1} - w^k \\ b^{k+1} - b^k \end{bmatrix}\\
&= \|w^{k+1}\|^2 - (w^{k+1})^T w^k - C\sum_{i=1}^{n}\frac{dL(u)}{du}\Big|_{u_i^{k+1}} y_i\left[\phi^T(x_i)(w^{k+1} - w^k) + (b^{k+1} - b^k)\right]\\
&= \|w^{k+1}\|^2 - (w^{k+1})^T w^k + \frac{1}{2}\|w^k\|^2 + C\sum_{i=1}^{n} L(u_i^{k+1}) - \tilde{L}_P(w^k, b^k)\\
&= \frac{1}{2}\|w^{k+1} - w^k\|^2 + L_P(w^{k+1}, b^{k+1}) - \tilde{L}_P(w^k, b^k).
\end{aligned}$$
We now repeat the same algebraic transformations over the right-hand side of equation 2.5, leading to
$$\begin{aligned}
\eta^k(\nabla L_P(z^k)^T p^k) &= \left[(w^k)^T - C\sum_{i=1}^{n}\frac{dL(u)}{du}\Big|_{u_i^k}\phi^T(x_i)y_i,\ \ -C\sum_{i=1}^{n}\frac{dL(u)}{du}\Big|_{u_i^k} y_i\right]\begin{bmatrix} w^{k+1} - w^k \\ b^{k+1} - b^k \end{bmatrix}\\
&= (w^{k+1})^T w^k - \|w^k\|^2 + C\sum_{i=1}^{n}\frac{dL(u)}{du}\Big|_{u_i^k} y_i\left[\phi^T(x_i)(w^k - w^{k+1}) + (b^k - b^{k+1})\right]\\
&= (w^{k+1})^T w^k - \|w^k\|^2 - \frac{1}{2}\|w^{k+1}\|^2 - C\sum_{i=1}^{n} L(u_i^k) + \hat{L}_P(w^{k+1}, b^{k+1})\\
&= -\frac{1}{2}\|w^{k+1} - w^k\|^2 - L_P(w^k, b^k) + \hat{L}_P(w^{k+1}, b^{k+1}).
\end{aligned}$$
We now show that
$$\frac{\frac{1}{2}\|w^{k+1} - w^k\|^2 + L_P(w^{k+1}, b^{k+1}) - \tilde{L}_P(w^k, b^k)}{\eta^k} > \frac{-\frac{1}{2}\|w^{k+1} - w^k\|^2 - L_P(w^k, b^k) + \hat{L}_P(w^{k+1}, b^{k+1})}{\eta^k},$$
which is equivalent to
$$\|w^{k+1} - w^k\|^2 + \left[L_P(w^{k+1}, b^{k+1}) - \hat{L}_P(w^{k+1}, b^{k+1})\right] + \left[L_P(w^k, b^k) - \tilde{L}_P(w^k, b^k)\right] > 0,$$
because $\eta^k \in (0, 1]$. The terms $L_P(w^{k+1}, b^{k+1}) - \hat{L}_P(w^{k+1}, b^{k+1})$ and $L_P(w^k, b^k) - \tilde{L}_P(w^k, b^k)$ are equal to or greater than zero because the loss function is convex. Moreover, $\|w^{k+1} - w^k\|^2 \ge 0$, and it is zero only if $w^{k+1} = w^k$. Therefore, if we are not at the solution, $\nabla L_P(z^{k+1})^T p^k > \nabla L_P(z^k)^T p^k$. Now we can set $c_2 \in [c_2^*, 1)$¹ to fulfill equation 2.5, where $c_2^* = \frac{\nabla L_P(z^{k+1})^T p^k}{\nabla L_P(z^k)^T p^k}$ is less than one because $\nabla L_P(z^k)^T p^k < 0$; otherwise, $p^k$ would not be a descending direction.

We now need to prove that the proposed algorithm stops when the gradient of $L_P(w, b)$ vanishes. The Zoutendijk condition (Nocedal & Wright, 1999) tells us that if $L_P(w, b)$ is bounded below and is Lipschitz continuous,² and the optimization procedure fulfills the Wolfe conditions, then $\|\nabla L_P(w^k, b^k)\|^2\cos^2\theta_k \rightarrow 0$ as $k \rightarrow \infty$, where $\cos\theta_k = \frac{-\nabla L_P(w^k, b^k)^T p^k}{\|\nabla L_P(w^k, b^k)\|\,\|p^k\|}$. If we prove that $\theta_k$ does not tend to $\pi/2$ as $k \rightarrow \infty$, we will have proven that the gradient of $L_P(w, b)$ vanishes and that the proposed algorithm converges to a minimum. Finally, we would need to prove that the achieved solution corresponds to the SVM solution, which we will prove first.

The minimum of equation 2.3 is obtained by solving the following linear system:
$$\begin{bmatrix} w - \sum_{i=1}^{n}\phi(x_i) y_i a_i\left(1 - y_i(\phi^T(x_i)w + b)\right) \\ -\sum_{i=1}^{n} y_i a_i\left(1 - y_i(\phi^T(x_i)w + b)\right) \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}. \qquad (2.7)$$

¹ If $c_2^* < 0$, the minimum value $c_2$ can take is $c_1$, and if $c_1^* > 1$, the highest value $c_1$ can take is $c_2$, but this does not affect the given proof.
² $L_P(w, b)$ is equal to or greater than zero, and it is Lipschitz continuous, because we made it differentiable.
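Before moving on to the implementation, the tangent-bound property at the heart of the sufficient-decrease argument (illustrated in Figure 1) is easy to check numerically. This sketch is an addition for illustration, not part of the article; K = 10 and the evaluation grid are arbitrary choices:

```python
import numpy as np

K = 10.0  # smoothing constant (chosen arbitrarily for illustration)

def L(u):
    """Smooth approximation to the hinge loss max(u, 0) from section 2."""
    u = np.asarray(u, dtype=float)
    return np.where(u < 0, 0.0,
                    np.where(u < 1 / K, K * u ** 2 / 2, u - 1 / (2 * K)))

def dL(u):
    u = np.asarray(u, dtype=float)
    return np.where(u < 0, 0.0, np.where(u < 1 / K, K * u, 1.0))

def quad_bound(u, uk):
    """Quadratic approximation used in equation 2.3, tangent to L at uk > 0."""
    return L(uk) + dL(uk) * (u ** 2 - uk ** 2) / (2 * uk)

u = np.linspace(-3.0, 3.0, 1201)
# Minimum of (quadratic bound - loss) over the grid, for several tangent points.
gap = {uk: np.min(quad_bound(u, uk) - L(u)) for uk in (0.05, 0.5, 1.0, 2.0)}
```

All entries of gap come out nonnegative (up to rounding), confirming that the quadratic approximation upper-bounds the smooth loss for $u_i^k > 0$, which is what guarantees $L_P'(w^{k+1}, b^{k+1}) \ge L_P(w^{k+1}, b^{k+1})$.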
The IRWLS procedure stops when $w^s = w^k$ and $b^s = b^k$; if we replace them in equation 2.7, recalling that $a_i = \frac{C}{u_i^k}\frac{dL(u)}{du}\big|_{u_i^k}$ and that $u_i^s = u_i^k$ at such a point, we are led to
$$\begin{bmatrix} w^s - \sum_{i=1}^{n}\phi(x_i) y_i \frac{C}{u_i^k}\frac{dL(u)}{du}\Big|_{u_i^k}\left(1 - y_i(\phi^T(x_i)w^s + b^s)\right) \\ -\sum_{i=1}^{n} y_i \frac{C}{u_i^k}\frac{dL(u)}{du}\Big|_{u_i^k}\left(1 - y_i(\phi^T(x_i)w^s + b^s)\right) \end{bmatrix} = \begin{bmatrix} w^s - C\sum_{i=1}^{n}\frac{dL(u)}{du}\Big|_{u_i^s}\phi(x_i) y_i \\ -C\sum_{i=1}^{n}\frac{dL(u)}{du}\Big|_{u_i^s} y_i \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \qquad (2.8)$$
which is equal to equation 2.2. Consequently, the IRWLS algorithm stops when it has reached the SVM solution.

To prove the sufficient condition, we need to show that if $w^k = w^*$ and $b^k = b^*$, the IRWLS has stopped. Suppose it has not; then we can find $w^s \ne w^k$ and $b^s \ne b^k$ such that $L_P(w^k, b^k) > L_P(w^s, b^s)$, and the strictly decreasing property will lead to $L_P(w^*, b^*) > L_P(w^s, b^s)$, which is a contradiction because $w^*$ and $b^*$ give the minimum of $L_P(w, b)$. We have just proven that if the IRWLS has stopped, we will be at the SVM solution, and if we are at the SVM solution, the IRWLS has stopped.

Finally, we do not need to prove that $\theta_k$ does not tend to $\pi/2$, because we have just shown that the algorithm stops iff we are at the SVM solution and, consequently, this is the point at which the gradient of $L_P(w, b)$ vanishes. That was what we still needed to prove, ending the proof of convergence.

3 Iterative Reweighted Least Squares for Support Vector Classifiers

The IRWLS procedure, when introduced in Pérez-Cruz et al. (1999), did not consider the modification to ensure convergence presented in the previous section (i.e., $\eta^k < 1$ in some iterations). We will now describe the algorithmic implementation of the procedure. But before presenting the algorithm, let us rewrite equation 2.7 in matrix form,
$$\begin{bmatrix} \Phi^T D_a \Phi + I & \Phi^T D_a \mathbf{1} \\ \mathbf{1}^T D_a \Phi & \mathbf{1}^T D_a \mathbf{1} \end{bmatrix}\begin{bmatrix} w \\ b \end{bmatrix} = \begin{bmatrix} \Phi^T D_a y \\ \mathbf{1}^T D_a y \end{bmatrix}, \qquad (3.1)$$
where $\Phi = [\phi(x_1), \phi(x_2), \ldots, \phi(x_n)]^T$, $y = [y_1, \ldots, y_n]^T$, $a = [a_1, \ldots, a_n]^T$, $(D_a)_{ij} = a_i\delta_{ij}$ ($\forall i, j = 1, \ldots, n$), $I$ is the identity matrix, and $\mathbf{1}$ is a column vector of $n$ ones. This system can be solved using kernels, as well as the regular SVM, by imposing that $w = \sum_i \phi(x_i) y_i \alpha_i$ and $\sum_i \alpha_i y_i = 0$. These conditions can be obtained from the regular SVM solution (KKT conditions; see Schölkopf & Smola, 2001, for further details). Also, they can be derived from equation 2.2, in which the $\alpha_i$ have replaced the derivative of $L(u_i)$. The system in equation 3.1 becomes
$$\begin{bmatrix} H + D_a^{-1} & \mathbf{1} \\ y^T & 0 \end{bmatrix}\begin{bmatrix} \alpha \\ b \end{bmatrix} = \begin{bmatrix} y \\ 0 \end{bmatrix}, \qquad (3.2)$$
where $(H)_{ij} = k(x_i, x_j) = \phi^T(x_i)\phi(x_j)$ and $k(\cdot, \cdot)$ is the kernel of the nonlinear transformation $\phi(\cdot)$ (Schölkopf & Smola, 2001). The steps to derive equation 3.2 from equation 3.1 can be found in Pérez-Cruz et al. (2001).

The IRWLS can be summarized in the following steps:

1. Initialization: set $k = 0$, $\alpha^0 = 0$, $b^0 = 0$, and $u_i^0 = 1$ $\forall i = 1, \ldots, n$.
2. Solve equation 3.2 to obtain $\alpha^s$ and $b^s$.
3. Compute $u_i^s$. Construct $S = \{i \mid u_i^k < 0\ \&\ u_i^s > 0\}$. If $S = \emptyset$, set $\alpha^{k+1} = \alpha^s$ and $b^{k+1} = b^s$ and go to step 5.
4. Compute $\eta^k$, using equation 2.6, and $\alpha^{k+1}$ and $b^{k+1}$. If $L(\alpha^{k+1}, b^{k+1}) > L(\alpha^s, b^s)$, set $\alpha^{k+1} = \alpha^s$ and $b^{k+1} = b^s$.
5. Set $k = k + 1$ and go to step 2 until convergence.

The modification in the third step helps to further decrease the SVM functional. The value of $\eta^k$ in equation 2.6 is a sufficient condition but not a necessary one, and in some cases, $\alpha^s$ and $b^s$ can produce a further decrease in $L(\alpha, b)$ than using $\alpha^{k+1}$ and $b^{k+1}$.

It is worth pointing out that the solution achieved in the first step coincides with the least-squares support vector machine solution (Suykens, Van Gestel, De Brabanter, De Moor, & Vandewalle, 2003), as $D_a$ is the identity matrix multiplied by $C$. In a way, we can say that the starting point of the IRWLS procedure is the LS-SVM solution.

4 Comments on the IRWLS Procedure

In this article, we have proven the convergence of the IRWLS procedure to the SVM solution. This algorithm was devised for solving SVMs and presents several properties that make it desirable. First, the IRWLS algorithm, as its name indicates, needs to solve only a simple least-squares problem in each iteration. Moreover, the linear system is formed only by the samples whose $u_i^k > 0$, while those samples whose $u_i^k < 0$ will not affect the functional $L_P'(w, b)$ in equation 2.3. This property allows working with only part of the kernel matrix, significantly reducing the run-time complexity.

Second, during the first iterations, which are the most computationally costly, many samples change the value of $u_i$ from positive to negative. Moreover, if $\eta^k < 1$, a sample whose $u_i^k < 0$ ends with $u_i^{k+1} = 0$, and in the next iteration its $a_i = KC$. Therefore, if any sample has $u_i^* \ge 0$ but in some
intermediate iteration $u_i^k < 0$, the algorithm will recover it. The IRWLS changes several constraints from active to inactive in one iteration, while most QP algorithms stop when a constraint changes. We illustrate this property with a simple example in Figure 2.

Third, we point out that the value of $\eta^k$ is 1 in most iterations; only seldom does a sample change from $u_i^k < 0$ to $u_i^s > 0$ with $L(w^{k+1}, b^{k+1}) \le L(w^s, b^s)$, which explains why the IRWLS was working correctly without this modification.

The role of $K$ can be analyzed from the proof and implementation perspectives. A finite value of $K$ allows demonstrating that the IRWLS procedure converges to the SVM solution. If $K$ were infinite, the functional would not be differentiable, and although equations 2.2 and 2.8 would be equal, that would not mean that $w^{op}$ ($b^{op}$) is equal to $w^*$ ($b^*$). From the implementation viewpoint, we are adding at least $1/K$ to the diagonal of $H$; therefore, if $H$ is nearly singular (as it is for most problems of interest, even for infinite VC dimension kernels), a finite value of $K$ avoids numerical instabilities. We usually fix $K$ between $10^4$ and $10^{10}$, depending on the machine precision,
Figure 2: (a–e) The intermediate solutions of the IRWLS procedure for a simple problem, respectively, for iterations 1, 2, 3, 8, and 18 (final). (f) The value of $L_P$ for every iteration. The squares represent the negative-class samples and the circles the positive-class samples. The solid circles and squares are those samples whose $u_i^{k+1} > 0$. The solid line is the classification boundary, and the dashed lines represent the $\pm 1$ margins. It can be seen that in the first step, almost half of the samples change from $u_i^0 > 0$ to $u_i^1 < 0$, significantly advancing toward the optimum and reducing the complexity of subsequent iterations. The solution in iteration 8 is almost equal to the solution in iteration 18; in these intermediate iterations, the algorithm is only fine-tuning the values of $\alpha_i$.
and do not modify it during training. For this range of $K$, the solution does not vary significantly, meaning that we are close to or at the SVM solution. A software package in MATLAB can be downloaded from our web page (http://www.gatsby.ucl.ac.uk/~fernando).

4.1 Extending the IRWLS Procedure. We have proven the convergence of the IRWLS procedure to the standard SVM solution. This procedure is sufficiently general to be applied to other loss functions. It can be directly applied to any convex, continuous, and differentiable loss function. As we rely on the derivative of the loss to demonstrate that the limiting point of the IRWLS procedure is the solution to the actual functional, we need the first-order Taylor expansion to be a lower bound on the loss function to show the sufficient decreasing property. To complete the IRWLS procedure, one needs to come up with a quadratic approximation, which has to be at least locally an upper bound on the loss function to ensure the strictly decreasing property. This upper bound has to take the same value and derivative that the loss function does at the actual solution to ensure that the sequence of solutions converges to the solution of our functional.

The question that readily arises is whether it can be employed for nonconvex loss functions. First, if it is used at all, it would lead only to a local minimum; another method would be needed to assess the quality of the obtained solution. If the loss function is not convex, the sufficient decreasing property would not hold for every possible solution found by the second step of the IRWLS procedure, and further analysis would be necessary, taking into account the shape of the nonconvex function, to demonstrate the convergence of the algorithm.

Acknowledgments

We kindly thank Chih-Jen Lin for his useful comments and for pointing out the weakness of our previous proofs.

References

Burges, C. J. C. (1998). A tutorial on support vector machines for pattern recognition.
Knowledge Discovery and Data Mining, 2(2), 121–167.
Mangasarian, O. L., & Musicant, D. R. (2000). Lagrangian support vector machines. Journal of Machine Learning Research, 1, 161–177.
Nocedal, J., & Wright, S. J. (1999). Numerical optimization. New York: Springer.
Pérez-Cruz, F., Alarcón-Diana, P. L., Navia-Vázquez, A., & Artés-Rodríguez, A. (2000). Fast training of support vector classifiers. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13. Cambridge, MA: MIT Press.
Pérez-Cruz, F., Navia-Vázquez, A., Alarcón-Diana, P. L., & Artés-Rodríguez, A. (2000). An IRWLS procedure for SVR. In Proceedings of the EUSIPCO'00. Tampere, Finland.
Pérez-Cruz, F., Navia-Vázquez, A., Alarcón-Diana, P. L., & Artés-Rodríguez, A. (2001). SVC-based equalizer for burst TDMA transmissions. Signal Processing, 81(8), 1681–1693.
Pérez-Cruz, F., Navia-Vázquez, A., Rojo-Álvarez, J. L., & Artés-Rodríguez, A. (1999). A new training algorithm for support vector machines. In Proceedings of the Fifth Bayona Workshop on Emerging Technologies in Telecommunications (pp. 116–120). Baiona, Spain.
Schölkopf, B., & Smola, A. (2001). Learning with kernels. Cambridge, MA: MIT Press.
Suykens, J. A. K., Van Gestel, T., De Brabanter, J., De Moor, B., & Vandewalle, J. (2003). Least squares support vector machines. Singapore: World Scientific.
Suykens, J. A. K., & Vandewalle, J. (1999). Least squares support vector machine classifiers. Neural Processing Letters, 9(3), 293–300.
Van Gestel, T., Suykens, J. A. K., Baesens, B., Vanthienen, S., Dedene, G., De Moor, B., & Vandewalle, J. (2004). Benchmarking least squares support vector machines classifiers. Machine Learning, 54(1), 5–32.
Vapnik, V. N. (1998). Statistical learning theory. New York: Wiley.

Received January 7, 2004; accepted May 25, 2004.
LETTER
Communicated by Bruno Olshausen
Efficient Coding of Time-Relative Structure Using Spikes

Evan Smith
[email protected] Department of Psychology, Center for the Neural Basis of Cognition, Carnegie Mellon University, Pittsburgh, PA 15213, U.S.A.
Michael S. Lewicki
[email protected] Department of Computer Science, Center for the Neural Basis of Cognition, Carnegie Mellon University, Pittsburgh, PA 15213, U.S.A.
Nonstationary acoustic features provide essential cues for many auditory tasks, including sound localization, auditory stream analysis, and speech recognition. These features can best be characterized relative to a precise point in time, such as the onset of a sound or the beginning of a harmonic periodicity. Extracting these types of features is a difficult problem. Part of the difficulty is that with standard block-based signal analysis methods, the representation is sensitive to the arbitrary alignment of the blocks with respect to the signal. Convolutional techniques such as shift-invariant transformations can reduce this sensitivity, but these do not yield a code that is efficient, that is, one that forms a nonredundant representation of the underlying structure. Here, we develop a non-block-based method for signal representation that is both time relative and efficient. Signals are represented using a linear superposition of time-shiftable kernel functions, each with an associated magnitude and temporal position. Signal decomposition in this method is a nonlinear process that consists of optimizing the kernel function scaling coefficients and temporal positions to form an efficient, shift-invariant representation. We demonstrate the properties of this representation for the purpose of characterizing structure in various types of nonstationary acoustic signals. The computational problem investigated here has direct relevance to neural coding at the auditory nerve and to the more general issue of how to encode complex, time-varying signals with a population of spiking neurons.

Neural Computation 17, 19–45 (2005) © 2004 Massachusetts Institute of Technology

1 Introduction

Nonstationary and time-relative acoustic structures such as transients, timing relations among acoustic events, and harmonic periodicities provide essential cues for many types of auditory processing. In sound localization,
human subjects can reliably detect interaural time differences as small as 10 µs, which corresponds to a binaural sound source shift of about 1 degree (Blauert, 1997). In comparison, the sampling interval for an audio CD sampled at 44.1 kHz is 22.7 µs. Auditory grouping cues, such as common onset and offset, harmonic comodulation, and sound source location, all rely on accurate representation of timing and periodicity (Slaney & Lyon, 1993). Time-relative structure is also crucial for the recognition of consonants and many types of transient, nonstationary sounds. Neurophysiological research in the auditory brainstem of mammals has found cells capable of conveying precise phase information up to 4 kHz or of tracking the quickly varying envelope of a high-frequency sound (Oertel, 1999). The importance of these acoustic cues has long been recognized, but extracting them from natural signals still poses many challenges because the problem is fundamentally ill posed. In natural acoustic environments, with multiple sound sources and background noises, acoustic events are not directly observable and must be inferred from numerous ambiguous cues. Another reason these cues are difficult to obtain is that most approaches to signal representation are block based: the signal is processed piecewise in a series of discrete blocks. Transients and nonstationary periodicities in the signal can be temporally smeared across blocks. Large changes in the representation of an acoustic event can occur depending on the arbitrary alignment of the processing blocks with events in the signal. Signal analysis techniques such as windowing or the choice of transform can reduce these effects, but it would be preferable if the representation were insensitive to signal shifts. Shift invariance alone, however, is not a sufficient constraint on the design of a general sound processing algorithm.
Another important constraint is coding efficiency or, equivalently, the ability of the representation to capture underlying structure in the signal. A desirable code should reduce the information rate from the raw signal so that the underlying structures are more directly observable. Signal processing algorithms can be viewed as methods for progressively reducing the information rate until one is left with only the information of interest. We can make a distinction between the observable information rate, or the rate of the observable variables, and the intrinsic information rate, or the rate of the underlying structure of interest. In speech, the observable information rate of the waveform samples is about 50,000 bits per second, but the intrinsic rate of the underlying words is only around 200 bits per second (Rabiner & Levinson, 1981). Information reduction can be achieved either by selecting only the desired information (and discarding everything else) or by removing redundancy, such as the temporal correlations between samples. This reduces the observable information rate while preserving the intrinsic information. In this letter, we investigate algorithms for fitting an efficient, shift-invariant representation to natural sound signals. The outline of the letter is as follows. The next section describes the motivations behind this approach
and illustrates some of the shortcomings of current methods. After defining the model for signal representation, we present different algorithms for signal decomposition and contrast their complexity. Next, we illustrate the properties of the representation on various types of speech sounds. We then present a measure of coding efficiency and compare these algorithms to traditional methods for signal representation. Finally, we discuss the relevance of the computational issues raised here to spike coding and signal representation at the auditory nerve.

2 Representing Nonstationary Acoustic Structure

Encoding the acoustic signal is the first step in any algorithm for performing an auditory task. There are numerous approaches to this problem, which differ both in their computational complexity and in what aspects of signal structure they extract. Ultimately, the choice of what the representation encodes depends on the tasks that need to be performed. In the ideal case, the encoding process extracts only the information necessary to perform the task and suppresses noise or unrelated information. A generalist approach, like that taken by most mammalian auditory systems, requires a representation that is efficient for a wide range of signals. As natural sounds contain both relatively stationary harmonic structure (e.g., animal vocalizations) and nonstationary transient structure (e.g., crunching leaves and twigs), a generalist approach requires a code capable of efficiently representing these disparate sound classes (Lewicki, 2002a). Here we seek an auditory representation that is useful for a variety of different tasks.

2.1 Block-Based Representations. Most approaches to signal representation are block based: signal processing takes place on a series of overlapping, discrete blocks.
This not only obscures transients and periodicities in the signal but can also have the effect that, for nonstationary signals, small time shifts produce large changes in the representation, depending on whether and where a particular acoustic event falls within a block. Figure 1 illustrates the sensitivity of a block-based representation to small shifts of a speech signal. The upper panel shows a short speech waveform sectioned into blocks using two sequences of Hamming windows (solid and dashed curves). Each window spans approximately 30 msecs (512 samples), and successive blocks (A1, A2, and so on) are shifted by 10 msecs. The B blocks are offset from the A blocks by an amount indicated by the dot-dash vertical lines (∼5 msecs), representing the arbitrary alignment of the signal with respect to the two block sequences. The lower panel shows spectral representations for the three corresponding blocks (solid for the A blocks, dashed for the B blocks). The jagged upper curves show the power spectra for each windowed waveform. The smooth lower curves (offset by −20 dB) show the spectrum of the optimal filter derived by linear predictive coding.
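The block framing just described can be sketched in code. The following Python example frames one signal on two Hamming-window grids offset by about 5 ms; the synthetic transient, sampling rate, and block counts are illustrative assumptions, not the data used in Figure 1. The energy captured by the block containing the transient depends on the arbitrary grid alignment, which is the source of the spectral differences shown in the figure:

```python
import math

RATE = 16000                  # samples per second (assumed)
N = 512                       # window length (~32 ms at 16 kHz)
HOP = RATE // 100             # 10 ms block shift (160 samples)
OFFSET = RATE // 200          # ~5 ms offset between the A and B grids

# Hamming window, as used for the blocks in Figure 1.
hamming = [0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]

# One second of silence plus a brief high-frequency burst standing in
# for the /t/ transient (an illustrative synthetic signal).
signal = [0.0] * RATE
for n in range(600, 632):
    signal[n] = math.sin(2 * math.pi * 4000 * n / RATE)

def block_energy(start):
    """Energy of one Hamming-windowed block beginning at sample `start`."""
    return sum((signal[start + n] * hamming[n]) ** 2 for n in range(N))

# The same signal framed on two block grids (A and B) offset by ~5 ms.
a_energies = [block_energy(k * HOP) for k in range(3)]
b_energies = [block_energy(k * HOP + OFFSET) for k in range(3)]

# The block containing the transient captures a different amount of its
# energy under each alignment, so its spectral estimate would differ too.
print(a_energies[1] != b_energies[1])
```

Because the burst falls under different parts of the window on the two grids, the second block's energy (and hence its spectrum) changes even though the signal is identical.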
Figure 1: Block-based representations are sensitive to temporal shifts. (Top panel) A speech waveform with two sets of overlaid Hamming windows, A1–3 (continuous lines above the waveform) and B1–3 (dashed lines below the waveform). (Lower panels) The power spectrum (jagged) and LPC spectrum (smooth) of Hamming windows offset by less than 5 ms are overlaid (A, continuous; B, dashed). In each, small shifts (e.g., from A2 to B2) can lead to large changes in the representation.
The sound used in Figure 1 is /et/ in the context of the word Vietnamese. The three-block sequence contains an abrupt, transient signal feature: a relatively high-frequency, high-amplitude /t/ sound occurring at about the 38th msec. The windows preceding the /t/, A1 and B1, contain only the /ee/ vowel waveform. The spectra of these windows nearly overlap, although differences resulting from the slow change in the vowel can be seen. The spectra for windows A2 and B2 show a dramatic difference in the range of 3 to 5 kHz. This results entirely from the arbitrary alignment of each window with the /t/. Because window B2 contains a significant portion of the /t/ waveform, it shows a pronounced increase in power in the higher range. The spectra for the following windows, A3 and B3, are again nearly overlapping, as the /t/ is well represented in both windows. Notice that the increase in power for window B2 is not as great as that for the final windows. This implies that the alignment of the B sequence will cause a temporal smearing of the consonant onset, spreading the energy of the 2 msec transient over a 10 msec window. Discrimination of phonemes such as /ba/ and /pa/ is based on differences as small as 5 to 10 msecs in voice-onset time (Liberman, Delattre, & Cooper, 1958). The temporal smearing illustrated in Figure 1 could create an ambiguity in the onset of voicing and lead to an alteration in the phoneme perception of a listener.

Figure 2: A continuous filter bank produces a shift-invariant representation but does not reduce the information rate. An input signal (A) is convolved with a filter bank (B). The output of the convolution (C) has increased the dimensionality of the input signal.

2.2 Convolutional Representations. One way to minimize the shift sensitivity problem is to increase the block rate. This reduces the variability of the observed spectra but results in a very inefficient code, because there are then several slowly changing representations of the same underlying acoustic events. In the limit, increasing the block rate simply produces a filter bank in which windowed sinusoids are convolved with the signal. Although this yields a representation that is invariant to shifts, a major drawback is that a filter bank does not reduce the information rate: the dimensionality of each filter output is identical to that of the input, and there is one such output for each filter. This problem is illustrated in Figure 2. The speech waveform in the top row of Figure 2A (the /et/ in Vietnamese, identical to that used in Figure 1) is convolved with each of the three time-domain filters shown in the right column (see Figure 2B). The filters are Gabor functions with peak resonance frequencies at the first and second formants (360 and 2750 Hz) and at 4000 Hz. The filter outputs (see Figure 2C) show that the formant energy is roughly constant throughout the sound, while energy in the /t/ is relatively localized. Clearly, it would be preferable to have an efficient representation that is insensitive to signal shift, preserving transients and harmonic structure, but that encodes structure in an event-based fashion.

3 A Sparse, Shiftable Kernel Representation

Here we employ a sparse, shiftable kernel method of signal representation (Lewicki & Sejnowski, 1999; Lewicki, 2002b). In this model, the signal x(t) is encoded with a set of kernel functions, φ_1, ..., φ_M, that can be positioned arbitrarily and independently in time. The mathematical form of the representation with additive noise is

x(t) = Σ_{m=1}^{M} Σ_{i=1}^{n_m} s_i^m φ_m(t − τ_i^m) + ε(t),  (3.1)
where τ_i^m and s_i^m are the temporal position and coefficient of the ith instance of kernel φ_m, respectively. The notation n_m indicates the number of instances of φ_m, which need not be the same across kernels. In addition, the kernels are not restricted in form or length. A more general way to express equation 3.1 is to assume that the kernel functions exist at all time points during the signal and let the nonzero coefficients determine the positions of the kernel functions. In this case, the model can be expressed in convolutional form,

x(t) = Σ_m ∫ s_m(τ) φ_m(t − τ) dτ + ε(t),  (3.2)

where s_m(τ) is the coefficient at time τ for φ_m. By using a sparse coefficient signal s_m(t) composed only of delta functions, equation 3.2 reduces to equation 3.1. A similar approach assuming only sparse coefficients has been used for coding of natural movies (Olshausen, 2002).

The key theoretical abstraction of the model is that the signal is decomposed in terms of discrete acoustic events, represented by the kernel functions, each of which has a precise amplitude and temporal position. Here we assume the kernels are gammatone functions (gamma-modulated sinusoids) whose center frequency and width are set according to an equivalent rectangular band (ERB) filter bank cochlear model using Slaney's auditory toolbox for Matlab (Slaney, 1998). Except where noted, we used a set of 64 kernel functions for the results below. The use of gammatone functions is well motivated by both biology and natural sound statistics (Lewicki, 2002a). In principle, we could also adapt the set of kernel functions to maximize the efficiency of the code.

Figure 3 illustrates the generative model. A signal is represented in terms of a sparse set of discrete temporal events, a spike code. For example, the waveform in Figure 3A consists of three aperiodic “chirps,” each composed of discrete acoustic events with differing amplitudes but identical relative temporal alignments. This signal can be represented by nine spikes, each with a precise time and amplitude. We can plot this representation as a spikegram, Figure 3B, where the nine spikes are shown as ovals of varying size, intensity, and position. Each oval indicates the temporal and spectral position (center of mass and center frequency, respectively) of one gammatone kernel function, with oval size and intensity indicating the amplitude of the kernel coefficient. Representing a kernel's temporal position by its center of mass causes the kernels to align precisely given a delta function as input. We adopt this convention to help illustrate the temporal precision of the spike code.
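The generative side of equation 3.1 is simple to sketch. In the following Python example, the gammatone parameters (center frequencies, bandwidths, kernel length) are illustrative assumptions rather than the ERB values used in this letter; the point is only the mechanics of summing amplitude-scaled, time-shifted, unit-norm kernels:

```python
import math

def gammatone(freq_hz, bandwidth_hz, length, rate, order=4):
    """Gamma-envelope-modulated sinusoid, normalized to unit L2 norm."""
    k = [(n / rate) ** (order - 1)
         * math.exp(-2 * math.pi * bandwidth_hz * n / rate)
         * math.cos(2 * math.pi * freq_hz * n / rate)
         for n in range(length)]
    norm = math.sqrt(sum(v * v for v in k))
    return [v / norm for v in k]

def synthesize(spikes, kernels, n_samples):
    """Noiseless equation 3.1: x(t) = sum over spikes of s * phi_m(t - tau)."""
    x = [0.0] * n_samples
    for m, tau, s in spikes:            # (kernel index, onset sample, amplitude)
        for j, v in enumerate(kernels[m]):
            x[tau + j] += s * v
    return x

rate = 16000.0
kernels = [gammatone(500.0, 80.0, 256, rate),     # illustrative parameters
           gammatone(2000.0, 240.0, 256, rate)]
spikes = [(0, 100, 1.0), (1, 400, 0.5), (0, 900, 0.25)]   # a three-spike code
x = synthesize(spikes, kernels, n_samples=1600)
```

A spike code here is just a list of (kernel index, onset sample, amplitude) triples; decoding is linear, while encoding, that is, choosing the triples, is the nonlinear step addressed by the algorithms in section 3.1.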
Figure 3: An illustration of the generative model and its spikegram representation. The signal (A) is represented in the spikegram (B) as a set of ovals whose size and intensity indicate the amplitude of the spike. The position of each oval indicates the kernel center frequency (CF, y-axis) and timing (x-axis). The gammatone functions corresponding to the spikes (represented by each oval) are overlaid in gray.
3.1 Encoding Algorithms. Equation 3.1 specifies the generative form of the model but does not provide an encoding algorithm, that is, a way to compute the optimal values of τ_i^m and s_i^m for a given signal. The computational objective is to minimize the error ε(t) while maximizing coding efficiency. As is the case with most coding algorithms, there is a trade-off between the error of the representation and the computational complexity of the algorithm. For the results here, we used three different encoding algorithms to select values for τ_i^m and s_i^m. These show a clear trade-off between complexity and accuracy, but we can gain some flexibility along these dimensions by hybridizing, using the simpler algorithms to initialize the most complex.

3.1.1 Filter Threshold. One approach to efficient audio coding has been to use filter banks based on the human cochlea (Baumgarte, 2002; Lyon, 1982; Shamma, 1985; Ghitza, 1988; Patterson, Holdsworth, Nimmo-Smith, & Rice, 1988). The filter-threshold algorithm is a computationally simple approximation of cochlear processing. It is a causal approach, and it begins by convolving the signal with the full set of kernel functions from the gammatone ERB filter bank. (Note that for all of the algorithms described here, the kernels are restricted to have unit norm.) The encoded coefficients and times, s_i^m and τ_i^m, are chosen from the values and positions of all convolution peaks that exceed a preset threshold. This greatly reduces the observable information rate compared to the convolutional representation, but some degree of (threshold-dependent) temporal and spectral redundancy remains. Filter banks with more than 16 gammatone kernel functions are highly overcomplete, but the filter-threshold algorithm does not take the correlations between kernel functions into account during coding. As a result, it tends to give a poor estimate of the signal under our linear superposition model. We compensate for this to some degree by adding a single parameter to scale the coefficients. Despite its shortcomings under our model, filter threshold is relatively fast and resilient to noise due to its inherent redundancy. These could be desirable properties depending on the task the system must perform.

3.1.2 Matching Pursuit. An obvious improvement on filter threshold would be to account explicitly for the correlations between kernels by iteratively regressing the signal onto the kernels. This is a noncausal approach, but our goal here is to determine the optimal signal representation. One well-studied formalization of this approach is the matching pursuit algorithm (Mallat & Zhang, 1993). We employ it here to produce a more efficient estimate of the τ_i^m and s_i^m values for a given signal. Our goal is to decompose the signal x(t) over a set of kernels selected from the gammatone filter bank so as to best capture the structure of the signal. Matching pursuit approaches this problem by iteratively approximating the input signal with successive orthogonal projections onto some basis (in this case, the unit-normed gammatone kernels). The signal can be decomposed into

x(t) = ⟨x(t), φ_m⟩ φ_m + R_x(t),  (3.3)

where ⟨x(t), φ_m⟩ is the inner product between the signal and the kernel and is equivalent to s_m in equation 3.1. The final term in equation 3.3, R_x(t), is the residual signal after approximating x(t) in the direction of φ_m. The projection with the largest inner product will minimize the power of R_x(t), thereby capturing the most structure possible with a single kernel. Equation 3.3 can be rewritten more generally as

R_x^n(t) = ⟨R_x^n(t), φ_m⟩ φ_m + R_x^{n+1}(t),  (3.4)

with R_x^0(t) = x(t) at the start of the algorithm. With each iteration, the current residual is projected onto the gammatones. A single kernel is selected such that

φ_m = arg max_m ⟨R_x^n(t), φ_m⟩.  (3.5)

This best-fitting projection is subtracted out, and its coefficient and time are recorded. The projection and subtraction leave ⟨R_x^n(t), φ_m⟩ φ_m orthogonal to the residual signal R_x^{n+1}(t). It is relatively straightforward to see that each projection is orthogonal to all previous and future projections (Mallat & Zhang, 1993). As a result, matching pursuit codes are composed of mutually orthogonal signal structures.
Assuming the kernels span the signal space, the power of the residual, R_x^n(t), is guaranteed to decrease on each iteration of the algorithm (Mallat & Zhang, 1993; Goodwin & Vetterli, 1999), and so, in the limit, matching pursuit codes will have arbitrarily small error. For most practical purposes, however, a halting criterion should be defined. The simplest is a lower bound on the inner product between the signal and the kernels. We can also track the signal-to-noise ratio of the code over time and stop at a desired fidelity, or halt when some number of spikes has been recorded. More sophisticated criteria are also possible.

We reduce some of the computational overhead of the algorithm by defining local neighborhoods among the kernels via cross-correlation: if the maximal inner product between two kernels across all time shifts is greater than some value θ, the kernels are included in each other's neighborhood. Typically, θ was set to 0.001 (all kernels were normalized to have an L2 norm of 1). These neighborhoods are used for reconvolution with the residual signal (i.e., if the last spike involved kernel φ_n, then a new residual is calculated only for the neighborhood around n). This can introduce very low magnitude distortion in the code, but the computational cost is significantly reduced, as most of the kernels in the filter bank are orthogonal to one another at all time shifts.
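The core loop of equations 3.3 to 3.5 can be sketched as follows. This Python example uses two short decaying-sinusoid kernels as stand-ins for the gammatone filter bank (an assumption made to keep the sketch self-contained) and omits the neighborhood optimization:

```python
import math

def unit_norm(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def make_kernel(freq, length=64, decay=0.05, rate=1000.0):
    """A decaying sinusoid, a stand-in for one gammatone kernel."""
    return unit_norm([math.exp(-decay * n) * math.sin(2 * math.pi * freq * n / rate)
                      for n in range(length)])

def inner(signal, kernel, shift):
    return sum(signal[shift + j] * kernel[j] for j in range(len(kernel)))

def matching_pursuit(signal, kernels, n_spikes):
    residual = list(signal)
    spikes = []
    for _ in range(n_spikes):
        # Project the residual onto every kernel at every shift and keep
        # the single largest projection (equations 3.4 and 3.5).
        _, m, t = max((abs(inner(residual, k, t)), m, t)
                      for m, k in enumerate(kernels)
                      for t in range(len(signal) - len(k) + 1))
        s = inner(residual, kernels[m], t)
        for j in range(len(kernels[m])):     # subtract the best projection
            residual[t + j] -= s * kernels[m][j]
        spikes.append((m, t, s))             # record kernel, time, coefficient
    return spikes, residual

# A toy signal built from two known "acoustic events".
kernels = [make_kernel(50.0), make_kernel(120.0)]
signal = [0.0] * 200
for m, t, amp in [(0, 10, 1.5), (1, 90, 0.8)]:
    for j, v in enumerate(kernels[m]):
        signal[t + j] += amp * v

spikes, residual = matching_pursuit(signal, kernels, n_spikes=2)
```

Because the two events here have disjoint support, two iterations recover both kernel indices and onset times exactly and leave a near-zero residual; with overlapping events, the greedy projections are merely orthogonal, not jointly optimal.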
To describe this approach, we begin by expressing the model in matrix form using a discrete sampling of the continuous time series: x = As + .
(3.6)
The rows of the basis matrix, A, contain each gammatone kernel replicated at each sample position making the basis highly overcomplete. The optimal set of τim and sm i for a signal is found by maximizing the posterior distribution of coefficients given the signal and the gammatones, sˆ = arg max P(s|x, A) = arg max P(x|A, s)P(s). s
s
(3.7)
We make two assumptions in modeling the distributions in equation 3.7. First, the noise, , is gaussian and so the data likelihood, P(x|A, s), is also gaussian. Second, the prior, P(s), a function of the spike times and amplitudes, is very sparse. Given these assumptions, the gradient of equation 3.7
28
E. Smith and M. Lewicki
is given by ∂ log P(s|A, x) ∝ AT (x − As) + z(s), ∂s
(3.8)
where z(s) = (log P(s)) . P(s) was assumed to follow a Laplacian, but other distributions are possible. The assumption of sparseness of the kernel coefficients means that optimizing equation 3.7 essentially selects out the minimal set of gammatones that best accounts for the structure of the sound signal for a given noise level. Although optimally efficient codes are possible in theory, in practice only the briefest sounds can be encoded in this manner. For example, using 64 kernels to encode a signal sampled at 44.1 kHz requires approximately 2.8 million coefficients to be optimized per second of signal. We can reduce most of the computational overhead by using filter threshold or matching pursuit to initialize the maximum a posteriori (MAP) optimization. Instead of optimizing over the entire parameter space, these hybrid algorithms search for optimal amplitude values, s, over a set of spike times τ selected by one of the two approximative algorithms. The departure from optimality is a function of the number and “quality” of the spike times selected by the initializing algorithm. In the results that follow, these hybrid algorithms are evaluated alongside the other algorithms as approximations of the true optimally efficient code. 4 Spike Code Signal Representation The sparse, shiftable kernel representation and a set of decomposition algorithms have now been formalized. To evaluate the model, we will present both examples of the codes it generates and an objective comparison between those and other codes. The following section contains specific examples of spike codes to illustrate its qualities as a method for signal representation and some benefits of time-relative coding. 4.1 Comparison of Encoding Algorithms. There are five possible encoding algorithms described in the previous section: filter threshold, matching pursuit, MAP optimization, optimized filter threshold, and optimized matching pursuit. 
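As an illustration of the hybrid strategy of section 3.1.3, the following Python sketch fixes the spike times (as if chosen by matching pursuit) and ascends the gradient of equation 3.8 to infer the amplitudes, using a Laplacian prior so that z(s) = −λ sign(s). The kernel shape, sizes, and step size are illustrative assumptions:

```python
import math
import random

random.seed(0)

def unit_norm(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

L, K = 64, 16
kernel = unit_norm([math.exp(-0.2 * j) * math.sin(0.9 * j) for j in range(K)])
taus = [5, 30]                         # spike times, held fixed (the hybrid step)
true_s = [1.2, -0.7]                   # amplitudes the sketch should recover

# Rows of A: the kernel placed at each fixed spike time (cf. equation 3.6).
A_rows = []
for tau in taus:
    row = [0.0] * L
    for j in range(K):
        row[tau + j] = kernel[j]
    A_rows.append(row)

# Observed signal x = A s + noise, with a small gaussian noise term.
x = [sum(s * row[t] for s, row in zip(true_s, A_rows)) + 0.001 * random.gauss(0, 1)
     for t in range(L)]

# Gradient ascent on log P(s | A, x): grad = A^T (x - A s) + z(s),
# with z(s) = -lam * sign(s) for a Laplacian prior (equation 3.8).
s = [0.0, 0.0]
lam, step = 0.01, 0.5
for _ in range(200):
    resid = [x[t] - sum(si * row[t] for si, row in zip(s, A_rows)) for t in range(L)]
    grad = [sum(row[t] * resid[t] for t in range(L))
            - lam * (1 if si > 0 else -1 if si < 0 else 0)
            for row, si in zip(A_rows, s)]
    s = [si + step * g for si, g in zip(s, grad)]
```

With only two fixed spike times, the amplitudes converge to the generating values (shrunk slightly toward zero by the sparse prior), which is the behavior described for the full problem at a much larger scale.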
Figure 4 shows the spike code for a short section of speech (three glottal pulses from the vowel /a/, sampled at 16 kHz) encoded with four of these algorithms. Even at timescales of 480 to 800 samples, full MAP optimization is computationally prohibitive, and an example of it is not presented here. These spikegrams are formatted identically to Figure 3, with ovals representing the time, center frequency, and magnitude of a spike; only the kernel function overlays are removed. For each algorithm, we measure the quality of its representation in terms of signal-to-noise ratio (SNR). To compute this, a reconstruction of the input is generated from the code, and a
residual error is computed between the original and the reconstruction. The SNR in decibels (dB) is then SNR = 10 log10(P_o/P_e), where P_o is the power of the original and P_e is the power of the residual error.

The spikegram in Figure 4A is generated using filter threshold. A high degree of redundancy in both time and frequency is evident in the correlated waves of spikes that code each glottal pulse. This redundancy may serve to enhance structural similarity between sound events (e.g., the glottal pulses) and increase the representation's resistance to noise, but it lacks a succinct description of the temporal and spectral characteristics of the sound. Filter threshold encodes the sound to 18 dB SNR (using the scaling parameter mentioned earlier). Perceptually, the input sound is noticeably distorted in the reconstruction, though the speech content is quite clear. Optimizing the filter threshold code has a dramatic effect on the quality of the encoding, pushing the SNR to 90.1 dB, well beyond the point where the original and reconstructed signals are perceptually discriminable. Given that the original .wav file had 16 bits of precision and assuming coding noise on the order of 1 bit, the estimated SNR of the original signal is about 90 dB. In the example shown in Figure 4B, we assumed a very low level of noise in the model. This results in the majority of spike amplitudes being shifted up or down but few pushed to zero; all of the available information is used to encode the signal accurately. Although few spikes are pruned under an assumption of very low noise, the distribution of spike amplitudes does become sparser as a result of optimization. As progressively higher noise levels are assumed, the resulting codes become increasingly sparse, sacrificing SNR in order to prune spikes.

Figure 4C shows an example of the spike code produced by matching pursuit. It is vastly less redundant than the filter threshold code in Figure 4A.
There is relatively little obvious structure within the representation of each glottal pulse, implying that primarily independent events are being represented; however, the similarity between pulses is evident. Despite the much more compact representation, the signal is encoded to 30.4 dB SNR, with only very subtle distortions perceivable. The code generated by a matching pursuit–MAP optimization hybrid (see Figure 4D) is nearly identical to that produced by matching pursuit alone. For a 30 dB SNR code, it is likely that the optimization simply corrects some of the error introduced by our use of kernel neighborhoods when computing residuals on each iteration. One possible reason for the limited effect of optimization is that matching pursuit codes represent a deep local minimum in the parameter space and the gradient method fails to find a global optimum. Another factor concerns the nature of signal decomposition using matching pursuit. This will be discussed further in a later section.

Figure 4: Spikegrams created from an input signal (top) using each of the four algorithms: (A) filter threshold encoded to 18.7 dB SNR (see text), (B) optimized filter threshold encoded to 90.1 dB SNR, (C) matching pursuit encoded to 30.4 dB SNR, and (D) optimized matching pursuit encoded to 33.0 dB SNR.

4.2 Convergence of Fidelity. When encoding a signal with matching pursuit, MAP optimization, or any hybrid, the SNR of the code increases monotonically with the number of spikes (this is not necessarily true of filter threshold). For the optimized codes, the amount of noise assumed in the model defines the trade-off between sparseness and accuracy. Because these codes are globally optimal, their specific form (the precise location of spikes) may be altered given different noise levels. For matching pursuit, the trade-off is much clearer: lowering the threshold for accepting a spike (or otherwise varying the halting criterion) simply adds spikes that code further residual structure.

Figure 5 shows the effect of varying the number of spikes in a matching pursuit code. The input signal is a segment of speech (the word can, sampled at 16 kHz). The spikegram in Figure 5A reflects a very high threshold, producing only 92 spikes (about 400 spikes/sec) and a relatively poor representation (10 dB SNR). Above the spikegram is the residual signal from the final iteration of the algorithm. It is apparent that a great deal of structure remains to be coded, although the onset of the consonant /k/ and the periodicity of the /a/ and /n/ are already revealed. Perceptually, the sound is strongly distorted relative to the original, but the speech content is quite clear. Figures 5B through 5D show spikegrams and residuals for signals encoded to 20 dB (1600 spikes/sec), 30 dB (3100 spikes/sec), and 40 dB SNR (5500 spikes/sec). By 30 dB (see Figure 5C), the distribution of residual amplitudes is not significantly different from a gaussian (based on the Lilliefors test for rejecting the gaussian assumption, p > 0.2).
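The SNR measure used throughout this section, SNR = 10 log10(Po/Pe), can be sketched in a few lines of Python; the toy signal and "reconstruction" below are illustrative, not the speech data discussed above:

```python
import math

def snr_db(original, reconstruction):
    """SNR = 10 log10(Po / Pe): power of the original over power of the error."""
    p_o = sum(v * v for v in original)
    p_e = sum((a - b) ** 2 for a, b in zip(original, reconstruction))
    return 10.0 * math.log10(p_o / p_e)

original = [math.sin(0.1 * n) for n in range(1000)]
reconstruction = [v + 0.001 for v in original]   # small constant reconstruction error
print(round(snr_db(original, reconstruction), 1))
```

Halving the residual power raises the SNR by about 3 dB, which is why the spike counts in Figure 5 grow quickly as the fidelity target increases.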
Figure 5: As the spike rate (spikes/sec) increases, the fidelity of the representation increases. The spikegrams above show the improvement of an optimized matching pursuit code with increasing spike rate: (A) 10 dB SNR at 400 spikes/sec, (B) 20 dB SNR at 1600 spikes/sec, (C) 30 dB SNR at 3100 spikes/sec, and (D) 40 dB SNR at 5500 spikes/sec. The residual error is plotted above each spikegram.
4.3 Effect of Kernel Number. Another parameter to be selected for any encoding algorithm is the number of kernel functions. Relatively few gammatones are needed to form a complete basis (i.e., a basis that spans the frequency space of the sounds used), but increasing the number allows greater spectral precision. Figure 6 shows the effect of using matching pursuit with 8, 16, 32, or 64 kernel functions (see Figures 6A–6D, respectively). To be certain that the four sets spanned the frequency space, they were generated independently using Slaney's Matlab toolbox (Slaney, 1998). In each case, the signal is encoded to approximately 40 dB SNR, but the form of the code changes drastically. With relatively few gammatones (see Figures 6A and 6B), the code lacks both spectral and temporal precision. The time-relative coding is largely lost, and the representation becomes nearly convolutional. Using 32 or 64 kernels more clearly segments the acoustic events and begins to show invariant signal structure. Very similar findings were obtained when testing MAP optimization and the hybrid algorithms. In contrast, increasing the number of kernels with filter threshold shows only enhanced spectral precision, as spike times are selected independently of one another.
Figure 6: The number of kernel functions affects both the spectral resolution and the temporal sparseness of the spike codes. The input signal (top) was encoded using matching pursuit with 8, 16, 32, or 64 kernel functions (A–D, respectively). The total number of spikes in each is (A) 12011, (B) 1167, (C) 497, and (D) 479.
4.4 Comparison to Spectrograms. Having described some of the details of the spike code, we now look more broadly at its representation. A comparison of a spectrogram and spikegram illustrates many properties of the model. In Figure 7, the upper plot shows the waveform of pizzerias spoken by an adult female and sampled at 16 kHz. The spikegram and spectrogram of this signal are shown in the middle and lower plots, respectively. The spikegram was constructed using an optimized filter threshold spike code with 128 ERB-spaced gammatone kernels. Both show the formant and harmonic structure of the vowels (e.g., 320–700 msec). Both also reveal the broad spectral and temporal characteristics of the signal, such as the diffuse energy of the /s/ from 700 to 800 msec. However, while the spectrogram is composed of 10 msec shifted windows (as illustrated in Figure 1), the spikegram possesses timing information precise to the sampling resolution of the original signal and retains phase information. This allows it to reveal finely grained synchronous activity across bands. It also possesses a nonlinear frequency axis based on the cochlea. This axis emphasizes the range important to human hearing and is used in many auditory models and speech "front-ends." 4.5 Sparse Representation of Transients. Though the "pizzerias" example demonstrates the large-scale features of the spike code, the fine structure
Figure 7: Three representations of the spoken word pizzerias. (A) Time-varying waveform. (B) Spikegram. (C) Spectrogram. They are presented on the same timescale (indicated at the bottom). Note that the spikegram and spectrogram use different frequency axes.
is more clearly revealed in a shorter speech segment. The waveform and spikegram of the first half of the word wealth appear in Figure 8. Here we can see the time-relative coding of nonstationary structure. One hundred msec into the word (about 45 msec from the start of the spikegram), the period between glottal pulses begins to elongate. The spike code maintains a consistent representation of the individual pulses during this period, despite the time dilation. Although there is some slight variability in the representation of each pulse (appropriately reflecting the changes in the underlying signal), the spikes essentially align with the peak of each glottal pulse. Figure 8 also shows the efficiency of the code in representing harmonic structures. The spikegram shown can reconstruct the original signal to 30 dB SNR, but it requires only a small number of spikes per pulse. Perceptually, the code contains only very subtle distortions of the original signal. Although the two are distinguishable, it is difficult to judge whether the original or reconstructed sound is the "true" signal. This demonstrates the efficiency of the spikegram with respect to nonstationary harmonic structures.

Figure 8: A spikegram shows time-relative coding in the syllable /el/.

One of the particularly desirable properties of the model is the efficient coding of transients, where precise temporal coding is most important. Distinguishing consonants in continuous speech, for example, requires the detection of rapid, broadband transients. Figure 9 presents an example of a transient, /t/ sound from the word Vietnamese. The input signal in Figure 9A consists of an extended vowel with an embedded transient. The entire signal was encoded using matching pursuit and then optimized. The small set of spikes corresponding to the transient /t/ sound is easily distinguishable from the other spikes. We were able to segment them by hand from the rest of the representation. In Figure 9B, a spike code of only four events (magnified in the inset) is sufficient to encode the transient (see Figure 9A, Reconstruction), leaving only the vowel component (see Figure 9A, Residual). These events are precise in time to within 0.06 msec (the sampling period of the signal). In a spectrogram (see Figure 9C), the same transient is smeared over 10 msec of time and a large region of frequency space. Note that the timescale (x-axis) is the same in Figures 9A, 9B, and 9C. Although this particular consonant is unusually short in duration, this example still illustrates the precise timing and localization achievable with a spike code. 5 Coding Efficiency The previous section shows that spike coding allows time-relative representation of sound structure, but it is not yet demonstrated that this produces an efficient representation of the signal. A complete evaluation of the spike code model requires some objective measure by which to compare the various algorithms and to compare the model against other representational techniques. Shannon's rate distortion theory offers an objective measure of coding efficiency that is widely used in signal coding research (Shannon, 1948). The idea is to vary the rate of a code (typically in terms of bits per second) while measuring the effect this has on some measure of distortion, such as mean squared error.
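As a concrete piece of bookkeeping, the fidelity measure reported throughout can be written as follows. This is the standard SNR-in-dB definition; the paper does not spell out its exact normalization, so treat the details as an assumption.

```python
import math

def fidelity_db(original, reconstruction):
    """Distortion is the squared error between original and reconstruction;
    fidelity (the inverse notion used in the text) is SNR in dB."""
    err = sum((o - r) ** 2 for o, r in zip(original, reconstruction))
    power = sum(o * o for o in original)
    return 10.0 * math.log10(power / err)
```

For example, a reconstruction whose error energy is 1/100 of the signal energy scores 20 dB SNR.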
For the comparisons between spike coding algorithms, we can start simply, varying the rate in terms of spikes per second and measuring the fidelity of the code (the inverse of distortion) in terms
Figure 9: Efficient representation of a speech consonant. An input signal (A, input) is represented as both a spikegram (B) and spectrogram (C). We can reconstruct the signal based on only the four spikes shown in the inset (B) to segment the /t/ sound from the vowel (A, reconstruction and residual).
of dB SNR. We will then address the issue of quantifying coding efficiency more precisely in terms of bits. 5.1 Coding Efficiency in Terms of Spikes. The within-model comparison of encoding algorithms will focus on filter threshold, matching pursuit, and the hybrids, allowing four different algorithm combinations. To generate a measure of coding efficiency, each algorithm was used to encode a large corpus of short (50–200 msec) segments of speech (Garofolo et al., 1990), music, and other "natural" sounds (e.g., birdsong and environmental sounds) at various spike rates. All of the stimuli were in .wav format, sampled at 16 kHz, and band-limited to 80 to 6000 Hz. The leading and trailing portions of the stimuli were multiplied by half-Hanning windows to prevent edge artifacts. The left panel in Figure 10 shows a simplified rate fidelity curve for each algorithm across the entire database. The x-axis indicates the spike rate on a log scale. The y-axis indicates the fidelity in terms of the mean SNR. The computationally simple filter threshold produces a highly redundant, relatively low-fidelity code (less than 11 dB SNR at 10,000 spikes/sec). Its decomposition overrepresents large-amplitude components of signals while devoting relatively few spikes to lower-amplitude components, which may represent distinct sound structure. At large spike rates (more than
Figure 10: Spike coding efficiency curves. Plotted on the left is the increase in fidelity with increasing spike rate for the four spike code algorithms. Plotted on the right is the increase in fidelity with increasing spike rate for matching pursuit using different numbers of gammatones in the filter bank (8, 16, 32, 64, 128, 256).
50,000 spikes/second) a mean reconstruction fidelity of about 20 dB SNR is possible. Codes produced by matching pursuit have much greater fidelity at all rates than those from filter threshold. By decomposing signals into sets of orthogonal components, it eliminates all spectral and temporal redundancies in its representation. At low spike rates, where the code tends to represent nonoverlapping signal structures, matching pursuit appears to generate a near-optimal code. However, at higher rates, the fidelity of the greedy algorithm tends to reach a ceiling, with a mean SNR of less than 60 dB. Although the filter threshold produces a relatively inefficient code, MAP optimization of its spike amplitudes can result in an extremely high-fidelity representation. To achieve this, the filter threshold must generate a set of spike times sufficient to span the signal space. There are two factors responsible for the increased efficiency of optimized codes. First, the gradient-descent optimization eliminates redundant spikes by driving spike amplitudes to zero in accordance with the sparse prior. Second, in minimizing the expected error, the optimization makes use of correlations between kernels, subtly adjusting spike amplitudes rather than eliminating them. The relative contribution of each factor is largely dependent on the amount of noise, ε(t), assumed in the model. Low-noise models preserve spikes and rely on precise signal fitting; high-noise models eliminate most spikes (pushing their amplitudes to zero). While optimization of the filter threshold codes greatly increases their efficiency, further optimization of matching-pursuit spike amplitudes leads to relatively small increases in efficiency at lower bit rates. Equation 3.4
shows that the algorithm decomposes signals into orthogonal components. As such, increases in efficiency cannot result from redundancy reduction through spike elimination. Instead, spike amplitudes are adjusted to make use of correlations between kernels. Using these correlations can prevent the ceilings in coding efficiency found in the raw matching pursuit codes at higher spike rates. The right panel in Figure 10 plots the rate fidelity curves for matching pursuit using different numbers of gammatones. ERB filter banks of 8, 16, 32, 64, 128, and 256 gammatones were produced using Slaney's Matlab toolbox (Slaney, 1998). Generating each filter bank separately rather than using subsets of some fixed large filter bank allows each component filter's bandwidth to vary and better tile the frequency space. The sound ensemble (the same used in the between-algorithms comparison) was encoded with each kernel set while varying the spike rate. The resulting curves show a clear relation between coding efficiency and filter bank size. The progression of lowest to highest curves on the plot exactly follows the number of kernel functions used. Although efficiency increases monotonically with the size of the filter bank (i.e., the number of kernels), the relative gain beyond 64 is extremely small. Additionally, as shown by example earlier in Figure 6, codes produced by 32 or fewer kernels lack a sparse temporal structure in addition to their relatively coarse spectral representation. Figure 10 shows that matching pursuit is highly efficient at low spike rates but is surpassed by the hybrid optimized filter threshold beyond about 25 dB SNR. The reason for this inefficiency at high rates is that matching pursuit often fails to accurately describe true signal structure (Gribonval, Depalle, Rodet, Bacry, & Mallat, 1994; Goodwin & Vetterli, 1999).
Because each component of its code is constrained to be orthogonal (by equation 3.4; see also Mallat & Zhang, 1993), it cannot capture independent signal structure that closely overlaps in time-frequency space. To test matching pursuit's ability to separate overlapping signal structure, a test signal was created by summing pairs of gammatone kernels separated systematically in time (first third of the signal) and in frequency (latter two-thirds of the signal). The spikegrams in Figure 11 show the potential for time-frequency separability using both matching pursuit and MAP optimization. The ideal sparse representation would consist of pairs of spikes at each of the signal "events" except in the two instances where the kernels perfectly overlap. The MAP optimization algorithm generates just such an encoding (top panel). Looking closely at the representation of one pair of clicks separated by 20 msec (top panel, inset), it is clear that two independent events have been coded (allowing perfect reconstruction). In contrast, matching pursuit cannot separate kernels that closely overlap in time and frequency (bottom panel, around 400 and 3600 msec). Matching pursuit's representation of the same 20 msec separated click pair (bottom panel, inset) is clearly very different from the optimal. On the first iteration of the algorithm, it selects a kernel that is lower frequency than the gammatones used to make the signal and centers it between them in time. This means that the representation underestimates the frequency and describes an event when none actually took place. To compensate for this inaccuracy, a large number of additional, low-amplitude kernels are selected on subsequent iterations. With six spikes, it still produces only a 17 dB SNR representation. Nonetheless, the first spike is the single best choice to reduce the residual power. As such, matching pursuit is extremely efficient at low rates. The signal structure initially encoded is typically well separated in time. For example, Figure 5 showed that the initial encoding involves structure that was largely well separated in time and frequency. Tracking the decomposition spike by spike, the kernel that satisfies equation 3.5 on each iteration tends to occur once at each glottal pulse, capturing the largest amount of signal structure possible with a single spike, before returning to encode residual local structure around the same time point. This efficient, low-rate representation can also be generated by MAP optimization by assuming an appropriate degree of noise in the model, but this parameter can be difficult to determine a priori, and the algorithm is much slower.

Figure 11: Spikegrams for a signal made of overlapping gammatones. (Top) MAP optimization finds the underlying structure. An example from one pair of clicks (circled) is shown in the inset. The thick gray curve shows the signal, two approximately 2.8 kHz gammatones separated by 20 msec. The thin dark lines are the kernels found by MAP optimization. (Bottom) Matching pursuit cannot separate kernels that closely overlap in time-frequency. Given the same two clicks, it generates a single high-amplitude event centered between the chirps and numerous low-amplitude events.
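The amplitude refinement under a sparse prior discussed in this section can be sketched as a proximal-gradient (soft-thresholding) iteration. This is a generic stand-in, not the paper's optimizer: here `atoms` are fixed kernel waveforms already placed at their spike times, and the objective assumed is the squared reconstruction error scaled by the noise variance plus a Laplacian-prior penalty λΣ|s_i|, whose proximal step drives weak amplitudes to exactly zero.

```python
def map_amplitudes(signal, atoms, s, noise_var, lam, steps=200, lr=0.5):
    """Refine spike amplitudes s by ISTA-style proximal gradient descent:
    gradient move on the data term, then soft-thresholding for the prior."""
    for _ in range(steps):
        # residual of the current reconstruction
        recon = [sum(si * a[t] for si, a in zip(s, atoms))
                 for t in range(len(signal))]
        resid = [x - r for x, r in zip(signal, recon)]
        thr = lr * lam
        for i, atom in enumerate(atoms):
            # gradient of the data term with respect to s[i]
            grad = -sum(r * a for r, a in zip(resid, atom)) / noise_var
            v = s[i] - lr * grad
            # soft-threshold: small amplitudes are pushed to exactly zero
            s[i] = 0.0 if abs(v) <= thr else v - thr * (1.0 if v > 0 else -1.0)
    return s
```

With orthonormal atoms this converges to the soft-thresholded projection (an amplitude of 3 shrinks to 2.9 for λ = 0.1), illustrating how the prior trades a small amplitude bias for sparsity, and how a high assumed noise level (large effective λ) eliminates most spikes.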
5.2 Coding Efficiency in Terms of Bits. The sparse, shiftable kernel model and a set of algorithms for spike coding have been described in some detail. We now want to quantify the coding efficiency in bits so as to evaluate the model objectively and compare it quantitatively to other signal representations. Rate fidelity again provides a useful objective measure for comparison. Computing the rate fidelity curves begins with the associated pairs of coefficients and time values, {s_i^m, τ_i^m}, which are initially stored as double-precision variables. Storing the original time values referenced to the start of the signal is costly because their range can be arbitrarily large and the distribution of time points is essentially uniform. Storing only the time since the last spike, δτ_i^m, greatly restricts the range and produces a variable that approximately follows a gamma distribution. Rate fidelity curves are generated by varying the precision of the code, {s_i^m, δτ_i^m}, and computing the resulting fidelity through reconstruction. A uniform quantizer is used to vary the precision of the code between 1 and 16 bits. At all levels of precision, the bin widths for quantization are selected so that equal numbers of values fall in each bin. All s_i^m or δτ_i^m that fall within a bin are recoded to have the same value. We use the mean of the unquantized values that fell within the bin. s_i^m and δτ_i^m are quantized independently. We found that δτ_i^m for gammatones with low center frequencies required much less precision than for higher-frequency gammatones. Accordingly, the temporal precision for each kernel function was normalized with respect to its wavelength so that the same error during quantization would produce the same relative displacement with respect to a kernel's wavelength. Treating the quantized values as samples from a random variable, we estimate a code's entropy (bits/coefficient) from histograms of the values.
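The equal-occupancy quantizer and histogram entropy estimate described above can be sketched as follows. This is a simplified stand-in: edge handling when the bin count does not divide the number of values evenly is glossed over.

```python
import math
from collections import Counter

def quantize_equal_count(values, n_bins):
    """Quantizer with bin edges chosen so equal numbers of values fall in
    each bin; every value in a bin is recoded as the mean of the
    unquantized values that fell within it."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    per_bin = len(values) / n_bins
    codes, recoded = [0] * len(values), [0.0] * len(values)
    for b in range(n_bins):
        idx = order[int(b * per_bin): int((b + 1) * per_bin)]
        mean = sum(values[i] for i in idx) / len(idx)
        for i in idx:
            codes[i], recoded[i] = b, mean
    return codes, recoded

def entropy_bits(codes):
    """Entropy estimate (bits/coefficient) from the histogram of bin labels."""
    counts = Counter(codes)
    n = len(codes)
    return -sum(c / n * math.log2(c / n) for c in counts.values())
```

The rate for a signal is then this entropy (bits/coefficient) times the number of coefficients per second, matching the bookkeeping in the text; for an equal-occupancy quantizer the entropy equals log2 of the number of bins.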
Rate is then the product of the estimated entropy of the quantized variables and the number of coefficients per second for a given signal. At each level of precision, the signal is reconstructed based on the quantized values, and an SNR for the code is computed. This process was repeated across a set of signals, and the results were averaged to produce rate fidelity curves. Matching pursuit was used to estimate the {s_i^m, δτ_i^m} pairs for these rate fidelity curves. Coding efficiency can be measured in nearly identical fashion for other signal representations. In addition to spike codes, rate fidelity curves were generated for four other signal representation methods using the same set of sounds. The two most common methods for signal processing are the Fourier and wavelet transforms. Fourier coefficients were obtained for each signal via the fast Fourier transform. The real and imaginary parts were quantized independently, and the rate was based on the estimated entropy of the quantized coefficients. Reconstruction was simply an inverse Fourier transform on the quantized coefficients. Similarly, coding efficiency using eighth-order Daubechies wavelets was estimated using Matlab's discrete wavelet transform and inverse wavelet transform functions. As a baseline for comparison, rate fidelity curves were produced for the waveform of time-varying amplitude values. The fidelity was determined by quantizing the amplitude values and computing the SNR directly from the resulting signal. Two different methods were used to determine the rate. In the compressed case, the rate was based on the estimated entropy of the quantized values, just as above. In the raw case, the cost of each coefficient was equal to the quantization level rather than the estimated entropy.

Figure 12: Rate fidelity curves for the raw (uncompressed) time-varying signals, compressed signals, Fourier transform, discrete wavelet transform, and spike coding using matching pursuit for speech (left) and music (right).

Figure 12 shows the rate fidelity curves calculated for two classes of sounds: speech (from the TIMIT database; Garofolo et al., 1990) and music (a wide variety of instrumental pieces collected from the Internet). Compressed, Fourier, wavelet, and spike representations are all far more efficient than the raw signal over the range of rate fidelity parameters we analyzed. At rates below 40 Kbps, spike codes produce the most efficient representations of both speech and music. For example, between 10 and 20 Kbps, the fidelity of the spike representation of speech is approximately twice that of either the Fourier or wavelet transformations. At higher bit rates (above 60 Kbps), the Fourier and wavelet representations produce much higher rate fidelity curves than either spike codes or compressed signals. At high rates (off the scale shown), the wavelet representation is most efficient, followed by Fourier. In Figure 10, the filter threshold hybrid code also exceeded the matching pursuit–generated code, in this case beyond 25 dB SNR. This offers some evidence that optimized spike codes might exceed the efficiency of Fourier and wavelets throughout the range of perceptually discriminable fidelities.
6 Discussion We have presented a theoretically motivated model for sound coding in which the computational goal is to form an efficient, time-relative representation of the time-varying amplitude signal. Signals are represented by decomposing them into a minimal number of gammatone acoustic events, each with an associated time and amplitude. We have shown that this yields efficient representations of transient signals, allowing highly precise representations of sound onset. We have also shown that over a large range of fidelities, this representation is more efficient than Fourier or wavelet representations. The research presented here adds to the literature on efficient sound representation in several ways. It expands on the original spike coding model (Lewicki & Sejnowski, 1999; Lewicki, 2002b) by presenting a new encoding algorithm, objectively measuring the efficiency of spike codes and comparing them against Fourier and wavelet representations, and providing a detailed analysis of the representation with specific examples of its strengths and weaknesses. This work also addresses a number of issues relating to matching pursuit. Goodwin and Vetterli (1999) proposed using damped sinusoids with matching pursuit to avoid problems arising from using symmetric wavelets to represent sound. We demonstrated the potential of using a physiologically derived set of gammatone kernels for sound representation rather than abstract sinusoids. The success of this approach lends support to the idea that these functions are adapted to the statistics of natural sounds (Lewicki, 2002a). Additionally, we presented an analysis of the relationship between matching pursuit codes and those derived via MAP optimization. Finally, we used rate distortion to compare matching pursuit against other approximate spike coding algorithms. Another view of these results is that we have shown a method for achieving significantly greater coding efficiency using what is in effect an overcomplete basis.
A motivating goal for so-called overcomplete representations or atomic decompositions with overcomplete dictionaries (Mallat & Zhang, 1993; Chen, Donoho, & Saunders, 1996; Goodwin & Vetterli, 1999; Lewicki & Sejnowski, 2000) is to draw from a rich vocabulary of signal descriptors to find compact signal representations. The method we have described here can be viewed in these terms, because the kernel functions φm (t) can be arbitrarily numerous in their center frequencies and can be placed at arbitrary points in time (Lewicki & Sejnowski, 1999; Lewicki, 2002b). In previous studies, it has been difficult to show that the sparse representations made possible with overcomplete representations could actually yield demonstrably more efficient codes. The general problem is that the added cost of coding an overcomplete set of coefficients often outweighs any gain achieved in representation sparseness (Lewicki & Olshausen, 1999; Lewicki & Sejnowski, 2000). The advantage offered by the approach here is that it is not necessary for the code to describe each implicit coefficient (i.e., at every
sample position). Instead, it is sufficient to describe the time intervals between the spikes. This yielded a much more efficient code for two reasons. The first is that the use of gammatones is well matched to the underlying structure of speech and music. The second is that the matching-pursuit algorithm achieves highly sparse representations. This is crucial, because optimization with filter threshold yields a highly redundant code that does not show increased coding efficiency. It is possible that improvements in either the kernel functions or the encoding algorithms could yield spike codes with even greater efficiencies, and thus provide improved methods for representing the underlying signal structure. We will address a few assumptions made in our model. The first is that we have assumed an explicit generative model and assessed performance by computing the fidelity of the reconstructed signal. It is possible, for example, that the simple filter threshold algorithm does achieve a high-fidelity encoding of the signal, but we do not know how to reconstruct it. It was our motivation, however, to develop efficient codes that can reveal the intrinsic information of a signal. Our assumption of a linear superposition of kernel functions is a simple means of evaluating the degree of redundancy in the representation. Another assumption concerns our choice of distortion measure for evaluating coding efficiency. The reported signal-to-noise values are based on the sum-squared error between the original signal and the reconstruction. It is well known that this measure of distortion does not agree closely with human perception. For example, gaussian white noise signals would tend to appear very dissimilar using this measure, yet they all sound the same to a listener. 
Although a perceptual distortion measure would offer greater insight into the relationship between the algorithms described here and human perception, the sum-squared error is adequate to show their relative coding efficiency as general signal representations. Beyond the questions of general signal processing, we are particularly interested in the use of spike coding algorithms as models of neural auditory processing. The analog amplitude values in the model can also be interpreted as representing a local population of auditory nerve spikes. As a theoretical model of auditory coding, this posits that the purpose of the (binary) spikes at the auditory nerve is to encode as accurately as possible the temporal position and amplitude of underlying acoustic features, which are compactly described by gammatones. Analog spikes are a useful theoretical abstraction, and it is simple to convert these individual spikes into a population of probabilistically firing, binary units that can carry the same information. Some possible neural network architectures along these lines are described in Lewicki (2002b). One potential concern for describing cochlear processing with the model presented here is that it lacks the explicit representation of the nonlinearities found in detailed cochlear models. Our goal, however, was to construct a coding algorithm motivated by higher-level principles.
We have not yet investigated whether any cochlear nonlinearities can be interpreted in terms of the assumed computational objective. For example, matching pursuit and gradient optimization can be viewed as means of selecting out a subset of the spikes that removes interspike redundancy and yields an efficient representation. This could offer a novel theoretical interpretation of two-tone inhibition. The addition of nonlinearities may still not help describe the intrinsic coding processes of the system. For example, the model by Yang, Wang, and Shamma (1992) is spike based and contains numerous biophysically motivated nonlinearities, but the fidelity of their reconstructions falls in the same range as filter threshold and is much lower than that of matching pursuit or the hybrids. They reported SNR ranging from 11 to 25 dB, depending on the extent of processing. This underscores the complexity of the signal encoding problem solved by the peripheral auditory system. We have sought to develop an algorithm that extracts the intrinsic information in a signal from the stream of observable information. As was stated earlier, information reduction can be achieved by either selecting only the desired information or removing redundancy. The choice of kernel functions biases our model to “select” certain types of signal structure over others. Our choice of a gammatone filter bank, a highly efficient basis for natural sounds, biases the model to select the underlying structures of natural sounds. The algorithms for generating spike codes take the approach of redundancy reduction. Reducing the temporal correlations shows promise for yielding better methods to extract intrinsic structure from raw acoustic signals.
Acknowledgments E.S. was supported by NIH training grant MH19983. This material is based on work supported by the National Science Foundation under grant 0238351 to M.L.
References
Baumgarte, F. (2002). Improved audio coding using a psychoacoustic model based on a cochlear filter bank. IEEE Transactions on Speech and Audio Processing, 10(7), 495–503.
Blauert, J. (1997). Spatial hearing (rev. ed.). Cambridge, MA: MIT Press.
Chen, S. S., Donoho, D. L., & Saunders, M. A. (1996). Atomic decomposition by basis pursuit. SIAM Review, 43(1), 129–159.
Garofolo, J. S., Lamel, L. F., Fisher, W. M., Fiscus, J. G., Pallett, D. S., Dahlgren, N. L., & Zue, V. (1990). TIMIT acoustic-phonetic continuous speech corpus. Audio CD.
Ghitza, O. (1988). Temporal non-place information in the auditory-nerve firing patterns as a front-end for speech recognition in noisy environments. J. Phonetics, 16, 109–124.
Goodwin, M. M., & Vetterli, M. (1999). Matching pursuit and atomic signal models based on recursive filter banks. IEEE Transactions on Signal Processing, 47(7), 1890–1902.
Gribonval, R., Depalle, P., Rodet, X., Bacry, E., & Mallat, S. (1994). Sound signals decomposition using a high resolution matching pursuit. In Proceedings of the International Computer Music Conference (pp. 293–296).
Lewicki, M. S. (2002a). Efficient coding of natural sounds. Nature Neuroscience, 5(4), 356–363.
Lewicki, M. S. (2002b). Efficient coding of time-varying patterns using a spiking population code. In R. P. N. Rao, B. A. Olshausen, & M. S. Lewicki (Eds.), Probabilistic models of the brain: Perception and neural function (pp. 241–255). Cambridge, MA: MIT Press.
Lewicki, M. S., & Olshausen, B. A. (1999). A probabilistic framework for the adaptation and comparison of image codes. Journal of the Optical Society of America A, 16(7), 1587–1601.
Lewicki, M. S., & Sejnowski, T. J. (1999). Coding time-varying signals using sparse, shift-invariant representations. In M. S. Kearns, S. Solla, & D. Cohn (Eds.), Advances in neural information processing systems, 11 (pp. 730–736). Cambridge, MA: MIT Press.
Lewicki, M. S., & Sejnowski, T. J. (2000). Learning overcomplete representations. Neural Computation, 12, 337–365.
Liberman, A., Delattre, P., & Cooper, F. (1958). Some cues for the distinction between voiced and voiceless stops in initial position. Language and Speech, 1, 905–917.
Lyon, R. F. (1982). A computational model of filtering, detection and compression in the cochlea. In Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing (pp. 1282–1285).
Mallat, S. G., & Zhang, Z. (1993). Matching pursuits with time-frequency dictionaries. IEEE Transactions on Signal Processing, 41(12), 3397–3415.
Oertel, D. (1999).
The role of timing in the brain stem auditory nuclei of vertebrates. Annual Review of Physiology, 61, 497–519.
Olshausen, B. A. (2002). Sparse codes and spikes. In R. P. N. Rao, B. A. Olshausen, & M. S. Lewicki (Eds.), Probabilistic models of the brain: Perception and neural function (pp. 257–272). Cambridge, MA: MIT Press.
Patterson, R., Holdsworth, J., Nimmo-Smith, I., & Rice, P. (1988). Implementing a gammatone filterbank (Tech. Rep. No. 2341). Cambridge, UK: MRC Applied Psychology Unit.
Rabiner, L. R., & Levinson, S. E. (1981). Isolated and connected word recognition: Theory and selected applications. IEEE Trans. Communications, COM-25(5), 621–659.
Shamma, S. (1985). Speech processing in the auditory system: II. Lateral inhibition and the central processing of speech-evoked activity in the auditory nerve. Journal of the Acoustical Society of America, 78, 1622–1632.
Efficient Coding of Time-Relative Structure Using Spikes
45
Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27, 379–423, 623–656. Slaney, M. (1998). Auditory toolbox (Tech. Rep. No. 1998-010). Palo Alto, CA: Interval Research Corporation. Slaney, M., & Lyon, R. F. (1993). On the importance of time—A temporal representaton of sound. In M. Cooke, S. Beet, & M. Crawford (Eds.), Visual representations of speech signals (pp. 95–116). New York: Wiley. Yang, X., Wang, K., & Shamma, S. (1992). Auditory representations of acoustic signals. IEEE Trans. Inf. Theory, 38, 824–839. Received March 11, 2004; accepted June 4, 2004.
LETTER
Communicated by Bruno Olshausen
Bilinear Sparse Coding for Invariant Vision

David B. Grimes
[email protected]
Rajesh P. N. Rao
[email protected]
Department of Computer Science and Engineering, University of Washington, Seattle, WA 98195-2350, U.S.A.
Recent algorithms for sparse coding and independent component analysis (ICA) have demonstrated how localized features can be learned from natural images. However, these approaches do not take image transformations into account. We describe an unsupervised algorithm for learning both localized features and their transformations directly from images using a sparse bilinear generative model. We show that from an arbitrary set of natural images, the algorithm produces oriented basis filters that can simultaneously represent features in an image and their transformations. The learned generative model can be used to translate features to different locations, thereby reducing the need to learn the same feature at multiple locations, a limitation of previous approaches to sparse coding and ICA. Our results suggest that by explicitly modeling the interaction between local image features and their transformations, the sparse bilinear approach can provide a basis for achieving transformation-invariant vision. 1 Introduction Algorithms for redundancy reduction and efficient coding have been the subject of considerable attention (Olshausen & Field, 1996, 1997; Bell & Sejnowski, 1997; Hinton & Ghahramani, 1997; Rao & Ballard, 1999; Lewicki & Sejnowski, 2000; Schwartz & Simoncelli, 2001). Although the basic ideas can be traced to earlier work (Attneave, 1954; Barlow, 1961), recent techniques such as independent component analysis (ICA) and sparse coding have helped formalize these ideas and have demonstrated the feasibility of efficient coding through redundancy reduction. These techniques produce an efficient code by using appropriate constraints to minimize the dependencies between elements of the code. One of the most successful applications of ICA and sparse coding has been in the area of image coding. 
Olshausen and Field (1996, 1997) showed that sparse coding of natural images produces localized, oriented basis filters that resemble the receptive fields of simple cells in primary visual cortex. Neural Computation 17, 47–73 (2005)
© 2004 Massachusetts Institute of Technology
Bell and Sejnowski (1997) obtained similar results using their algorithm for ICA. However, these approaches do not take image transformations such as translation into account. Thus, the model cannot take advantage of the fact that certain basis features model the same image features but under different transformations. As a result, for each oriented feature, a number of independent units must code for the same feature at different locations, making it difficult to scale the approach to large image patches and hierarchical networks. In this letter, we propose an approach to sparse coding that explicitly models the interaction between image features and their transformations. A bilinear generative model is used to learn both the independent features in an image as well as their transformations. Our approach extends Tenenbaum and Freeman's (2000) work on bilinear models for learning content and style by casting the problem within a probabilistic sparse coding framework. Thus, whereas prior work on bilinear models used global decomposition methods such as singular value decomposition (SVD), the approach presented here emphasizes the extraction of local features by removing higher-order redundancies through sparseness constraints. We show that for natural images, this approach produces localized, oriented filters that can be translated by different amounts to account for image features at arbitrary locations. Our results demonstrate how an image can be factored into a set of basic local features and their transformations, providing a basis for transformation-invariant vision. In particular, we focus on the problem of invariance to transformations caused by moving objects or smooth self-motion. We assume that for a given object, the goal is to estimate the motion of the object and learn bilinear features for both the object and its transformation.
We conclude by discussing related work and suggest an extension of the approach to parts-based object recognition, wherein an object is modeled as a collection of local features (or “parts”) and their relative transformations.

2 Bilinear Generative Models

We begin by considering the standard linear generative model used in algorithms for ICA and sparse coding (Bell & Sejnowski, 1997; Olshausen & Field, 1997; Rao & Ballard, 1999):

z = \sum_{i=1}^{m} w_i x_i = Wx,    (2.1)
where z is a k-dimensional input vector (for instance, an image), wi is a k-dimensional basis vector, and xi is its scalar coefficient. Given the linear generative model above, the goal of ICA is to learn the basis vectors wi (i.e., the matrix W) such that the xi are as independent as possible, while the goal in sparse coding is to make the distribution of xi highly kurtotic given equation 2.1.
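To make the linear model concrete, equation 2.1 can be sketched in a few lines of NumPy (the random basis, the dimensions, and the kurtosis helper below are illustrative assumptions, not taken from the letter):

```python
import numpy as np

rng = np.random.default_rng(0)

k, m = 100, 100                       # pixels per patch, number of basis vectors
W = rng.standard_normal((k, m))       # basis matrix whose columns are the w_i

# A sparse coefficient vector x: mostly zeros, a few large entries
x = np.zeros(m)
x[rng.choice(m, size=3, replace=False)] = 5.0 * rng.standard_normal(3)

# Equation 2.1: z = W x
z = W @ x

def excess_kurtosis(v):
    """Sample excess kurtosis; 0 for a gaussian, large for sparse codes."""
    v = v - v.mean()
    return (v**4).mean() / (v**2).mean()**2 - 3.0

assert z.shape == (k,)
assert excess_kurtosis(x) > 0.0       # the sparse code is supergaussian
```

The final check reflects the sparse coding goal stated above: a code with a few large coefficients and many zeros has a highly kurtotic (supergaussian) distribution.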
Now consider adding a transformation parameterized by λ to the generative process of an image z, so that z = T_λ(Wx). Given the linear model as described above, as λ varies, the coefficients x will need to change accordingly. Transformation-invariance methods seek to model the image formation process in such a way that x is independent of (or at least uncorrelated with) λ. A simple method for achieving invariance is to introduce another variable y, which accounts for the changes in the image due to the transformation T_λ. Invariance is achieved once y is known, because x and λ are conditionally independent given y. Thus, the key requirement of any such model is that y can easily be inferred. Using a probabilistic approach, we specify the form of the image likelihood function P(z|x, y). To model this likelihood, we introduce an interaction function f(x, y) that models the interactions between x and y in the image formation process. For an additive gaussian noise model, the likelihood becomes P(z|x, y) = G(z; f(x, y), σ²). The function f(x, y) should not only be able to represent the transformations of interest but must also be invertible in the sense that x and/or y can be inferred given z and possibly one of x or y. Perhaps the simplest choice of f is the linear function f(x, y) = Wx + W′y. Unfortunately, this model is too impoverished to represent most common classes of transformations, such as affine transformations in the image plane. A logical next step is to consider multiplicative interactions between x and y. In this work, we explore the use of the bilinear function, which is the simplest form of f allowing multiplicative interactions. The linear generative model in equation 2.1 can be extended to the bilinear case by using two sets of coefficients x_i and y_j (or equivalently, two vectors x and y) (Tenenbaum & Freeman, 2000):

z = f(x, y) = \sum_{i=1}^{m} \sum_{j=1}^{n} w_{ij} x_i y_j.    (2.2)
The coefficients x_i and y_j jointly modulate a set of basis vectors w_{ij} to produce an input vector z. For this study, the coefficient x_i can be regarded as encoding the presence of object feature i in the image, while the y_j values determine the transformation present in the image. In the terminology of Tenenbaum and Freeman (2000), x describes the content of the image, while y encodes its style. Equation 2.2 can also be expressed as a linear equation in x for a fixed y:

z = f(x)|_y = \sum_{i=1}^{m} \left( \sum_{j=1}^{n} w_{ij} y_j \right) x_i = \sum_{i=1}^{m} w_i^y x_i.    (2.3)
The notation w_i^y signifies a transformed feature, computed by the weighted sum shown above of the bilinear features w_{i,*} with the values in a given y
Figure 1: Examples of linear and bilinear features. A comparison of learned features between a standard linear model and a bilinear model, both trained using sparseness constraints to obtain localized, independent features. The two rows in the bilinear case depict the translated object features w_i^y (see equation 2.3) for different y vectors corresponding to translations of −3, . . . , 3 pixels.
vector. Likewise, for a fixed x, one obtains a linear equation in y. Indeed, this is the definition of bilinear: given one fixed factor, the model is linear with respect to the other factor. The power of bilinear models stems from the rich nonlinear interactions that can be represented by varying both x and y simultaneously. Note that the standard linear generative model (see equation 2.1) can be seen as a special case of the bilinear model when n = 1 and y = 1. A comparison between examples of features used in the linear generative model and the bilinear model is given in Figure 1. The features in the linear model represent a single instance within the range of features that can be learned by the bilinear model.

3 Learning Sparse Bilinear Models

3.1 Learning Bilinear Models. Our goal is to learn from image data an appropriate set of basis vectors w_{ij} that effectively describe the interactions between the feature vector x and the transformation vector y. A commonly used approach in unsupervised learning is to minimize the sum of squared pixel-wise errors over all images:

E_1({w_{ij}}, x, y) = \left\| z − \sum_{i=1}^{m} \sum_{j=1}^{n} w_{ij} x_i y_j \right\|^2    (3.1)
                    = \left( z − \sum_{i=1}^{m} \sum_{j=1}^{n} w_{ij} x_i y_j \right)^T \left( z − \sum_{i=1}^{m} \sum_{j=1}^{n} w_{ij} x_i y_j \right),    (3.2)
where ‖·‖ denotes the L2 norm of a vector. A standard approach to minimizing such a function is to use gradient descent and alternate between minimization with respect to {x, y} and minimization with respect to w_{ij}. Unfortunately, the optimization problem as stated is underconstrained. The function E_1 has many local minima, and results from our simulations indicate that convergence is not obtainable with data drawn from natural images. There are many different ways to represent an image, making it difficult for the method to converge to a basis set that can effectively represent images that were not in the training set. A related approach is presented by Tenenbaum and Freeman (2000) in their article dealing with style and content separation. Rather than using gradient descent, their method estimates the parameters directly by computing the SVD of a matrix A containing input data corresponding to each content class in every style. Their approach can be regarded as an extension of methods based on principal component analysis (PCA) applied to the bilinear case. The SVD approach avoids the difficulties of convergence that plague the gradient-descent method and is much faster in practice. Unfortunately, the learned features tend to be global and nonlocalized, similar to those obtained from PCA-based methods that rely on second-order statistics. As a result, the method is unsuitable for the problem of learning local features of objects and their transformations. The underconstrained nature of the problem can be remedied by imposing constraints on x and y. In particular, we cast the problem within a probabilistic framework and impose specific prior distributions on x and y with higher probabilities for values that achieve certain desirable properties.
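Before priors are introduced, the bare bilinear model and its squared error (equations 2.2, 2.3, and 3.1) can be sketched in NumPy. The data and dimensions below are illustrative, not from the experiments reported later:

```python
import numpy as np

rng = np.random.default_rng(1)
k, m, n = 64, 10, 5                    # patch dim, content dim, style dim

W = rng.standard_normal((k, m, n))     # bilinear basis vectors w_ij
x = rng.standard_normal(m)             # content coefficients
y = rng.standard_normal(n)             # style coefficients

# Equation 2.2: z = sum_i sum_j w_ij x_i y_j
z = np.einsum('kmn,m,n->k', W, x, y)

# Equation 2.3: for a fixed y the model is linear in x, with
# transformed features w_i^y = sum_j w_ij y_j
Wy = np.einsum('kmn,n->km', W, y)      # columns are the w_i^y
assert np.allclose(z, Wy @ x)

# Equation 3.1: squared reconstruction error for a target patch
z_target = rng.standard_normal(k)
E1 = np.sum((z_target - z)**2)
assert E1 >= 0.0
```

The `allclose` check makes the bilinearity explicit: holding y fixed collapses the three-way tensor into an ordinary linear generative model with transformed basis vectors.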
We focus here on the class of sparse prior distributions for several reasons: (1) by forcing most of the coefficients to be zero for any given input, sparse priors minimize redundancy and encourage statistical independence between the various x_i and between the various y_j (Olshausen & Field, 1997); (2) there is some evidence for sparse representations in the brain (Földiák & Young, 1995): the distribution of neural responses in visual cortical areas is typically highly kurtotic, that is, cells exhibit little activity for most inputs but respond vigorously for a few inputs, causing a distribution with a high peak near zero and long tails; (3) previous approaches based on sparseness constraints have obtained encouraging results (Olshausen & Field, 1997); and (4) enforcing sparseness on the x_i encourages the parts and local features shared across objects to be learned, while imposing sparseness on the y_j allows object transformations to be explained in terms of a small set of basis vectors.
3.2 Probabilistic Bilinear Sparse Coding. Our probabilistic model for bilinear sparse coding follows a standard Bayesian MAP (maximum a posteriori) approach. Thus, we begin by factoring the posterior probability of
the parameters given the data as

P(x, y, {w_{ij}} | z) ∝ P(z | x, y, {w_{ij}}) \prod_{i=1}^{m} P(x_i) \prod_{j=1}^{n} P(y_j) P({w_{ij}})    (3.3)

∝ P(z | x, y, {w_{ij}}) \prod_{i=1}^{m} P(x_i) \prod_{j=1}^{n} P(y_j).    (3.4)

Equation 3.3 assumes independence between x, y, and {w_{ij}}, as well as independence within the individual dimensions of x and y. Equation 3.4 assumes a uniform prior for P({w_{ij}}), which is thus ignored. We assume the following priors for x_i and y_j:

P(x_i) = \frac{1}{Q_α} e^{−α S(x_i)}    (3.5)

P(y_j) = \frac{1}{Q_β} e^{−β S(y_j)},    (3.6)
where Q_α and Q_β are normalization constants, α and β are parameters that control the degree of sparseness, and S is a “sparseness function.” For this study, we used S(a) = log(1 + a²). As shown in Figure 2, our choice of S(a) corresponds to a Cauchy prior distribution, which exhibits a useful nonlinearity in the derivative S′(a). The squared error function E_1 in equation 3.2 can be interpreted as representing the negative log likelihood (−log P(z|x, y, {w_{ij}})) under the assumption of gaussian noise with unit variance (see, e.g., Olshausen & Field, 1997).
Figure 2: A probabilistic sparse coding prior. (a) The probability distribution function for the Cauchy sparse coding prior. Although the distribution appears similar to a gaussian distribution, the Cauchy is supergaussian (highly kurtotic). (b) The derived sparseness error function. (c) The nonlinearity introduced in the derivative of the sparseness function. Note that the function differentially forces small coefficients toward zero, and only at some threshold are large coefficients made larger.
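The sparseness function and its derivative are simple enough to verify numerically. The following sketch (illustrative, not the authors' code) checks the properties depicted in Figure 2:

```python
import numpy as np

def S(a):
    """Sparseness function S(a) = log(1 + a^2), the negative log of a
    Cauchy density up to an additive constant."""
    return np.log(1.0 + a**2)

def S_prime(a):
    """Derivative S'(a) = 2a / (1 + a^2)."""
    return 2.0 * a / (1.0 + a**2)

a = np.linspace(-1.0, 1.0, 201)
# The derivative is bounded, |S'(a)| <= 1, peaking at a = ±1: small
# coefficients feel a nearly linear pull toward zero, while the pull
# on large coefficients saturates.
assert np.all(np.abs(S_prime(a)) <= 1.0 + 1e-12)
assert abs(S_prime(1.0) - 1.0) < 1e-12
assert S(0.0) == 0.0 and S_prime(0.0) == 0.0
```

This bounded derivative is the nonlinearity referred to in the caption: unlike a quadratic penalty, the Cauchy prior stops penalizing coefficients harder as they grow large, which favors a few large coefficients over many small ones.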
Maximizing the posterior in equation 3.3 is thus equivalent to minimizing the following log posterior function over all input images:

E({w_{ij}}, x, y) = \left\| z − \sum_{i=1}^{m} \sum_{j=1}^{n} w_{ij} x_i y_j \right\|^2 + α \sum_{i=1}^{m} S(x_i) + β \sum_{j=1}^{n} S(y_j).    (3.7)

The gradient of E can be used to derive update rules at time t for the components x_a and y_b of the feature vector x and transformation vector y, respectively, for any image z, assuming a fixed basis w_{ij}:

\frac{dx_a}{dt} = −\frac{1}{2} \frac{∂E}{∂x_a} = \sum_{q=1}^{n} w_{aq}^T \left( z − \sum_{i=1}^{m} \sum_{j=1}^{n} w_{ij} x_i y_j \right) y_q − \frac{α}{2} S′(x_a)    (3.8)

\frac{dy_b}{dt} = −\frac{1}{2} \frac{∂E}{∂y_b} = \sum_{q=1}^{m} w_{qb}^T \left( z − \sum_{i=1}^{m} \sum_{j=1}^{n} w_{ij} x_i y_j \right) x_q − \frac{β}{2} S′(y_b).    (3.9)

Given a training set of inputs z^l, the values for x and y for each image after convergence can be used to update the basis set w_{ij} in batch mode according to

\frac{dw_{ab}}{dt} = −\frac{1}{2} \frac{∂E}{∂w_{ab}} = \sum_{l=1}^{q} \left( z^l − \sum_{i=1}^{m} \sum_{j=1}^{n} w_{ij} x_i y_j \right) x_a y_b.    (3.10)

One difficulty in the sparse coding formulation of equation 3.7 is that the algorithm can trivially minimize the sparseness function by making x or y very small and compensating by increasing the w_{ij} basis vector norms to maintain the desired output range. Therefore, as previously suggested (Olshausen & Field, 1997), in order to keep the basis vectors from growing without bound, we adapt the L2 norm of each basis vector in such a way that the variance of x_i (and y_j, as discussed below) was maintained at a fixed desired level (σ_g²). Simply forcing the basis vectors to have a certain norm can lead to instabilities; therefore, a “soft” variance normalization method was employed. The element-wise variance of the x vectors inferred during a single batch iteration was tracked in the vector x_var and adapted at a rate given by the parameter ε (see algorithm 1, line 18). A gain term g_x is computed (see lines 19–20 in algorithm 1), which determines the multiplicative factor for adapting the norm of a particular basis vector:

ŵ_{ij} = g_{x,i} \frac{w_{ij}}{\| w_{ij} \|_2}.    (3.11)
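The coupled update rules 3.8 to 3.10 can be sketched in NumPy as follows. This is a toy check on random data with illustrative dimensions; the energy helper E mirrors equation 3.7 and is not the paper's code:

```python
import numpy as np

rng = np.random.default_rng(2)
k, m, n = 64, 10, 5
W = 0.1 * rng.standard_normal((k, m, n))
x = rng.standard_normal(m)
y = rng.standard_normal(n)
z = rng.standard_normal(k)
alpha, beta = 1.0, 0.1

def S(a):
    return np.log(1.0 + a**2)

def S_prime(a):
    return 2.0 * a / (1.0 + a**2)

def residual(W, x, y, z):
    return z - np.einsum('kmn,m,n->k', W, x, y)

def E(W, x, y, z):
    """Energy of equation 3.7."""
    r = residual(W, x, y, z)
    return r @ r + alpha * np.sum(S(x)) + beta * np.sum(S(y))

def dx_dt(W, x, y, z):
    """Equation 3.8: -(1/2) dE/dx."""
    r = residual(W, x, y, z)
    return np.einsum('kan,k,n->a', W, r, y) - 0.5 * alpha * S_prime(x)

def dy_dt(W, x, y, z):
    """Equation 3.9: -(1/2) dE/dy."""
    r = residual(W, x, y, z)
    return np.einsum('kmb,k,m->b', W, r, x) - 0.5 * beta * S_prime(y)

def dW_dt(W, x, y, z):
    """Equation 3.10 for a single patch: -(1/2) dE/dW."""
    r = residual(W, x, y, z)
    return np.einsum('k,a,b->kab', r, x, y)

# A small step along any of these directions should not increase E
e0 = E(W, x, y, z)
assert E(W, x + 1e-3 * dx_dt(W, x, y, z), y, z) <= e0
assert E(W, x, y + 1e-3 * dy_dt(W, x, y, z), z) <= e0
assert E(W + 1e-3 * dW_dt(W, x, y, z), x, y, z) <= e0
```

The assertions confirm that each rule is a descent direction for the joint energy, which is what the negative half-gradients in equations 3.8 to 3.10 express.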
An additional complication in the bilinear case is that ‖w_{ij}‖_2 is related to the variance of both x_i and y_j. One possible solution is to compute a joint gain matrix G (which specifies a gain G_{ij} for each w_{ij} basis vector) as the element-wise geometric mean of the gain vectors g_x and g_y:

G = \sqrt{g_x g_y^T}.    (3.12)
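Equations 3.11 and 3.12 amount to rescaling each basis vector's norm by a per-component gain. A NumPy sketch with hypothetical gain vectors (the gains here are random stand-ins for the variance-derived gains of algorithm 1):

```python
import numpy as np

rng = np.random.default_rng(3)
k, m, n = 64, 10, 5
W = rng.standard_normal((k, m, n))

# Hypothetical per-component gains, standing in for those derived from
# the tracked variances of x and y
g_x = rng.uniform(0.5, 1.5, size=m)
g_y = rng.uniform(0.5, 1.5, size=n)

# Equation 3.12: joint gain as the element-wise geometric mean,
# G_ij = sqrt(g_x,i * g_y,j)
G = np.sqrt(np.outer(g_x, g_y))

# Equation 3.11 with the joint gain: rescale each basis vector w_ij
# so that its L2 norm equals G_ij
norms = np.linalg.norm(W, axis=0)      # (m, n) matrix of ||w_ij||_2
W_hat = W / norms * G                  # broadcasts over the pixel axis

assert np.allclose(np.linalg.norm(W_hat, axis=0), G)
```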
However, in the case where sparseness is desired for x but not y (i.e., β = 0.0), the variance of y will rapidly increase, as the variance of x rapidly decreases, and no perturbations to the norm of the basis vectors wij will solve this problem. To avoid this problem, the algorithm performs soft variance normalization directly on the evolving y vectors, and scales the basis vectors wij based only on the variance of xi (see algorithm 1, lines 12–14). 3.3 Algorithm for Learning Bilinear Models of Translating Image Patches. This section describes our unsupervised learning algorithm that uses the update rules (see equations 3.8–3.10) to learn localized bilinear features in natural images for two-dimensional translations. Figure 3 presents
Figure 3: Example training data from natural scene images. Training data are formed by randomly selecting patch locations from a set of natural images (a). (b) The patch is then transformed to form the training set of patches z_{ij}. In this case, the patch is shifted using horizontal translations of ±2 pixels. To learn a model of style/content separation, a single x vector is used to represent each image in a column, and a single y vector represents each image in a row.
a high-level view of the training paradigm in which patches are randomly selected from larger images and subsequently transformed. We initially tested the application of the gradient-descent rules simultaneously to estimate {w_{ij}}, x, and y. Unfortunately, obtaining convergence reliably was rather difficult in this situation, due to a degeneracy in the model in the form of an unconstrained degree of freedom: given a constant c, there is ambiguity in the model, since P(z|cx, (1/c)y) = P(z|x, y). Our use of the priors P(x) and P(y) largely mitigates problems stemming from this degeneracy, yet oscillations are still possible when both x and y are adapted simultaneously. Fortunately, we found that minimizing E({w_{ij}}, x, y) with respect to a single variable until near convergence yields good results, particularly when combined with a batch derivative approach. This approach of iteratively performing MAP estimation with respect to a single variable at a time is known within the statistics community as iterated conditional modes (ICM) (Besag, 1986). ICM is a deterministic method shown to generally converge quickly, albeit to a local minimum. In our implementation, we use a conjugate gradient method to speed up convergence when minimizing E with respect to x and y. The algorithm we have developed for learning the model parameters {w_{ij}} is essentially an incremental expectation-maximization (EM) algorithm applied to randomly selected subsets (“batches”) of training image patches. The algorithm is incremental in the sense that we take a single step in the parameter space, increasing the log likelihood of the model but not fully maximizing it. We have observed that an incremental M-step often produces better results than a full M-step (equivalent to fully minimizing equation 3.7 with respect to {w_{ij}} for fixed x and y).
We believe this is because performing a full M-step on each batch of data can potentially lead to many shallow local minima, which may be avoided by taking the incremental M-steps. The algorithm can be summarized as follows (see also the pseudocode labeled algorithm 1). First, randomly initialize the bilinear basis W, and the matrix Y containing a set of vectors describing each style (ys ). For each batch of training image patches, estimate the per patch (indexed by i) vectors xi and yi,s for a set of transformations (indexed by s). This corresponds to the E-step in the EM algorithm. In the M-step, take a single gradient step with respect to W using the estimated xi and yi,s values. In order to regularize Y and avoid overfitting to a particular set of patches, we slowly adapt Y over time as follows. For each particular style s, adapt ys toward the mean of all inferred vectors y∗,s corresponding to patches transformed according to style s (line 11 in algorithm 1). Averaging ys across all patches for a particular transformation encourages the style representation to be invariant to particular patch content. Without additional constraints, the algorithm above would not necessarily learn to represent content only in x and transformations only in y. In order to learn style and content separation for transformation invariance,
Algorithm 1: LearnSparseBilinearModel(I, T, l, α, β, , η, ε)

1: W ⇐ RandNormalizedVectors(k, m, n)
2: Y ⇐ RandNormalizedVectors(n, r)
3: for iter ⇐ 1, . . . , maxIter do
4:   P ⇐ SelectPatchLocations(I, q)
5:   Z ⇐ ExtractPatches(I, P)
6:   X ⇐ InferContent(W, Y(:, c), Z, α)
7:   dW ⇐ Zeros(k, m, n)
8:   for s ⇐ 1, . . . , r do
9:     Z ⇐ ExtractPatches(I, Transform(P, T(:, s)))
10:    Y_batch ⇐ InferStyle(W, X, Z, β)
11:    Y(:, s) ⇐ (1 − ) Y(:, s) +  · SampleMean(Y_batch)
12:    y_var ⇐ (1 − ε) y_var + ε · SampleVar(Y_batch)
13:    g_y ⇐ g_y ⊙ (y_var / σ_g²)^γ
14:    Y(:, s) ⇐ NormalizeVectors(Y(:, s), g_y)
15:    dW ⇐ dW + (1/r) · dEdW(Z, W, X, Y(:, s))
16:  end for
17:  W ⇐ W + η dW
18:  x_var ⇐ (1 − ε) x_var + ε · SampleVar(X)
19:  g_x ⇐ g_x ⊙ (x_var / σ_g²)^γ
20:  W ⇐ NormalizeVectors(W, g_x)
21: end for

Algorithm 2: InferStyle(W, X, Z, β)

1: Y ⇐ ConjGradSparseFit(W, X, Z, β)
we estimate x and y in a constrained fashion. We first infer x^i. Finding the initial x vector relies on having an initial y vector. Thus, we refer to one of the transformations as “canonical,” corresponding to the identity transformation. This transformation vector y_c is used for initially bootstrapping the content vector x^i but otherwise is adapted exactly like the rest of the style vectors in the matrix Y. We then use the same x^i to infer all y^{i,s} vectors for each transformed patch z^{i,s}. This ensures that a single content vector x^i codes for the content in the entire set of transformed patches z^{i,s}. Algorithm 1 presents the pseudocode for the learning algorithm. It makes use of algorithms 2 and 3 for inferring style and content, respectively. Table 1 describes the variables used in the learning of the sparse bilinear model from natural images. Capital letters indicate matrices containing column vectors. Individual columns are indexed using Matlab-style “slicing”; for example, Y(:, s) is the sth column of Y, that is, y_s. The symbol ⊙ indicates the element-wise product.
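The ICM-style alternation between InferContent and InferStyle can be sketched as follows. This is an illustrative stand-in, not the paper's implementation: it uses plain gradient steps in place of the conjugate gradient fit (ConjGradSparseFit), with random data and made-up dimensions, and checks that the joint energy of equation 3.7 never increases across the alternation:

```python
import numpy as np

rng = np.random.default_rng(4)
k, m, n = 64, 10, 5
W = 0.1 * rng.standard_normal((k, m, n))
z = rng.standard_normal(k)
alpha, beta, step = 1.0, 0.1, 1e-3

def S_prime(a):
    return 2.0 * a / (1.0 + a**2)

def E_full(x, y):
    """Joint energy of equation 3.7 for a single patch."""
    r = z - np.einsum('kmn,m,n->k', W, x, y)
    return (r @ r + alpha * np.sum(np.log(1 + x**2))
                  + beta * np.sum(np.log(1 + y**2)))

def infer_content(y, x, iters=200):
    """Stand-in for InferContent: gradient steps on x with y fixed."""
    for _ in range(iters):
        r = z - np.einsum('kmn,m,n->k', W, x, y)
        x = x + step * (np.einsum('kan,k,n->a', W, r, y)
                        - 0.5 * alpha * S_prime(x))
    return x

def infer_style(x, y, iters=200):
    """Stand-in for InferStyle: gradient steps on y with x fixed."""
    for _ in range(iters):
        r = z - np.einsum('kmn,m,n->k', W, x, y)
        y = y + step * (np.einsum('kmb,k,m->b', W, r, x)
                        - 0.5 * beta * S_prime(y))
    return y

# ICM: alternate conditional MAP estimates, bootstrapping x with a
# "canonical" style vector; the joint energy never increases
x, y = np.zeros(m), rng.standard_normal(n)
energies = [E_full(x, y)]
for _ in range(5):
    x = infer_content(y, x)
    y = infer_style(x, y)
    energies.append(E_full(x, y))

assert all(b <= a + 1e-9 for a, b in zip(energies, energies[1:]))
```

Because each conditional fit lowers the same joint energy (the prior term of the variable held fixed is constant during that fit), the alternation is monotone, which is the convergence property of ICM noted above.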
Algorithm 3: InferContent(W, Y, Z, α)
1: X ⇐ ConjGradSparseFit(W, Y, Z, α)

Table 1: Variables for Learning and Inference.

Name   Size       Description                   Name   Size     Description
m      scalar     x (content) dim.              n      scalar   y (style) dim.
k      scalar     z (patch) dim.                l      scalar   Number of patches in batch
I      [k′, l′]   Full-sized images             Z      [k, l]   Image patches
α      scalar     x_i prior weight              β      scalar   y_j prior weight
η      scalar     W adaptation rate                    scalar   Y adaptation rate
ε      scalar     y_var adaptation rate         γ      scalar   Variance penalty
σ_g²   scalar     x_i goal variance             c      scalar   Canon. transform idx.
r      scalar     Number of transforms          T      [2, r]   Transform parameters
4 Results

4.1 Training Methodology. We tested the algorithms for bilinear sparse coding on natural image data. The natural images we used are distributed by Olshausen and Field (1997) along with the code for their algorithm. Except where otherwise noted in an individual experiment, the training set of images consisted of 10 × 10 pixel patches randomly extracted from ten 512 × 512 pixel source images. The images are prewhitened to equalize large variances in frequency, helping to speed convergence. To assist convergence, all learning occurs in batch mode, where each batch consists of l = 100 image patches. The effects of varying the number of bilinear basis units m and n were investigated in depth. An obvious setting for m, the dimension of the content vector x, is to use the value corresponding to a complete basis set in the linear model (m equals the number of pixels in the patch). As we will demonstrate, this choice for m yields good results. However, one might imagine that because of the ability to merge representations of features that are equivalent with respect to the transformations, m can be set to a much lower value and still effectively learn the same basis. As we discuss later, this is true only if the features can be transformed independently. An obvious choice for n, the number of dimensions of the style vector y, is simply the number of transformations in the training set. However, we found that the model is able to perform substantial dimensionality reduction, and we present the case where the number of transformations is 49 and a basis of n = 10 is used to represent translated features. Experiments were also performed to determine values for the sparseness parameters α and β. Settings between 25 and 35 for α and between 0 and 5 for β appear to reliably yield localized, oriented features. Ranges are given
here because optimum values seem to depend on n for two reasons. First, as n increases, the number of prior terms in equation 3.8 is significantly less than the mn likelihood terms. Thus, α must be increased to force sparseness on the posterior distribution on x. This intuitive explanation is reinforced by noting that α is approximately a factor of n larger than the values found previously (Olshausen & Field, 1997). Second, if dimensionality reduction is desired (i.e., n is reduced), β must be lowered or set to zero, as the elements of y cannot be coded in an independent fashion. For instance, in the case of dimensionality reduction from 49 to 10 transformation basis vectors, β = 0. The W step size η for gradient descent using equation 3.10 was set to 0.25. Variance normalization used ε = 0.25, γ = 0.05, and σ_g² = 0.1. These parameters are not necessarily the optimum parameters, and the algorithm is robust with respect to significant parameter changes. Generally, only the amount of time required to find a good sparse representation changes with untuned parameters. In the cases presented here, the algorithm converged after approximately 2500 iterations, although to ensure that the representations had converged, we ran several runs for 10,000 iterations. The transformations for most of the experiments were chosen to be two-dimensional translations in the range [−3, 3] pixels along both axes. The experiments measuring transformation invariance (see section 4.4) considered one-dimensional translations in the range [−8, 8] in 12 × 12 patches.

4.2 Bilinear Sparse Coding of Natural Images. Experimental results are analyzed as follows. First, we study the qualitative properties of the learned representation, then look quantitatively at how model parameters affect these and other properties, and finally examine the learned model's invariance properties. Figures 4 and 5 show the results of training the sparse bilinear model on natural image data.
Both show localized, oriented features resembling Gabor filters. Qualitatively, these are similar to the features learned by the model for linear sparse coding. Some features appear to encode more complex structure than is common for linear sparse coding. We offer several possible explanations here. First, we believe that some deformations of a Gabor response are caused by the occlusion introduced by the patch boundaries. This effect is most pronounced when the feature is effectively shifted out of the patch based on a particular translation and its canonical (or starting) location. Second, we believe that closely located and similarly oriented features may sometimes be representable in the same basis feature by using slightly different transformation representations. In turn, this may “free up” a basis dimension for representing a more complex feature. The bilinear method is able to model the same features under different transformations. In this case, horizontal translations in the range [−3, 3] were used for training the model. Figure 4 provides an example of how
Figure 4: Representing natural images and their transformations with a sparse bilinear model. The representation of an example natural image patch and of the same patch translated to the left. Note that the bar plot representing the x vector is indeed sparse, having only three significant coefficients. The code for the translation vectors for both the canonical patch and the translated one is likewise sparse. The wij basis images are shown for those dimensions that have nonzero coefficients for xi or yj .
the model encodes a natural image patch and the same patch after it has been translated. Note that in this case, both the x and y vectors are sparse. Figure 5 displays the transformed features for each translation represented by a learned ys vector. Figure 6 shows how the model can account for a given localized feature at different locations by varying the y vector. As shown in the last column of the figure, the translated local feature is generated by linearly combining a sparse set of basis vectors wij . This figure demonstrates that the bilinear form of the interaction function f (x, y) is sufficient for translating features to different locations. 4.3 Effects of Sparseness on Representation. The free parameters α and β play an important role in deciding how sparse the coefficients in the vectors x and y are. Likewise, the sparseness of the vectors is intertwined with the desired local and independent properties of the wij bilinear basis features. As noted in other research on sparseness (Olshausen & Field, 1996), both the attainable sparseness and independence of features also depend on the model dimensionality—in our case, the parameters m and n. In all of our experiments, we use a complete basis (in which m = k) for content representation, assuming that the translations do not affect the number of basis features needed for representation. We believe this is justified also by the very idea that changes in style should not change the intrinsic content.
Figure 5: Localized, oriented set of learned basis features. The transformed basis features learned by the sparse bilinear model for a set of five horizontal translations y_s. Each block of transformed features w_i^{y_s} is organized with values of i = 1, . . . , m across the rows and values of s = 1, . . . , r down the columns. In this case, m = 100 and r = 5. Note that almost all of the features exhibit localized and oriented properties and are qualitatively similar to Gabor features.
In theory, the style vector y could also use a sparse representation. In the case of affine transformations in the plane, using a complete basis for y means using a large value of n. From a practical perspective, this is undesirable, as it would essentially equate to tiling features at all possible transformations. Thus, in our experiments, we set β to a small or zero value and also perform dimensionality reduction by setting n to a fraction of the number of styles (usually between four and eight). This configuration allows
Bilinear Sparse Coding for Invariant Vision
Figure 6: Translating a learned feature to multiple locations. The two rows of eight images represent the individual basis vectors wij for two values of i. The yj values for two selected transformations for each i are shown as bar plots. y(a, b) denotes a translation of (a, b) pixels in the Cartesian plane. The last column shows the resulting basis vectors after translation.
learning of sparse, independent content features while taking advantage of dimensionality reduction in the coding of transformations. We also analyzed the effects of the sparseness weighting term α. Figure 7 illustrates the effect of varying α on the sparseness of the content representation x and on the log posterior optimization function E (see equation 3.7). The results shown are based on 1000 x vectors inferred by presenting the learned model with a random sample of 1000 natural image patches. For each value of α, we also average over five runs of the learning algorithm. Figure 8 illustrates the effect of the sparseness weighting term α on the kurtosis of x and the basis vectors learned by the algorithm.

4.4 Style and Content Invariance. A series of experiments was conducted to analyze the invariance properties of the sparse bilinear model. These experiments examine how the transformation (y) and content (x) representations change when the input patch is translated in the plane. After learning a sparse bilinear model on a set of translated patches, we select a new test patch z0 and estimate a reference content vector x0 using InferContent(W, yc, z0). We then shift the patch according to transformation i: zi = Ti(z0). Next, we infer the new yi using InferStyle(W, x0, zi) (without using knowledge of the transformation parameter i). Finally, we infer the new content representation xi using a call to the procedure InferContent(W, yi, zi). To quantify the amount of change or variance in a given transformation or content representation, we use the L2 norm of the vector difference between the reestimated vector and the initial vector. To normalize our metric and account for different scales in coefficients, we divide by the norm of the
Figure 7: Effect of sparseness on the optimization function. As the sparseness weighting value is increased, the sparseness term (log P(x)) takes on higher values, increasing the posterior likelihood. Note that the reconstruction likelihood (log P(z|x, y, {wij})) is effectively unchanged, even as the sparseness term is weighted 20 times more heavily, thus illustrating that the sparse code does not cause a loss of representational fidelity.
reference vector:

$$\Delta x_i = \frac{\|x_i - x_0\|}{\|x_0\|}, \qquad \Delta y_i = \frac{\|y_i - y_0\|}{\|y_0\|}. \tag{4.1}$$
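The normalized change metric of equation 4.1 is straightforward to compute. The sketch below uses synthetic vectors as stand-ins for the outputs of the paper's InferContent/InferStyle procedures, which are not reimplemented here.

```python
import numpy as np

def relative_change(v, v0):
    """Normalized change metric of equation 4.1: ||v - v0|| / ||v0||."""
    return np.linalg.norm(v - v0) / np.linalg.norm(v0)

# Synthetic stand-ins: x0 plays the role of the reference content inferred
# from patch z0, xi the content re-inferred after the patch is shifted.
rng = np.random.default_rng(1)
x0 = rng.standard_normal(144)
xi = x0 + 0.01 * rng.standard_normal(144)
delta_x = 100.0 * relative_change(xi, x0)  # percent norm change, as in Figure 11
```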
For testing invariance, the model was trained on 12 × 12 patches and vertical shifts in the range [−8, 8], with m = 144 and n = 15. Figure 9 shows the result of vertically shifting a particular content (image patch at a particular location) and recording the subsequent representational changes. Figure 10 shows the result of selecting random patches (different content vectors x) and translating each in an identical way (same translational offset). Both figures show a strong degree of invariance of representation: the content vector x remains approximately unchanged when subjected to different translations, while the translation vector y remains approximately the same when different content vectors are subjected to the same translation. While Figures 9 and 10 show two sample test sequences, Figures 11 and 12 show the transformation-invariance properties for the average of 100 runs on shifts in the range [−8, 8] in steps of 0.5. For translations of up to three pixels, the relative change in x is less than 2 percent. These
Figure 8: Sparseness prior yields highly kurtotic coefficient distributions. (a) The effect of weighting the sparseness prior for x (via α) on the kurtosis (denoted by k) of the xi coefficient distribution. (b) A subset of the corresponding bilinear basis vectors learned by the algorithm.
Figure 9: Transformation-invariance property of the sparse bilinear model. (a) Randomly selected natural image patch transformed by an arbitrary sequence of vertical translations. (b) Sequence of vertical pixel translations applied to the original patch location. (c) Effects of the transformations on the transformation (y) and patch content (x) representation vectors. Note that the magnitude of the change in y is well correlated with the magnitude of the vertical translation, while the change in x is relatively insignificant (mean Δx = 2.6), thus illustrating the transformation-invariance property of the sparse bilinear model.
results suggest that the sparse bilinear model is able to learn an effective representation of translation invariance with respect to local features. Figure 13 compares the effects of translation in the sparse linear versus sparse bilinear model. We first trained a linear model using the corresponding subset of parameter values used in the bilinear model. The same metric for measuring changes in representation was used by first estimating x0 for a random patch and then reestimating xi on a translated patch. As expected, the bilinear model exhibits a much greater degree of invariance to transformation than the linear model.

4.5 Interpolation for Continuous Translations. Although transformations are learned as discrete style classes, we found that the sparse bilinear model can handle continuous transformations. Linear interpolation in the style space was found to be sufficient for characterizing translations in the continuum between two discrete learned translations. Figure 14a shows
Figure 10: Invariance of style representation to content. (a) Sequence of randomly selected patches (denoted "Can.") and their horizontally shifted versions (denoted "Trans."). (b) Plot of the L2 distance in image space between the two canonical images. (c) The change in the inferred y for the translated version of each patch. Note that the patch content representation fluctuates wildly (as does the distance in image space), while the translation vector changes very little.
the values of the six-dimensional y vector for each learned translation. The filled circles on each line represent the value of the learned translation vector for that dimension. Plus symbols indicate the resulting linearly interpolated translation vectors. Note that generally the values vary smoothly with respect to the translation amount, allowing simple linear interpolation between translation vectors, similar to the method of locally linear embedding (Roweis & Saul, 2000). Figure 14b shows the style vectors for a model trained with only the transformations −4, −2, 0, +2, +4. Figure 14c shows examples of three reconstructed patches for two interpolated style vectors (for translations, −3 and +0.5 pixels). Also shown is the mean squared error (MSE) over all image pixels between each reconstructed patch and the actual translated patch. The MSE values for the interpolated cases are somewhat higher than those for translations in the training set (labeled “learned” in the figure) but within the range of MSE values across all image patches.
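The interpolation described above can be sketched per style dimension. The learned shifts and y vectors below are illustrative stand-ins for a trained model's discrete translations and six-dimensional style vectors, not the paper's learned parameters.

```python
import numpy as np

# Illustrative stand-ins: five learned translations and their style vectors
learned_shifts = np.array([-4.0, -2.0, 0.0, 2.0, 4.0])
rng = np.random.default_rng(2)
Y = rng.standard_normal((5, 6))      # one y vector per learned translation

def interpolate_style(shift, shifts, Y):
    """Linearly interpolate a y vector between learned translations."""
    # np.interp interpolates each dimension of the style space separately
    return np.array([np.interp(shift, shifts, Y[:, j]) for j in range(Y.shape[1])])

y_half = interpolate_style(0.5, learned_shifts, Y)  # subpixel translation
```

At a learned shift the interpolation returns the learned vector exactly; between shifts it returns the convex combination of the two neighboring vectors, matching the smooth per-dimension curves of Figure 14.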
Figure 11: Effects of translation on patch content representation. The average relative change in the content vector x versus translation over 100 experiments in which 12 × 12 image patches were shifted by varying degrees. Note that for translations in the range (−3, +3), the relative change in x is small, yet as more and more features are shifted into the patch, the content representation must change. Error bars represent the standard deviation over the 100 experiments.
5 Discussion

5.1 Related Work. Our work is based on a synthesis of two extensively studied tracks of vision research. The first is transformation-invariant object representations, and the second is the extraction of sparse, independent features from images. The key observation is that the combination of the two tracks of research can be synergistic, effectively making each problem easier to solve. A large body of work exists on transformation invariance in image processing and vision. As discussed by Wiskott (2004), the approaches can be divided into two rough classes: (1) those that explicitly deal with different scales and locations by means of normalizing a perceived image to a canonical view and (2) those that find simple localized features that become transformation invariant by pooling across various scales and regions. The first approach is essentially generative, that is, given a representation of "what" (content) and "where" (style), the model can output an image. The bilinear model is one example of such an approach; others are discussed below. The second approach is discriminative in the sense that the model operates by extracting object information from an image and discards posi-
Figure 12: Effects of changing patch content on translation representation. The average relative change in y for random patch content over 100 experiments in which various transformations were performed on a randomly selected patch. No discernible pattern suggests that some transformations are more sensitive to content than others. Each bar shows the mean relative change for one transformation.
tional and scale information. Generation of an image in the second approach is difficult because the pooling operation is noninvertible and discards the "where" information. Examples of this approach include weight sharing and weight decay models (Hinton, 1987; Földiák, 1991), the neocognitron (Fukushima, Miyake, & Ito, 1983), LeNet (LeCun et al., 1989), and slow feature analysis (Wiskott & Sejnowski, 2002). A key property of the sparse bilinear model is that it preserves information about style and explicitly uses this information to achieve invariance. It is thus similar to the shifting and routing circuit models of Anderson and Van Essen (1987) and Olshausen, Anderson, and Van Essen (1995). Both the bilinear model and the routing circuit model retain separate pathways containing "what" and "where" information. There is also a striking similarity between the way routing nodes in the routing circuit model select for scale and shift to mediate routing and the multiplicative interaction of style and content in the bilinear model. The y vector in the bilinear model functions in the same role as the routing nodes in the routing circuit model. One important difference is that while the parameters and structure of the routing circuit are selected a priori for a specific transformation, our focus is on learning
Figure 13: Effects of transformations on the content vector in the bilinear model versus the linear model. The solid line shows, for a linear model, the relative content representation change due to the input patch undergoing various translations. The points on the line represent the average of 100 image presentations per translation, and the error bars indicate the standard deviation. For reference, the results from the bilinear model (shown in detail in Figure 11) are plotted with a dashed line. This shows the high degree of transformation invariance in the bilinear model in comparison to the linear model, whose representation changes steeply with small translations of the input.
arbitrary transformations directly from natural images. The bilinear model is similarly related to the Lie group–based model for invariance suggested in Rao and Ruderman (1999), which also uses multiplicative interactions but in a more constrained way by invoking a Taylor-series expansion of a transformed image. Our emphasis on modeling both “what” and “where” rather than just focusing on invariant recognition (“what”) is motivated by the belief that preserving the “where” information is important. This information is critical not simply for recognition but for acting on visual information, in both biological and robotic settings. The bilinear model addresses this issue in a very explicit way by directly modeling the interaction between “what” and “where” processes, similar in spirit to the “what” and “where” dichotomy seen in biological vision (Ungerleider & Mishkin, 1982). The second major component of prior work on which our model is based is that of representing natural images through the use of sparse or statisti-
Figure 14: Modeling continuous transformations by interpolating in translation space. (a) Each line represents a dimension of the transformation space. The values for each dimension at each learned translation are denoted by circles, and interpolated subpixel coefficient values are denoted by plus marks. Note the smooth transitions between learned translation values, which allow interpolation. (b) Transformation space values as in a when learned on coarser translations (steps of 2 pixels). (c) Two examples of interpolated patches based on the training in b. See section 4.5 for details.
cally independent features. As discussed in section 1, our work is strongly based on the work of Olshausen and Field (1997), Bell and Sejnowski (1997), and Hinton and Ghahramani (1997) in forming local, sparse distributed representations directly from images. The sparse and independent nature of the learned features and their locality enables a simple, essentially linear model (given fixed content) to efficiently represent transformations of that content. Given global eigenvectors such as those resulting from principal component analysis, this would be more difficult to achieve. A second benefit of combining transformation-invariant representation with sparseness is that the multiplicity of the same learned features at different locations and transformations can be reduced by explicitly learning transformations of a given feature. Finally, our use of a lower-dimensional representation based on the bilinear basis to represent inputs and to interpolate style and content coefficients between known points in the space has some similarities to the method of locally linear embedding (Roweis & Saul, 2000).

5.2 Extension to a Parts-Based Model of Object Representation. The bilinear generative model in equation 2.2 uses the same set of transformation values yj for all the features i = 1, . . . , m. Such a model is appropriate for global transformations that apply to an entire image region, such as a shift of p pixels for an image patch or a global illumination change. Consider the problem of representing an object in terms of its constituent parts (Lee & Seung, 1999). In this case, we would like to be able to transform each part independently of the other parts in order to account for the location, orientation, and size of each part in the object image. The standard bilinear model can be extended to address this need as follows:

$$z = \sum_{i=1}^{m} \sum_{j=1}^{n} w_{ij}\, y^{i}_{j}\, x_i. \tag{5.1}$$
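A minimal numerical sketch of equation 5.1, with illustrative dimensions. It also checks that tying all the per-feature style vectors together recovers the standard model of equation 2.2.

```python
import numpy as np

# Extended model (equation 5.1): each content feature i carries its own
# style vector y^i, so parts can transform independently of one another.
k, m, n = 144, 10, 6                 # illustrative dimensions
rng = np.random.default_rng(3)
W = rng.standard_normal((k, m, n))
x = rng.standard_normal(m)
Yi = rng.standard_normal((m, n))     # row i is the style vector for feature i

def reconstruct_parts(W, x, Yi):
    """z = sum_i sum_j w_ij * y_j^i * x_i (equation 5.1)."""
    return np.einsum('kmn,mn,m->k', W, Yi, x)

# The standard model (equation 2.2) is the special case y^i = y for all i:
y = rng.standard_normal(n)
z_tied = reconstruct_parts(W, x, np.tile(y, (m, 1)))
z_standard = np.einsum('kmn,m,n->k', W, x, y)
assert np.allclose(z_tied, z_standard)
```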
Note that each object feature i now has its own set of transformation values $y^i_j$. The double summation is thus no longer symmetric. Also, note that the standard model (see equation 2.2) is a special case of equation 5.1 in which $y^i_j = y_j$ for all i. We have tested the feasibility of equation 5.1 using a set of object features learned for the standard bilinear model. Preliminary results (Grimes & Rao, 2003) suggest that allowing independent transformations for the different features provides a rich substrate for modeling images and objects in terms of a set of local features (or parts) and their individual transformations relative to an object-centered reference frame. Additionally, we believe that the number of basis features needed to represent natural images could be greatly reduced in the case of independently transformed features. A single "template" feature could be learned for a particular orientation and then
translated to represent a localized, oriented image feature.

6 Summary and Conclusion

A fundamental problem in vision is to simultaneously recognize objects and their transformations (Anderson & Van Essen, 1987; Olshausen et al., 1995; Rao & Ballard, 1998; Rao & Ruderman, 1999; Tenenbaum & Freeman, 2000). Bilinear generative models provide a tractable way of addressing this problem by factoring an image into object features and transformations using a bilinear function. Previous approaches used unconstrained bilinear models and produced global basis vectors for image representation (Tenenbaum & Freeman, 2000). In contrast, recent research on image coding has stressed the importance of localized, independent features derived from metrics that emphasize the higher-order statistics of inputs (Olshausen & Field, 1996, 1997; Bell & Sejnowski, 1997; Lewicki & Sejnowski, 2000). This paper introduces a new probabilistic framework for learning bilinear generative models based on the idea of sparse coding. Our results demonstrate that bilinear sparse coding of natural images produces localized, oriented basis vectors that can simultaneously represent features in an image and their transformation. We showed how the learned generative model can be used to translate a basis vector to different locations, thereby reducing the need to learn the same basis vector at multiple locations, as in traditional sparse coding methods. We also demonstrated that the learned representations for transformations vary smoothly and allow simple linear interpolation to be used for modeling transformations that lie in the continuum between training values. Finally, we showed that the object representation (the "content") in the sparse bilinear model remains invariant to image translations, thus providing a basis for invariant vision. Our current efforts are focused on exploring the application of the model to learning other types of transformations, such as rotations, scaling, and view changes.
We are also investigating a framework for learning parts-based object representations that builds on the bilinear sparse coding model presented in this article.

Acknowledgments

This research is supported by NSF Career grant 133592, ONR YIP grant N00014-03-1-0457, and fellowships to R.P.N.R. from the Packard and Sloan foundations.

References

Anderson, C. H., & Van Essen, D. C. (1987). Shifter circuits: A computational strategy for dynamic aspects of visual processing. Proceedings of the National Academy of Sciences, 84, 1148–1167.
Attneave, F. (1954). Some informational aspects of visual perception. Psychological Review, 61(3), 183–193.
Barlow, H. B. (1961). Possible principles underlying the transformation of sensory messages. In W. A. Rosenblith (Ed.), Sensory communication (pp. 217–234). Cambridge, MA: MIT Press.
Bell, A. J., & Sejnowski, T. J. (1997). The "independent components" of natural scenes are edge filters. Vision Research, 37(23), 3327–3338.
Besag, J. (1986). On the statistical analysis of dirty pictures. Journal of the Royal Statistical Society B, 48, 259–302.
Földiák, P. (1991). Learning invariance from transformation sequences. Neural Computation, 3(2), 194–200.
Földiák, P., & Young, M. P. (1995). Sparse coding in the primate cortex. In M. A. Arbib (Ed.), The handbook of brain theory and neural networks (pp. 895–898). Cambridge, MA: MIT Press.
Fukushima, K., Miyake, S., & Ito, T. (1983). Neocognitron: A neural network model for a mechanism of visual pattern recognition. IEEE Transactions on Systems, Man, and Cybernetics, 13(5), 826–834.
Grimes, D. B., & Rao, R. P. N. (2003). A bilinear model for sparse coding. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15. Cambridge, MA: MIT Press.
Hinton, G. E. (1987). Learning translation invariant recognition in a massively parallel network. In G. Goos & J. Hartmanis (Eds.), PARLE: Parallel architectures and languages Europe (pp. 1–13). Berlin: Springer-Verlag.
Hinton, G. E., & Ghahramani, Z. (1997). Generative models for discovering sparse distributed representations. Philosophical Transactions of the Royal Society B, 352, 1177–1190.
LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., & Jackel, L. D. (1989). Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4), 541–551.
Lee, D. D., & Seung, H. S. (1999). Learning the parts of objects by non-negative matrix factorization. Nature, 401, 788–791.
Lewicki, M. S., & Sejnowski, T. J. (2000). Learning overcomplete representations. Neural Computation, 12(2), 337–365.
Olshausen, B., Anderson, C., & Van Essen, D. (1995). A multiscale routing circuit for forming size- and position-invariant object representations. Journal of Computational Neuroscience, 2, 45–62.
Olshausen, B. A., & Field, D. J. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381, 607–609.
Olshausen, B. A., & Field, D. J. (1997). Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37, 3311–3325.
Rao, R. P. N., & Ballard, D. H. (1998). Development of localized oriented receptive fields by learning a translation-invariant code for natural images. Network: Computation in Neural Systems, 9(2), 219–234.
Rao, R. P. N., & Ballard, D. H. (1999). Predictive coding in the visual cortex: A functional interpretation of some extra-classical receptive field effects. Nature Neuroscience, 2(1), 79–87.
Rao, R. P. N., & Ruderman, D. L. (1999). Learning Lie groups for invariant visual perception. In M. S. Kearns, S. Solla, & D. Cohn (Eds.), Advances in neural information processing systems, 11 (pp. 810–816). Cambridge, MA: MIT Press. Roweis, S., & Saul, L. (2000). Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500), 2323–2326. Schwartz, O., & Simoncelli, E. P. (2001). Natural signal statistics and sensory gain control. Nature Neuroscience, 4(8), 819–825. Tenenbaum, J. B., & Freeman, W. T. (2000). Separating style and content with bilinear models. Neural Computation, 12(6), 1247–1283. Ungerleider, L., & Mishkin, M. (1982). Two cortical visual systems. In D. Ingle, M. Goodale, & R. Mansfield (Eds.), Analysis of visual behavior (pp. 549–585). Cambridge, MA: MIT Press. Wiskott, L. (2004). How does our visual system achieve shift and size invariance? In J. L. van Hemmen & T. J. Sejnowski (Eds.), Problems in systems neuroscience. New York: Oxford University Press. Wiskott, L., & Sejnowski, T. (2002). Slow feature analysis: Unsupervised learning of invariances. Neural Computation, 14(4), 715–770. Received November 4, 2003; accepted June 4, 2004.
LETTER
Communicated by Jaap van Pelt
A Probabilistic Framework for Region-Specific Remodeling of Dendrites in Three-Dimensional Neuronal Reconstructions

Rishikesh Narayanan
[email protected]
Anusha Narayan
[email protected]
Sumantra Chattarji
[email protected]
National Centre for Biological Sciences, Bangalore 560065, India
Dendritic arborization is an important determinant of single-neuron function as well as the circuitry among neurons. Dendritic trees undergo remodeling during development, aging, and many pathological conditions, with many of the morphological changes being confined to certain regions of the dendritic tree. In order to analyze the functional consequences of such region-specific dendritic remodeling, it is essential to develop techniques that can systematically manipulate three-dimensional reconstructions of neurons. Hence, in this study, we develop an algorithm that uses statistics from precise morphometric analyses to systematically remodel neuronal reconstructions. We use the distribution function of the ratio of two normally distributed random variables to specify the probabilities of remodeling along various regions of the dendritic arborization. We then use these probabilities to drive an iterative algorithm for manipulating the dendritic tree in a region-specific manner. As a test, we apply this framework to a well-characterized example of dendritic remodeling: stress-induced dendritic atrophy in hippocampal CA3 pyramidal cells. We show that our pruning algorithm is capable of eliciting atrophy that matches biological data from rodent models of chronic stress.

Neural Computation 17, 75–96 (2005)

1 Introduction

Dendrites are the primary sites for receiving synaptic inputs from other neurons. The structure and biophysical properties of the dendritic arbor critically modulate synaptic integration. The discovery of numerous voltage-gated ion channels in dendrites has increased the importance of the role of the dendritic arbor in neuronal function (Stuart, Spruston, & Häusser, 1999; Magee, Hoffman, Colbert, & Johnston, 1998; Johnston, Magee, Colbert, & Christie, 1996; Migliore & Shepherd, 2002). Theoretical and experimental studies point to a significant role for dendritic morphology in modulating neuronal firing patterns (Mainen & Sejnowski, 1996;
© 2004 Massachusetts Institute of Technology
Krichmar, Nasuto, Scorcioni, Washington, & Ascoli, 2002; van Ooyen, Duijnhouwer, Remme, & van Pelt, 2002), calcium dynamics (Regehr & Tank, 1994), propagation of action potentials (Vetter, Roth, & Häusser, 2001), axonal competition (van Ooyen, Willshaw, & Ramakers, 2000), and other important biophysical properties of neurons (Segev & London, 2000; Häusser & Mel, 2003). The functional importance of dendritic morphology is also reflected in its strict regulation during development and in the adult brain (Stuart et al., 1999). During development, dendrites undergo dramatic changes in arborization in order to form appropriate synaptic contacts with other neurons (Cline, 2001; Wong & Ghosh, 2002). Structural plasticity in dendrites also plays a prominent role in changes elicited by aging (Duan et al., 2003; Pyapali & Turner, 1996), hibernation (Popov, Bocharova, & Bragin, 1992), Alzheimer's disease (Anderton et al., 1998; Brizzee, 1987; Geula, 1998), temporal lobe epilepsy (Bothwell et al., 2001), brain injury and lesions (Jones & Schallert, 1994; Kevyani & Schallert, 2002), syndromes related to mental retardation (Kaufmann & Moser, 2000; Ramakers, 2002), retinal degeneration (Marc, Jones, Watt, & Strettoi, 2003), and chronic stress (McEwen, 1999; Vyas, Mitra, Rao, & Chattarji, 2002). Detailed morphometric analyses of dendritic plasticity indicate that such remodeling is often confined to specific regions of the dendrites. For instance, stress-induced dendritic atrophy in hippocampal CA3 pyramidal cells is restricted largely to the stratum radiatum, with other regions undergoing little or no atrophy (McEwen, 1999; Vyas et al., 2002). The possibility that the functional correlates of such localized structural changes also remain localized offers exciting roles for dendritic computation and signal processing.
Further, such localized dendritic remodeling promises to provide interesting and important insights into neuronal function for a number of reasons (Johnston & Amaral, 1997; Migliore & Shepherd, 2002; Segev & London, 2000; Stuart et al., 1999):

• Inputs to different regions along the dendritic tree arrive from different brain areas (see Table 1).
• Inputs to different regions along the dendritic tree have different synaptic properties (see Table 1).
• Dendritic ion channels are not uniformly distributed (e.g., the A-type transient potassium channel, the cation-nonspecific hyperpolarization-activated channel Ih, and the calcium-dependent channels in CA1 pyramidal cells). Consequently, dendritic remodeling of similar magnitude could have different effects on various neuronal conductances, depending on the region of remodeling.
• The effects of excitatory postsynaptic potentials (EPSPs) originating from a remote dendritic location on voltage changes at the soma depend on the distance of that dendritic region from the soma.
Table 1: Region Specificity of Inputs to the Dendritic Tree of a CA3 Pyramidal Neuron.

Dendritic Region                   Distance from Soma (µm)   Receives Inputs from                      Receptors
Stratum oriens (c,d)               0–400                     Commissural/associational, interneurons   NMDA, AMPA, GABAA
Stratum pyramidale (c,d)           Soma                      Interneurons                              GABAA
Stratum lucidum (b)                0–100                     Dentate gyrus, interneurons               Kainate, AMPA, GABAA
Stratum radiatum (c,d)             100–350                   Commissural/associational, interneurons   NMDA, AMPA, GABAA, GABAB
Stratum lacunosum moleculare (a)   350–550                   Entorhinal cortex, interneurons           NMDA, AMPA, GABAA, GABAB

Notes: Stratum oriens forms the basal side of the tree. All other strata, except for pyramidale, form the apical side. (a) Berzhanskaya, Urban, and Barrionuevo (1998). (b) Cossart et al. (2002). (c) Johnston and Amaral (1997). (d) Traub, Jefferys, and Whittington (1999). AMPA = α-amino-3-hydroxy-5-methyl-4-isoxazole propionate. GABA = gamma-aminobutyric acid. NMDA = N-methyl-D-aspartate.
• Remodeling of different dendritic regions leads to different effects on synaptic integration, which also depends critically on the distance from the cell body and on the distribution of dendritic ion channels.
• Dendrites are capable of eliciting localized changes in excitability (Frick, Magee, & Johnston, 2004) and spatial integration (Wang, Xu, Wu, Duan, & Poo, 2003) by modifying dendritic ion channels. The expression of such localized intrinsic mechanisms can be modulated by localized dendritic remodeling.

These issues highlight the need for a formalism that allows us to systematically manipulate dendritic morphology in three-dimensional reconstructions of neurons. Hence, the goal of this study is to develop an algorithm that would systematically manipulate neuronal reconstructions to match biological data on the modulation of dendritic architecture. To this end, we use experimental data from a well-established model of structural plasticity of dendrites in the hippocampus, dendritic atrophy in CA3 pyramidal neurons induced by chronic or repeated stress, to specify the extent and location of dendritic remodeling.

2 Methodological Overview

The flow diagram in Figure 1 provides an overview of the framework used to develop and implement the analysis presented here. Previous studies (Vyas et al., 2002) from our laboratory, using Sholl's analysis of Golgi-impregnated CA3b pyramidal neurons in the rat hippocampus, have demonstrated that
[Figure 1 flow diagram: 2D Sholl's analysis of CA3b pyramidal neurons from control and CIS animals (data from Vyas et al., 2002) yields shell-wise DL and BP atrophy statistics; combined with a 3D Sholl's analysis of DSArchive CA3b reconstructions (section 5), these give shell-wise DL and BP pruning specifications (section 6) and pruning probabilities (section 7), which drive the pruning algorithm (section 8) to produce pruned CA3b pyramidal neurons.]
Figure 1: Overview of the methodological framework used in the study.
chronic immobilization stress (2 hours a day for 10 days) elicits specific patterns of region-specific atrophy (reduction in dendritic length, DL) and debranching (reduction in the number of branch points, BP). Thus, our experimental data (see the dotted box in Figure 1) provide region-wise statistics, for both control and stress-treated neurons, on atrophy and debranching in each concentric shell that constitutes the basic unit of two-dimensional morphometric analysis in Sholl’s method (see Figure 2). These provide us with the reductions in DL and BP observed in neurons from stress-treated animals with respect to control animals. The goal of the proposed algorithm is to enforce these experimentally observed reductions on three-dimensional (3D) neuronal reconstructions in a region-specific manner. We use digital reconstructions of CA3b pyramidal neurons from the Duke-Southampton Archive (DSArchive) as inputs to our algorithm (Cannon, Turner, Pyapali, & Wheal, 1998). There are two problems
Region-Specific Remodeling of Dendrites
Figure 2: Illustration of Sholl's analysis and notations used. Two-dimensional projection of a 3D neuronal reconstruction overlaid with concentric shells radiating out in steps of 50 µm from the center of gravity (CoG) of the soma. The number within each shell corresponds to n; the apical and basal sides of the dendritic tree are differentiated by k = A and k = B, respectively. In this example, S_AD(5) corresponds to the dendritic length in shell number 5 along the apical side of the tree and is represented by the shaded region. S_AB(6) corresponds to the number of BPs in shell number 6 along the apical side of the tree and is equal to 2 in this case (the locations of the two BPs are indicated by arrows).
that we face in specifying the actual reductions in each concentric shell used in Sholl’s analysis. First, there are considerable differences in values of DL and BP for neurons from the experimental data (two-dimensional, 2D) and DSArchive (3D). Second, each dendritic tree, for a given neuron from either database, has its own intershell variations in actual values for DL and BP. We overcome these problems by implementing an algorithm that uses the experimentally observed ratios of reduction in DL and BP as the key parameter. This ratio in turn is enforced by the algorithm on 3D neuronal reconstructions from the DSArchive. In implementing this, in section 5, we first subject digital reconstructions of CA3b pyramidal neurons, from the DSArchive, to 3D Sholl’s analysis to obtain the DLs and BPs at various distances. Next, we combine this output with experimental statistics to arrive at the pruning specifications for this neuron. We accomplish this in section 6 by arriving at DL and BP pruning specifications at various distances by sampling the ratio distribution of corresponding random variables, which are specified by experimental statistics. These specifications are then used to set the probabilities of pruning along various distances (see section 7). Finally, an iterative algorithm (see section 8) employs these probabilities to subject the 3D reconstruction to levels of dendritic atrophy and debranching that are the same as those observed in our experiments (Vyas et al., 2002). We present the results of the proposed algorithm in section 9 and discuss the implications of the study in section 10.
3 Notations

As a first step in our algorithm, we divide the dendritic tree into concentric spherical shells in order to perform Sholl's analysis on the neuron to be pruned (see section 5). Figure 2 illustrates this diagrammatically, along with providing a reference to the notations. N_k, with k ∈ {A, B}, represents the number of spherical shells along the apical (k = A) and basal (k = B) sides of the dendritic tree. For the neuron shown in Figure 2, N_A = 8 and N_B = 4. s_kl(n) and c_kl(n) represent discrete random processes defining the BP and DL statistics in stress-treated and control animals, respectively. k, as above, denotes the class of the dendrite (apical or basal), whereas l represents whether the random process corresponds to BPs (l = B) or to DL (l = D), that is, l ∈ {B, D}, n = 0, 1, . . . , N_k − 1. For instance, s_AD(5) represents the random variable defining the DL in the fifth shell (250–300 µm) along the apical branches in neurons from stress-treated animals (see Figure 2). µ^s_kl(n) and µ^c_kl(n) represent the mean values of the processes s_kl(n) and c_kl(n), respectively, and σ^s_kl(n) and σ^c_kl(n) represent their respective standard deviations. These are obtained directly from the DL and BP statistics of stress-treated and control animals (Vyas et al., 2002).
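As a minimal illustration, the indexed quantities above can be mirrored in a small keyed table; this representation and all numerical values (except S_AB(6) = 2, taken from Figure 2) are our own, hypothetical choices:

```python
# Hypothetical sketch of the notation of section 3 as a plain data structure.
# Keys are (k, l, n): k in {"A", "B"} (apical/basal), l in {"B", "D"}
# (branch points / dendritic length), n the shell index (0 .. N_k - 1).

N = {"A": 8, "B": 4}  # number of shells per side, as in Figure 2

# S[(k, l, n)]: Sholl output of the reconstruction to be pruned.
S = {("A", "D", 5): 312.0,  # S_AD(5): DL in apical shell 5 (illustrative value)
     ("A", "B", 6): 2}      # S_AB(6) = 2 branch points, as in Figure 2

# zeta[(k, l, n)]: pruning specifications; phi[(k, l, n)]: pruning done so far.
zeta = {key: 0.0 for key in S}
phi = {key: 0.0 for key in S}

# After pruning, Sholl's analysis should yield S_kl(n) - zeta_kl(n) for all k, l, n.
target = {key: S[key] - zeta[key] for key in S}
```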
S_kl(n) represents the DL and the number of BPs, obtained by subjecting a given 3D neuronal reconstruction to Sholl's analysis (see below), with l = D and l = B, respectively (see Figure 2 for an illustration). As above, k ∈ {A, B} represents the side of the shell (apical or basal), and n = 0, 1, . . . , N_k − 1 gives the shell number. ζ_kl(n), n = 0, 1, . . . , N_k − 1, k ∈ {A, B}, and l ∈ {B, D} are the pruning specifications, giving the amount of DL or the number of BPs to be removed from each given shell on the apical and basal sides of the neuron. Explicitly, running Sholl's analysis at the end of the pruning process should yield S_kl(n) − ζ_kl(n) ∀ k, l, n. The probabilities for pruning DL and BP are denoted by ρ_kl(n), k ∈ {A, B}, l ∈ {B, D}, n = 1, . . . , N_k − 1, which will be derived from the pruning specifications (see section 7). As the algorithm is an iterative one, it reduces 1 µm of dendritic length or one branching point at a time. In order to update the specifications for and probabilities of pruning at each iteration, it is necessary to keep track of the current levels of pruning, referred to as φ_kl(n), k ∈ {A, B}, l ∈ {B, D}, n = 0, 1, . . . , N_k − 1.

4 Assumptions

1. Radial isotropy is maintained in stress-induced dendritic atrophy; that is, the statistics of atrophy observed in 2D morphometric data extend to the 3D case.

2. The random processes s_kl and c_kl, corresponding to stress and control statistics, respectively, are independent.

3. The random variables s_kl(n), n = 0, 1, . . . , N_k − 1 are independent, as are c_kl(n), n = 0, 1, . . . , N_k − 1. This assumption, given the nature of dendritic arborization, does not hold. For instance, if the number of points in shell N_k − 2 is zero, then the number of points in shell N_k − 1 has to be zero as well. However, the impact of this assumption is reduced because the pruning process takes the connectivity of the dendrites into account.

4. The random variables s_kl(n) and c_kl(n), n = 0, 1, . . . , N_k − 1 conform to a normal distribution. The normal distribution with mean µ and standard deviation σ is denoted by N(x; µ, σ):

N(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right).  (4.1)
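Draws from the distribution in equation 4.1, needed later by the specification algorithm (section 6, step 1), can be generated with the Polar method (Knuth, 1997); a minimal textbook sketch (the function name and interface are ours):

```python
import math
import random

def polar_gauss(mu, sigma, rng=random):
    """Draw one sample from N(x; mu, sigma) of equation 4.1 using the
    Polar (Marsaglia) method: rejection-sample a point in the unit disk,
    then transform it into a normal deviate."""
    while True:
        u = 2.0 * rng.random() - 1.0
        v = 2.0 * rng.random() - 1.0
        s = u * u + v * v
        if 0.0 < s < 1.0:
            factor = math.sqrt(-2.0 * math.log(s) / s)
            # v * factor would yield a second, independent deviate
            return mu + sigma * u * factor
```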
5 Three-Dimensional Sholl's Analysis

The first step in reproducing dendritic remodeling of model neurons similar to experimentally observed atrophy involves performing Sholl's analysis (Sholl, 1953) on digitally reconstructed neurons obtained from the
DSArchive. Neuronal morphology in the DSArchive is stored in the SWC format (Cannon et al., 1998), which consists of seven-field data lines, each defining a neuronal compartment. Each SWC data line has information about the 3D coordinates of the compartment, its radius, its parent in the dendritic tree, and a type code specifying the location of the compartment (in soma, axon, basal dendrite, or apical dendrite). The 3D coordinates of the various compartments (taken from the corresponding SWC data lines) are used to reconstruct the neuron by connecting straight lines between them. Once this is done, we have access to the exact coordinates of all points along the soma and the dendritic tree. We then perform 3D Sholl's analysis on these neurons as follows (see Figure 2):

1. Calculate the center of gravity (CoG) of the soma. Calculate the mean of the 3D coordinates of the points associated with the soma and assign that as the center of gravity of the soma.

2. Assign shells. Construct a set of concentric spheres radiating out in steps of 50 µm from the CoG of the soma. The annular regions between these spheres form the shells, which form the basic unit of morphometric analysis using Sholl's analysis (see Figure 2). In keeping with the 3D nature of our analysis (see assumption 1, above), we employ spherical shells instead of the circular shells used in the traditional 2D version of Sholl's analysis (Sholl, 1953; Vyas et al., 2002).

3. Measure DL. Measure the length of dendrites within a given shell, n, by superimposing the dendritic tree on this set of concentric spheres. Assign this as S_kD(n).

4. Count BPs. Count the number of points in a given shell, n, having more than one child, and assign that as S_kB(n).

In both of the above cases, the counts for the apical (k = A) and basal (k = B) sides are done separately.
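The four steps above can be sketched as follows; the segment representation, the midpoint-based shell assignment, and the toy data are our own simplifications, not the paper's implementation:

```python
import math

# Sketch of the 3D Sholl's analysis of section 5, assuming the reconstruction
# has already been parsed from SWC into straight-line segments.

SHELL = 50.0  # shell thickness in µm, as in the paper

def center_of_gravity(soma_points):
    """Step 1: mean of the 3D coordinates of the soma points."""
    n = len(soma_points)
    return tuple(sum(p[i] for p in soma_points) / n for i in range(3))

def shell_index(point, cog):
    """Step 2: which concentric spherical shell a point falls in."""
    return int(math.dist(point, cog) // SHELL)

def sholl(segments, branch_points, soma_points, n_shells):
    """Steps 3-4: dendritic length (DL) and branch-point (BP) counts per shell.
    Each segment is a ((x,y,z), (x,y,z)) pair; its length is credited to the
    shell of its midpoint (a simplification: long segments would need to be
    split at shell boundaries)."""
    cog = center_of_gravity(soma_points)
    dl = [0.0] * n_shells
    bp = [0] * n_shells
    for a, b in segments:
        mid = tuple((a[i] + b[i]) / 2 for i in range(3))
        dl[shell_index(mid, cog)] += math.dist(a, b)
    for p in branch_points:
        bp[shell_index(p, cog)] += 1
    return dl, bp

# Toy example: soma at the origin, one 40 µm segment in shell 0,
# and a branch point at 60 µm (shell 1).
soma = [(0.0, 0.0, 0.0)]
segs = [((0.0, 0.0, 0.0), (40.0, 0.0, 0.0))]
bps = [(60.0, 0.0, 0.0)]
dl, bp = sholl(segs, bps, soma, n_shells=2)
```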
6 Specification of the Pruning Schedule

The goal of the specification process is to ensure that the loss of dendritic arborization in model neurons reflects experimental data on stress-induced dendritic atrophy, where apical dendrites undergo greater atrophy than their basal counterparts and each shell along both apical and basal dendrites undergoes atrophy depending on its distance from the soma. For example, it has previously been reported (Watanabe, Gould, & McEwen, 1992; McEwen, 1999; Vyas et al., 2002) that CA3 apical dendrites in the stratum radiatum layer undergo maximal atrophy, which is in sharp contrast to little or no atrophy of basal dendrites in the stratum oriens layer. The specification process involves setting the portion of DL to be removed from S_kD and the number of BPs to be removed from S_kB in each of the concentric shells used in Sholl's analysis. These reductions, denoted
as ζ_kl(n), n = 0, 1, . . . , N_k − 1, k ∈ {A, B}, l ∈ {B, D}, are set as samples of the ratio distribution (Marsaglia, 1965) involving the DL and the BP statistics of stressed (s_AD(n)) and control (c_AD(n)) animals (Vyas et al., 2002). For instance, to specify the amount of DL to be pruned in shell n on the apical side, we take a sample z of the distribution of s_AD(n)/c_AD(n) (see equation 6.3) and set ζ_AD(n) = (1 − z)S_AD(n). The following algorithm elaborates on this specification procedure:¹

Algorithm SetSpecifications
Inputs: µ^s, µ^c, σ^s, σ^c
Output: ζ

1. Obtain random numbers ζ^s and ζ^c corresponding to the (independent) distributions N(x; µ^s, σ^s) and N(x; µ^c, σ^c), respectively, using the Polar method (Knuth, 1997). Find the ratio ζ^s/ζ^c.

2. The ratio r = ζ^s/ζ^c is an instance of the random variable

\frac{\mu^s + \sigma^s x}{\mu^c + \sigma^c y},  (6.1)

where x and y are independent standard normal distributed random variables. Map r to the representation of Marsaglia's ratio distribution (Marsaglia, 1965) as follows:

r = k\,\frac{a + x}{b + y}; \quad k = \frac{\sigma^s}{\sigma^c}, \; a = \frac{\mu^s}{\sigma^s}, \; b = \frac{\mu^c}{\sigma^c}.  (6.2)

The probability density function of r/k = (a + x)/(b + y) is given as (Marsaglia, 1965)

f(t) = \frac{\exp(-0.5(a^2 + b^2))}{\pi(1 + t^2)} \left[ 1 + \frac{q}{g(q)} \int_0^q g(z)\,dz \right], \quad q = \frac{at + b}{\sqrt{1 + t^2}},  (6.3)

where g(z) is the density function of the standard normal deviate. As ζ^s/ζ^c is an instance of r, to obtain the corresponding instance of r/k = (a + x)/(b + y), scale ζ^s/ζ^c by 1/k and set t̂ as

\hat{t} = \frac{1}{k}\,\frac{\zeta^s}{\zeta^c} = \frac{\sigma^c \zeta^s}{\sigma^s \zeta^c}.  (6.4)

¹ For brevity, a general method to generate ζ, given S, µ^s, µ^c, σ^s, and σ^c, is presented. This holds for all ζ_kl(n) given S_kl(n), µ^s_kl(n), µ^c_kl(n), σ^s_kl(n), and σ^c_kl(n); k ∈ {A, B}, l ∈ {B, D}, n = 0, 1, . . . , N_k − 1.
3. Find f(t̂) from equations 6.3 and 6.4.

4. If (0 ≤ ζ^s/ζ^c ≤ 1 and f(t̂) ≥ 0.75 · max_t f(t)), accept ζ^s/ζ^c as a representative sample of r and return ζ = S(1 − ζ^s/ζ^c) as the amount to be pruned; else go to step 1.

Step 4 of the above algorithm ensures the following:

• The sample ζ^s/ζ^c lies within the [0, 1] range, such that there is no negative pruning and there is no specification greater than 100% of the actual BP or DL.

• f(t̂) is always greater than 75% of the maximum value of f(t). This is to make sure that the pruning specification stays close to the mode of the ratio distribution of the original data.

The problem of matching DLs arises when the algorithm is applied to neurons obtained from the DSArchive. Experimental data provide pruning statistics for apical dendrites up to 400 µm and for basal dendrites up to 300 µm from the soma (cf. Figure 1 of Vyas et al., 2002). However, CA3b pyramidal neurons from the DSArchive have dendrites extending beyond these experimentally reported values. In order to circumvent this problem, statistics for shells at distances less than or equal to these values are set with statistics from Vyas et al. (2002). The statistics of the last shell (shell 8 on the apical side and shell 6 on the basal side) are replicated for shells at distances greater than these values.

7 Specification of Pruning Probabilities

As outlined in Figure 1, once the pruning specifications are set, we need to set the probabilities of pruning for DL or BP in any given shell based on the specifications obtained from the SetSpecifications algorithm. Intuitively, if the specification mentions that a given shell has to undergo higher pruning with respect to the other shells, then the probability of pruning for that shell has to be higher. Because the pruning algorithm we employ is an iterative one, within a given shell, we also take into account the current pruning values to specify the probabilities of subsequent pruning.

7.1 DL Pruning Probabilities. These are computed by normalizing the current pruning specifications such that the sum of the probabilities is one. Specifically, the DL pruning probabilities are set as follows:

\rho_{kD}(n) = \frac{\zeta_{kD}(n) - \phi_{kD}(n)}{\sum_{k,n} (\zeta_{kD}(n) - \phi_{kD}(n))}, \quad n = 1, . . . , N_k - 1.  (7.1)
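The sampling loop of SetSpecifications can be sketched as follows; the function names, the error-function evaluation of the integral in equation 6.3, and the grid search for max f(t) are our own choices, not the paper's implementation:

```python
import math
import random

def g(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def f(t, a, b):
    """Marsaglia's density for (a + x)/(b + y), equation 6.3."""
    q = (a * t + b) / math.sqrt(1.0 + t * t)
    # integral of g from 0 to q, via the error function
    integral = 0.5 * math.erf(q / math.sqrt(2.0))
    return (math.exp(-0.5 * (a * a + b * b)) / (math.pi * (1.0 + t * t))
            * (1.0 + q * integral / g(q)))

def set_specification(S, mu_s, mu_c, sigma_s, sigma_c, rng=random):
    """Sketch of SetSpecifications: draw a ratio sample near the mode of the
    ratio distribution and return the amount zeta = S * (1 - ratio) to prune."""
    k = sigma_s / sigma_c
    a, b = mu_s / sigma_s, mu_c / sigma_c
    # locate max_t f(t) by a simple grid search over a wide range (our choice)
    f_max = max(f(i / 1000.0, a, b) for i in range(-5000, 5001))
    while True:
        # Step 1: independent normal draws (random.gauss suffices here)
        zs = rng.gauss(mu_s, sigma_s)
        zc = rng.gauss(mu_c, sigma_c)
        ratio = zs / zc
        t = ratio / k  # equation 6.4
        # Step 4: valid pruning fraction, and density near the mode
        if 0.0 <= ratio <= 1.0 and f(t, a, b) >= 0.75 * f_max:
            return S * (1.0 - ratio)

# Illustrative call: S = 100 µm of DL, stressed stats N(80, 10), control N(100, 10).
spec = set_specification(100.0, 80.0, 100.0, 10.0, 10.0, random.Random(1))
```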
7.2 BP Pruning Probabilities. Although DL and BP pruning are different numbers in terms of specifications and experimental statistics, they are physically linked, as they constitute the same dendritic tree. An important goal of our algorithm is to ensure that the reduction schedules of BP and DL remain in synchrony through all the iterations of the algorithm. Given that for a specific shell n, ζ_kB(n) BPs have to be pruned for a ζ_kD(n) reduction in DL, we enforce this by setting the BP pruning probabilities such that, on average, one BP is removed for every ζ_kD(n)/ζ_kB(n) reduction in DL:

\rho_{kB}(n) = \frac{\rho_{kD}(n)}{P} \quad \text{with} \quad P = \frac{\zeta_{kD}(n)\,\phi_{kB}(n)}{\zeta_{kB}(n)\,\phi_{kD}(n)}.  (7.2)

In order to explain the motivation behind the above equation, we rewrite it as the following set of equations:

\bar{\rho}_{kB}(n) = \frac{\rho_{kD}(n)}{P_1} \quad \text{with} \quad P_1 = \frac{\zeta_{kD}(n)}{\zeta_{kB}(n)},  (7.3)

\rho_{kB}(n) = \frac{\bar{\rho}_{kB}(n)}{P_2} \quad \text{with} \quad P_2 = \frac{\phi_{kB}(n)}{\phi_{kD}(n)}.  (7.4)

The expression for ρ̄_kB(n), equation 7.3, states that for one BP to be pruned, as per the specifications, on average, P_1 units of DL are required to be pruned. Equation 7.4 modulates this base expression with the current pruning values φ_kl(n), ensuring that BP pruning runs according to schedule with respect to DL pruning. It may be noted here that the value of P in equation 7.2 determines whether or not BP pruning is on par with DL pruning. This can be easily confirmed by noting that P is the product of P_1 and P_2, that is, the ratio of the fractions of BP and DL pruning completed so far: P = (φ_kB(n)/ζ_kB(n))/(φ_kD(n)/ζ_kD(n)). If P > 1, then BP pruning is ahead of schedule; in this case, ρ_kB(n) is set to be less than ρ_kD(n) as per equation 7.2, which means that the probability of pruning a BP has been decreased with respect to that of pruning a DL, thus forcing BP pruning to slow down. Similarly, if P < 1, BP pruning would be running behind schedule and ρ_kB(n) > ρ_kD(n), meaning that BP pruning becomes more probable than DL pruning. Finally, BP pruning is on schedule with respect to DL pruning if P = 1. Effectively, P, which gets updated at each iteration of the algorithm (being dependent on φ_kl(n)), ensures the requirement that, on average, one BP is removed for every ζ_kD(n)/ζ_kB(n) reduction in DL.

8 Pruning the Dendritic Tree

Our pruning algorithm is an iterative process that selects a given shell in each iteration, based on the pruning probabilities derived above, and prunes either 1 µm of DL or one BP. An adaptation of the rejection method for generating random numbers (Knuth, 1997) is employed to select the shells within which pruning of DL and BP is to be executed. Similar methods have been used in models of dendritic growth and for pruning simple, virtual dendritic trees (van Pelt, Dityatev, & Uylings, 1997; van Pelt, 1997).
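Under our reading of equations 7.1 and 7.2 (we take ρ_kB(n) = ρ_kD(n)/P, consistent with the surrounding text), the probability updates can be sketched as follows; the function names, the dictionary layout, and the guard for the first iterations (when φ is still zero) are our own assumptions:

```python
# Sketch of the shell probabilities of section 7; data values are illustrative.

def dl_probabilities(zeta_D, phi_D):
    """Equation 7.1: normalize the remaining DL specifications over all shells.
    zeta_D and phi_D are dicts keyed by (k, n)."""
    remaining = {key: zeta_D[key] - phi_D[key] for key in zeta_D}
    total = sum(remaining.values())
    return {key: r / total for key, r in remaining.items()}

def bp_probability(rho_D, zeta_D, zeta_B, phi_D, phi_B):
    """Equation 7.2 for one shell: rho_kB = rho_kD / P with
    P = (zeta_D * phi_B) / (zeta_B * phi_D), the ratio of the completed BP
    and DL pruning fractions. P > 1 (BP ahead of schedule) scales the BP
    probability down; P < 1 scales it up."""
    if phi_B == 0 or phi_D == 0:
        # First iterations: fall back to the base expression of equation 7.3
        # (our choice of guard; the paper does not discuss this corner case).
        return rho_D * zeta_B / zeta_D
    P = (zeta_D * phi_B) / (zeta_B * phi_D)
    return rho_D / P

# Illustrative DL specifications for three apical shells, nothing pruned yet.
zeta_D = {("A", 1): 30.0, ("A", 2): 60.0, ("A", 3): 10.0}
phi_D = {key: 0.0 for key in zeta_D}
rho_D = dl_probabilities(zeta_D, phi_D)
```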
The steps involved in pruning DL (in steps of 1 µm) or a BP, as part of a single iteration, are:

1. Select a BP or a terminal dendritic point randomly using a uniformly distributed random number generator.

2. Find the shell number n̂ to which the point belongs.

3. Generate a random number rand from a uniform distribution.

4. DL pruning: If (rand < ρ_kD(n̂)), prune 1 µm of length starting from the terminal point, reduce φ_kD(n̂) by unit value, and update ρ_kD(n̂) and ρ_kB(n̂) as in equations 7.1 and 7.2, respectively.

5. BP pruning: If (rand < ρ_kB(n̂) ∏_{m∈P} ρ_kD(m)^{R_m}), prune the dendrite that connects the BP to a terminal point, reduce φ_kB(n̂) by one, and update φ_kD(n̂). Modify ρ_kD(n̂) and ρ_kB(n̂) as in equations 7.1 and 7.2, respectively.

The algorithm converges when max_{k,l,n} (ζ_kl(n) − φ_kl(n))/ζ_kl(n) < T (with a default value of T = 0.001), where T gives the allowance threshold for the maximum error in specifications. If the algorithm does not converge satisfactorily in certain shells due to the organization of the dendrites in some trees, postprocessing is done on those shells alone to make sure that the pruned tree meets the specifications.

In the above scheme, we use the fact that a BP ceases to exist if one of the two subtrees originating from it is completely removed. We remove a BP by removing all the dendritic points present between it and its corresponding terminal point (with no other BP in between). For instance, if the BP represented by BP_t in Figure 3 is to be pruned, one has to remove either DL_tt1 or DL_tt2, so that BP_t ceases to be a BP. This removal of dendritic length should also be taken into account in setting the probability of pruning a BP. This is done by revising the probability for pruning the BP as (step 5 above):

\rho'_{kB}(\hat{n}) = \rho_{kB}(\hat{n}) \prod_{m \in P} \rho_{kD}(m)^{R_m},  (8.1)

where ρ_kB(n̂) is as in equation 7.2. The set {R_m : m ∈ P} refers to the numbers of dendritic points that are present between the pruned BP and the corresponding terminal point, where P corresponds to the set of all the Sholl shells through which these points traverse. With reference to the illustration in Figure 3, if DL_tt1 is chosen for pruning, then P = {3, 4, 5}, as DL_tt1 traverses through shells 3, 4, and 5 (see Figure 3).

Figure 3: Illustration of pruning terminal and nonterminal BPs. Dendritic subtree overlaid on the concentric shells used in Sholl's analysis, indicating typical terminal (BP_t) and nonterminal (BP_n) BPs. DL_nt corresponds to the DL between BP_n and BP_t. DL_tt1 and DL_tt2 represent the DLs between BP_t and its terminal points. The numerical indices on the left correspond to the shell numbers.

Pruning a nonterminal branch point involves more constraints than pruning a terminal branch point. The difference between pruning terminal and nonterminal branches is depicted in Figure 3. The probability of pruning the terminal branch point BP_t is a product of the probability of pruning BP_t alone and of pruning DL_tt1 through P, as given by equation 8.1. On the other hand, the probability of pruning the nonterminal branching point BP_n would be a product of the probabilities of pruning BP_n alone, pruning DL_nt, pruning BP_t, and pruning DL_tt1 and DL_tt2. This product turns out to have a very low value for all nonterminal branch points, and omitting these did not cause any difference in the pruning procedure. Hence we do not consider nonterminal branching points for pruning unless they eventually become terminal branches during the pruning process.

9 Results

A Linux system running on a Pentium IV processor is used for all computations. The pruning algorithm is implemented in C++. A typical execution of the algorithm takes less than 1 minute for a CA3b pyramidal cell from the DSArchive. The total DL of typical CA3 pyramidal neurons is in the range of 1.2 mm to 1.5 mm, with the apical tree covering around 0.7 mm of the total length (approximately 60–65% of the total DL). The total number of BPs is around 60 to 75 in these neurons.
In order to test the algorithm, we generate 100 sets of specifications using the SetSpecifications algorithm and run the pruning algorithm with these specifications. We then perform Sholl's analysis on these pruned neurons and plot their statistics (see Figure 4). Figure 4A illustrates, as a function of the radial distance from the soma, the region-specific reduction in DL of the pruned neurons (equivalent to an experimental neuron that has undergone stress-induced atrophy) with respect to the original neuron (equivalent to a control unstressed neuron). Figure 4B displays the effects on the number of BPs. Several features of the results depicted in Figure 4 are particularly relevant with respect to experimental data on stress-induced dendritic atrophy. First, the most significant pruning is evident at a distance of 100 to 300 µm from the soma on apical dendrites, that is, in the stratum radiatum of area CA3. This is in agreement with experimental observations in several studies (Vyas et al., 2002; McEwen, 1999; Watanabe et al., 1992). Second, we carried out a more quantitative analysis to further validate this qualitative agreement between experimental data and results obtained from our pruning algorithm (see Figure 5). As described in section 6, the algorithm is designed to emulate the ratio of pruning BP and DL along all dendritic shells. Thus, we compare the modes of the reduction distribution² obtained from experimental data and from the outcomes of our algorithm for each shell along the apical and basal sides.

² s_kl(n)/c_kl(n) corresponds to the ratio distribution of statistics from the stressed and control animals. The reduction distribution corresponds to the distribution of reduction in DL/BP in stressed neurons with respect to control neurons and is represented as 1 − s_kl(n)/c_kl(n).

The methods used to compute the experimental and algorithmic modes of the reduction distribution are explained in Figure 5A. This illustration employs the dendritic length of a shell located at 400 µm from the soma on the apical side as an example. We use biological atrophy data (Vyas et al., 2002) and equation 6.3 to generate the distribution of percentage reductions in the dendritic length. The percentage value at which this distribution attains its maximum corresponds to the experimental mode of the distribution (see Figure 5A, left). The distributions that are used for computing these modes are 1 − s_kl(n)/c_kl(n), n = 0, 1, . . . , N_k − 1, k ∈ {A, B}, l ∈ {B, D}, and for the given example, it is 1 − s_AD(8)/c_AD(8).

We then determine the algorithm modes by following these steps (see Figure 5A, right): (1) generate 100,000 different specifications on a given model neuron, (2) prune the given neuron exactly to these specifications using the algorithm, (3) obtain Sholl's analysis for each of the 100,000 outcomes, (4) plot the histograms of each of S_kl(n) for n = 0, 1, . . . , N_k − 1, k ∈ {A, B}, l ∈ {B, D} (i.e., the output of Sholl's analysis) using all the 100,000 outcomes, and (5) compute the mode of the histograms. Figure 5A also
Figure 4: Pruning algorithm replicates experimental dendritic atrophy in 3D reconstructions of CA3 pyramidal cells. (A) Dendritic length (µm) as a function of the radial distance from the soma (µm). Open circles: shell-wise DL for the unpruned neuron; filled circles: mean and SEM of shell-wise DL of 100 neurons pruned with specifications set in the range of changes experimentally observed following chronic stress. (B) Branching points as a function of the radial distance from the soma (µm). Open circles: shell-wise count of BPs on the apical and basal sides for the unpruned neuron; filled circles: mean and SEM of shell-wise BPs of the 100 neurons used in A.
brings out the accuracy of the pruning procedure by showing that within the valid specification range (indicated by the solid vertical lines, Figure 5A, right), there is complete overlap between the pruning specifications (gray curve) and the outcomes of the algorithm (black curve). Using the process depicted in Figure 5A, modes of percentage reduction distribution for DL (see Figure 5B) and BP (see Figure 5C) are plotted for each shell as a function of its distance from the soma. As Figures 5B and 5C depict, this analysis confirmed that the modes of the reduction distributions
obtained from biological data (see Figures 5B and 5C, "experimental," open bars) match the outcomes of our pruning algorithm (see Figures 5B and 5C, "algorithm," gray bars) in all dendritic regions. The maximum discrepancy is found to be less than 5% in all comparisons involving the modes of DL and BP reduction along all shells on the apical and basal sides.

Finally, a further point of agreement between our results and experimental data is with reference to the reduction in total dendritic length. Stress-induced atrophy in total dendritic length has been reported to be at around 30% in CA3 pyramidal cells (Watanabe et al., 1992; Vyas et al., 2002). This is in agreement with the outcome of our algorithm, which prunes 30% to 45% of the total dendritic length, while maintaining region specificity in the process (see Figure 6). It may also be observed from Figure 6 that the stratum radiatum (the region between the two dotted lines) has undergone maximal pruning relative to the other regions of the neuron, which is also consistent with the experimentally obtained data on stress-induced dendritic atrophy (Watanabe et al., 1992; Vyas et al., 2002).

While the pruning algorithm is capable of eliciting specific patterns of dendritic remodeling that match experimental observations rather well, it should also be noted that our algorithm can prune a given tree to various degrees of atrophy under a specified probabilistic regime (see Figure 6). This helps in obtaining trees of any desired level of atrophy with the relative reductions in BP and DL across various shells maintained as per the specifications. This feature is a direct outcome of the iterative nature of the
Figure 5: Agreement between shell-specific modes of percentage reduction distributions observed biologically and calculated from the algorithmic outcomes. (A) Illustration of the calculation of modes of the reduction distribution for DL, using the example of a shell located at 400 µm from the soma on the apical side. The experimental mode (left) is computed by finding the point at which the reduction distribution (obtained from equation 6.3) attains its maximum value. The mode of the reduction distribution corresponding to the algorithm (right) is obtained by finding the histogram of Sholl's analysis outcomes of the pruned neurons (curve highlighted in black). The plot in gray depicts the samples of the reduction distribution that were used to arrive at the pruning specifications. The 0.75*MAX value (dotted horizontal line) indicates the constraint imposed by the SetSpecifications algorithm (step 4) to determine the samples that are considered valid specifications. This, along with the other constraint that the sample has to lie between 0 and 100% (step 4 of the SetSpecifications algorithm), constitutes the difference between the gray and black curves. Within the valid specification range (indicated by the solid vertical lines), there is complete overlap between the pruning specifications (gray curve) and the outcomes of the algorithm (black curve). Using the process depicted in A, modes of the reduction distribution for DL (B) and BP (C) are computed for each shell as a function of distance from the soma.
Figure 6: A wide range of dendritic atrophy can be achieved using the pruning algorithm. Projections of 3D reconstructions of a CA3 pyramidal cell and its various pruned versions. The percentage of pruning is given to the left of each projection. Pruning in the range of 30% to 45% reduction (reconstructions within the gray box) corresponds to the outcomes of the algorithm with specifications set in the range of atrophy observed experimentally after chronic stress in animal models (Vyas et al., 2002; Watanabe et al., 1992). The stratum radiatum (delineated by parallel dotted lines) undergoes maximal dendritic pruning and debranching, which is consistent with experimental data (Vyas et al., 2002; Watanabe et al., 1992).
algorithm, which can be stopped at any given iteration to obtain a tree of the desired length. The relative reduction across shells is kept constant because the choice to prune a DL or BP is guided by the probabilistic regime through all the iterations.

10 Discussion

In this study, we have employed chronic stress-induced dendritic atrophy in the hippocampus as a model to develop an algorithm to elicit systematic, region-specific remodeling of 3D reconstructions of CA3 pyramidal neurons. While it is standard practice to find the ratio of the means of two distributions to determine the amount of reduction, in our model, we have employed the actual ratio distribution to arrive at the statistics of the ratio. Using this ratio distribution, we have calculated region-based distributions of reduction in dendritic arborization, which are used to characterize the region-specific pruning probabilities at any given distance from the soma. These probabilities then drive an iterative algorithm, which probabilistically removes either 1 µm of dendritic length or one branching point in each iteration. There are two distinct advantages of this algorithm. First, it prunes BP and DL in parallel, both within the same probabilistic framework, thus ensuring that the average pruning in DL per BP is maintained through all iterations (see equation 7.2). Second, it provides precise control over the remodeling process at any point of interest, with the relative pruning across various regions still conforming to biological data. This helps in analyzing the causal structure-function relationship over a range of remodeling by maintaining region specificity throughout, thereby allowing us to monitor the evolution of the neuron's biological function as the remodeling proceeds.
It is also important to emphasize that although we used stress-induced dendritic atrophy as a test case, the algorithm presented here is a generalized one that can be applied to a wide range of biological problems involving dendritic remodeling. In other words, this algorithm can be applied to induce both growth and pruning of neuronal reconstructions. In problems involving dendritic growth, the only difference would be to add a BP or 1 µm of DL once a specific region is probabilistically selected. Further, for simulating dendritic growth, the diameter of the newly formed dendrites should also be fixed, which can be done using Rall's branching rule (Rall, 1977). A related class of algorithms for generating virtual dendritic trees to match a given statistical framework is the L-Neuron package (Ascoli & Krichmar, 2000; Ascoli, Krichmar, Scorcioni, Nasuto, & Senft, 2001). Our algorithm is different from these because here we remodel real 3D reconstructions based on experimental data, whereas L-Neuron generates virtual neurons based on morphometric data. Further, for analyzing the effects of localized remodeling, our algorithm has the advantage of generating neurons for possible analysis of causal structure-function relationships. This is possible because we remodel dendrites of a given neuron to various levels of atrophy, and the extent and location of the reduction are well known. This is in contrast to, say, having two groups of neurons generated to match the stress and control statistics and using them for analyzing the structure-function relationship. Such an analysis can provide us with only correlative relationships rather than precise causal relationships, as the analysis is not confined to remodeling a given neuron. Given that systematic manipulation of dendritic morphology is currently impossible with biological neurons, the framework presented here provides us with a tool to quantitatively address questions related to specific contributions of dendritic structure to neuronal function during both neuronal development and experience-induced plasticity in adults. The region specificity of the outputs of the algorithm opens new avenues to address the exciting possibility that localized changes in dendritic structure translate to localized changes in neuronal function. It also allows us to analyze the effects of possible local modulation of ionic and synaptic channels as a result of dendritic remodeling. Such analysis may have significant implications for recent findings related to intrinsic plasticity in neurons (Frick et al., 2004).

R. Narayanan, A. Narayan, and S. Chattarji

Acknowledgments

This work was supported by a research grant from the Department of Biotechnology, India. We are grateful to Ajai Vyas and Rupshi Mitra for helpful discussions. We also thank the anonymous reviewer for the positive comments and suggestions, which helped substantially in improving the presentation of this article.

References

Anderton, B. H., Callahan, L., Coleman, P., Davies, P., Flood, D., Jicha, G. A., Ohm, T., & Weaver, C. (1998). Dendritic changes in Alzheimer's disease and factors that may underlie these changes. Progress in Neurobiology, 55, 595–609.
Ascoli, G., & Krichmar, J. L. (2000). L-Neuron: A modeling tool for the efficient generation and parsimonious description of dendritic morphology. Neurocomputing, 32–33, 1003–1011. Ascoli, G., Krichmar, J., Scorcioni, R., Nasuto, S., & Senft, S. (2001). Computer generation and quantitative morphometric analysis of virtual neurons. Anatomy and Embryology, 204, 283–301. Berzhanskaya, J., Urban, N. N., & Barrionuevo, G. (1998). Electrophysiological and pharmacological characterization of the direct perforant path input to hippocampal area CA3. Journal of Neurophysiology, 79, 2111–2118. Bothwell, S., Meredith, G. E., Phillips, J., Staunton, H., Doherty, C., Grigorenko, E., Glazier, S., Deadwyler, S., O’Donovan, C. A., & Farrell, M. (2001). Neuronal hypertrophy in the neocortex of patients with temporal lobe epilepsy. Journal of Neuroscience, 21, 4789–4800.
Region-Specific Remodeling of Dendrites
Brizzee, K. R. (1987). Neuron numbers and dendritic extent in normal aging and Alzheimer's disease. Neurobiology of Aging, 8, 579–580. Cannon, R., Turner, D., Pyapali, G., & Wheal, H. (1998). An on-line archive of reconstructed hippocampal neurons. Journal of Neuroscience Methods, 84, 49–54. Cline, H. T. (2001). Dendritic arbor development and synaptogenesis. Current Opinion in Neurobiology, 11, 118–126. Cossart, R., Epsztein, J., Tyzio, R., Becq, H., Hirsch, J., Ben-Ari, Y., & Crepel, V. (2002). Quantal release of glutamate generates pure kainate and mixed AMPA/kainate EPSCs in hippocampal neurons. Neuron, 35(1), 147–159. Duan, H., Wearne, S. L., Rocher, A. B., Macedo, A., Morrison, J. H., & Hof, P. R. (2003). Age-related dendritic and spine changes in corticocortically projecting neurons in macaque monkeys. Cerebral Cortex, 13, 950–961. Frick, A., Magee, J., & Johnston, D. (2004). LTP is accompanied by an enhanced local excitability of pyramidal neuron dendrites. Nature Neuroscience, 7, 126–135. Geula, C. (1998). Abnormalities of neural circuitry in Alzheimer's disease: Hippocampus and cortical cholinergic innervation. Neurology, 51, S18–S29. Häusser, M., & Mel, B. (2003). Dendrites: Bug or feature? Current Opinion in Neurobiology, 13, 372–383. Johnston, D., & Amaral, D. G. (1997). Hippocampus. In G. Shepherd (Ed.), Synaptic organization of the brain (4th ed.). New York: Oxford University Press. Johnston, D., Magee, J. C., Colbert, C. M., & Cristie, B. R. (1996). Active properties of neuronal dendrites. Annual Review of Neuroscience, 19, 165–186. Jones, T. A., & Schallert, T. (1994). Use-dependent growth of pyramidal neurons after neocortical damage. Journal of Neuroscience, 14, 2140–2152. Kaufmann, W. E., & Moser, H. W. (2000). Dendritic anomalies in disorders associated with mental retardation. Cerebral Cortex, 10, 981–991. Keyvani, K., & Schallert, T. (2002). Plasticity-associated molecular and structural events in the injured brain.
Journal of Neuropathology and Experimental Neurology, 61, 831–840. Knuth, D. E. (1997). The art of computer programming, Vol. 2: Seminumerical algorithms (3rd ed.). Reading, MA: Addison-Wesley. Krichmar, J. L., Nasuto, S. J., Scorcioni, R., Washington, S. D., & Ascoli, G. A. (2002). Effects of dendritic morphology on CA3 pyramidal cell electrophysiology: A simulation study. Brain Research, 941, 11–28. Magee, J., Hoffman, D., Colbert, C., & Johnston, D. (1998). Electrical and calcium signaling in dendrites of hippocampal pyramidal neurons. Annual Review of Physiology, 60, 327–346. Mainen, Z., & Sejnowski, T. (1996). Influence of dendritic structure on firing pattern in model neocortical neurons. Nature, 382, 363–366. Marc, R. E., Jones, B. W., Watt, C. B., & Strettoi, E. (2003). Neural remodeling in retinal degeneration. Progress in Retinal and Eye Research, 22, 607–655. Marsaglia, G. (1965). Ratios of normal variables and ratios of sums of uniform variables. American Statistical Association Journal, 60, 193–204. McEwen, B. (1999). Stress and hippocampal plasticity. Annual Review of Neuroscience, 22, 105–122.
Migliore, M., & Shepherd, G. M. (2002). Emerging rules for the distributions of active dendritic conductances. Nature Reviews Neuroscience, 3, 362–370. Popov, V. I., Bocharova, L. S., & Bragin, A. G. (1992). Repeated changes of dendritic morphology in the hippocampus of ground squirrels in the course of hibernation. Neuroscience, 48, 45–51. Pyapali, G. K., & Turner, D. A. (1996). Increased dendritic extent in hippocampal CA1 neurons from aged F344 rats. Neurobiology of Aging, 17, 601–611. Rall, W. (1977). Core conductor theory and cable properties of neurons. In E. R. Kandel (Ed.), Handbook of physiology (Sec. 1, Vol. 1, pp. 39–97). Bethesda, MD: American Physiological Society. Ramakers, G. J. (2002). Rho proteins, mental retardation and the cellular basis of cognition. Trends in Neuroscience, 25, 191–199. Regehr, W. G., & Tank, D. W. (1994). Dendritic calcium dynamics. Current Opinion in Neurobiology, 4, 373–382. Segev, I., & London, M. (2000). Untangling dendrites with quantitative models. Science, 290, 744–750. Sholl, D. A. (1953). Dendritic organization in the neurons of the visual and motor cortices of the cat. Journal of Anatomy, 87, 387–406. Stuart, G., Spruston, N., & Häusser, M. (Eds.). (1999). Dendrites. New York: Oxford University Press. Traub, R. D., Jefferys, J. G. R., & Whittington, M. A. (1999). Fast oscillations in cortical circuits. Cambridge, MA: MIT Press. van Ooyen, A., Duijnhouwer, J., Remme, M. W., & van Pelt, J. (2002). The effect of dendritic topology on firing patterns in model neurons. Network: Computation in Neural Systems, 13, 311–325. van Ooyen, A., Willshaw, D. J., & Ramakers, G. J. A. (2000). Influence of dendritic morphology on axonal competition. Neurocomputing, 32–33, 255–260. van Pelt, J. (1997). Effect of pruning on dendritic tree topology. Journal of Theoretical Biology, 186, 17–32. van Pelt, J., Dityatev, A. E., & Uylings, H. B. M. (1997).
Natural variability in the number of dendritic segments: Model-based inferences about branching during neurite outgrowth. Journal of Comparative Neurology, 387, 325–340. Vetter, P., Roth, A., & Häusser, M. (2001). Propagation of action potentials in dendrites depends on dendritic morphology. Journal of Neurophysiology, 85, 926–937. Vyas, A., Mitra, R., Rao, B. S. S., & Chattarji, S. (2002). Chronic stress induces contrasting patterns of dendritic remodeling in hippocampal and amygdaloid neurons. Journal of Neuroscience, 22, 6810–6818. Wang, Z., Xu, N. L., Wu, C. P., Duan, S., & Poo, M. M. (2003). Bidirectional changes in spatial dendritic integration accompanying long-term synaptic modifications. Neuron, 37, 463–472. Watanabe, Y., Gould, E., & McEwen, B. S. (1992). Stress induces atrophy of apical dendrites of hippocampal CA3 pyramidal neurons. Brain Research, 588(2), 341–345. Wong, R. O. L., & Ghosh, A. (2002). Activity-dependent regulation of dendritic growth and patterning. Nature Reviews Neuroscience, 3, 803–812. Received February 3, 2004; accepted June 2, 2004.
LETTER
Communicated by Helge Ritter
Analysis of Cyclic Dynamics for Networks of Linear Threshold Neurons

H. J. Tang
[email protected] K. C. Tan
[email protected] Department of Electrical and Computer Engineering, National University of Singapore, Singapore 117576
Weinian Zhang
[email protected] Department of Mathematics, Sichuan University, Chengdu, Sichuan, P.R. China 610064
The network of neurons with linear threshold (LT) transfer functions is a prominent model for emulating the behavior of cortical neurons. The analysis of the dynamic properties of LT networks, such as multistability and boundedness, has attracted growing interest. However, not much is known about how the connection strengths and external inputs are related to oscillatory behaviors. Periodic oscillation is an important characteristic related to nondivergence: it shows that the network remains bounded even though unstable modes exist. By concentrating on a general parameterized two-cell network, theoretical results on the geometrical properties and the existence of periodic orbits are presented. Although the analysis is restricted to two-dimensional systems, it provides a useful basis for analyzing the cyclic dynamics of some specific LT networks of high dimension. As an application, it is extended to an important class of biologically motivated networks of large scale: the winner-take-all model using local excitation and global inhibition.

Neural Computation 17, 97–114 (2005)
© 2004 Massachusetts Institute of Technology

1 Introduction

Networks of linear threshold (LT) neurons with nonsaturating transfer functions have attracted growing interest (Ben-Yishai, Bar-Or, & Sompolinsky, 1995; Hahnloser, 1998; Wersing, Beyn, & Ritter, 2001; Xie, Hahnloser, & Seung, 2002; Yi, Tan, & Li, 2003). The prominent properties of an LT network fall into at least two categories. First, it is believed to be more biologically plausible, for example, in modeling cortical neurons, which rarely operate close to saturation (Douglas, Koch, Mahowald, Martin, & Suarez, 1995). Second, the LT network lends itself to multistability analysis. A multistable network is not as computationally restrictive as a
monostable network, such as in decision making. Its computational abilities have been explored in a wide range of applications. A class of multistable winner-take-all (WTA) networks was presented in Hahnloser (1998), and an LT network composed of competitive layers was employed to accomplish perceptual grouping (Wersing, Steil, & Ritter, 2001). An LT network designed in microelectronic circuits demonstrated both computation of analog amplification and digital selection (Hahnloser, Sarpeshkar, Mahowald, Douglas, & Seung, 2000). The transfer function is an important factor in bounding the network dynamics. Unlike the sigmoid function in Hopfield networks (Hopfield, 1984) and the piecewise linear sigmoid function in cellular networks (Chua & Yang, 1988), both of which ensure bounded dynamics and at least one equilibrium, a nonsaturating transfer function may give rise to unbounded dynamics and even the nonexistence of equilibria (Forti & Tesi, 1995). The existence and uniqueness of the equilibrium of LT networks have been investigated (Hadeler & Kuhn, 1987; Feng & Hadeler, 1996). A stability condition under which local inhibition is sufficient to achieve nondivergence of LT networks was established (Wersing, Beyn, et al., 2001). A wide scope of dynamics analysis, including boundedness, global attractivity, and complete convergence, was conducted in Yi et al. (2003). However, in analyses of the dynamics of LT networks, little attention has been paid to theoretically analyzing the geometric properties of equilibria and periodic oscillations. It is often difficult to perform theoretical analysis of the cyclic dynamics of a two-dimensional system and impossible to study that of high-dimensional systems. It was reported that slowing the global inhibition produced periodic oscillations in a WTA network and that the epileptic network approached a limit cycle in computer simulations (Hahnloser, 1998). Nevertheless, there was a lack of theoretical proof of the existence of periodic orbits. It also remains unclear what factors affect the amplitude and period of the oscillations. This article provides an in-depth analysis of these issues based on parameterized networks with two LT neurons. Generally, the analytical result is applicable only to two-cell networks; however, it can be extended to a special class of high-dimensional networks, such as a generalized WTA network using local excitation and global inhibition. The principal analysis of the planar system provides a useful framework for studying the dynamical behaviors of such LT networks and gives insight into them.

2 Preliminaries

The network with n linear threshold neurons is described by the additive recurrent model,

ẋ(t) = −x(t) + Wσ(x(t)) + h,   (2.1)
where x, h ∈ Rⁿ are the neural states and external inputs, and W = (wij)n×n is the matrix of synaptic connection strengths. σ(x) = (σ(xi)) is called the LT function, defined by σ(xi) = max{0, xi}, i = 1, . . . , n. Due to the simple threshold nonlinearity, network 2.1 can be divided into pieces of linear systems, and there are 2ⁿ such partitions for n neurons. In each partition, we are able to study the qualitative behavior of the dynamics in detail. In general, equation 2.1 can be linearized as

ẋ(t) = −x(t) + W⁺x(t) + h,   (2.2)

corresponding to each partition. For a given partition, W⁺ is computed simply by w⁺ij = wij for xj ≥ 0 and w⁺ij = 0 for xj < 0, for all i, j. Then it is easy to compute the equilibria of equation 2.2 by solving the linear equation Jx* + h = 0, where J = (W⁺ − I) is the Jacobian and I is the identity matrix. If J is singular, the network may possess infinitely many equilibria, for example, a line attractor (necessary conditions for line attractors have been studied in Hahnloser, Seung, & Slotine, 2003). If J is nonsingular, a unique equilibrium exists, explicitly given by x* = −J⁻¹h. It should be noted that if x* lies in the corresponding region, it is called a true equilibrium; otherwise, it is called a virtual equilibrium.

3 Geometrical Properties of Equilibria

For a planar system, the phase plane is divided into four quadrants: R² = D1 ∪ D2 ∪ D3 ∪ D4, where D1 = {(x1, x2) : x1 ≥ 0, x2 ≥ 0}, D2 = {(x1, x2) : x1 < 0, x2 ≥ 0}, D3 = {(x1, x2) : x1 < 0, x2 < 0}, and D4 = {(x1, x2) : x1 ≥ 0, x2 < 0}. Obviously, D1 is a linear region, which is unsaturated. D2 and D4 are partially saturated, but D3 is saturated. The Jacobians of the two-cell network in the four regions are, explicitly,

J1 = [w11 − 1, w12; w21, w22 − 1],  J2 = [−1, w12; 0, w22 − 1],
J3 = [−1, 0; 0, −1],  J4 = [w11 − 1, 0; w21, −1],

adding the subscript 1, . . . , 4 when it is necessary to indicate the corresponding region. Define δ = det J, µ = trace J, and ∆ = µ² − 4δ. The dynamical properties of the network are determined by the relations among δ, µ, and ∆ according to linear system theory in R² (Perko, 2001): (1) If δ < 0, then x* is a saddle. (2) If δ > 0 and ∆ ≥ 0, then x* is a node that is stable if µ < 0 and unstable if µ > 0. (3) If δ > 0, ∆ < 0, and µ ≠ 0, then x* is a focus that is stable if µ < 0 and unstable if µ > 0. (4) If δ > 0 and µ = 0, then x* is a center. Based on this theory, for the sake of conciseness, the following theorem is presented without detailed analysis:
Table 1: Properties and Distributions of the Equilibria.

Distribution  Conditions                                                        Properties
D1            (w11−1)(w22−1) < w12w21, (w22−1)h1 ≥ w12h2, (w11−1)h2 ≥ w21h1     Saddle
              (w11−1)(w22−1) > w12w21, (w22−1)h1 ≤ w12h2, (w11−1)h2 ≤ w21h1:
                with (w11−w22)² ≥ −4w12w21:  d < 2                              s-node
                                             d > 2                              u-node
                with (w11−w22)² < −4w12w21:  d < 2                              s-focus
                                             d > 2                              u-focus
                                             d = 2                              center
D2            w22 < 1, h1 < w12h2/(w22−1), h2 ≥ 0                               s-node
              w22 > 1, h1 < w12h2/(w22−1), h2 ≤ 0                               saddle
D3            h1 < 0, h2 < 0                                                    s-node
D4            w11 < 1, h1 ≥ 0, h2 < w21h1/(w11−1)                               s-node
              w11 > 1, h1 ≤ 0, h2 < w21h1/(w11−1)                               saddle

Note: d = w11 + w22; s: stable, u: unstable.
Theorem 1. A distribution of equilibria of the two-cell LT network, equation 2.1, and their geometrical properties are given for various cases of connection strengths and external inputs in Table 1.
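The partition-wise linearization of section 2 and the classification by δ, µ, and ∆ can be sketched in a few lines. This is an illustrative reimplementation, not the authors' code; the function names are ours. Applied to the example network used later in section 6 (W = [2, −1; 2, 0], h = (2, 1)ᵀ), the D1 equilibrium (1, 3)ᵀ comes out as a center:

```python
import numpy as np

def effective_jacobian(W, active):
    """J = W+ - I for one of the 2^n partitions; W+ keeps column j of W
    only when neuron j is active (x_j >= 0), as in equation 2.2."""
    Wp = W * np.asarray(active, dtype=float)[None, :]  # zero inactive columns
    return Wp - np.eye(W.shape[0])

def classify(J):
    """Classify a planar equilibrium by delta = det J, mu = trace J,
    and Delta = mu^2 - 4*delta (linear system theory; Perko, 2001)."""
    delta, mu = np.linalg.det(J), np.trace(J)
    Delta = mu ** 2 - 4.0 * delta
    if delta < 0:
        return "saddle"
    if abs(mu) < 1e-12:
        return "center"
    kind = "node" if Delta >= 0 else "focus"
    return ("stable " if mu < 0 else "unstable ") + kind

W = np.array([[2.0, -1.0], [2.0, 0.0]])   # example network of section 6
h = np.array([2.0, 1.0])
J1 = effective_jacobian(W, [1, 1])        # D1: both neurons active
```

If J1 is nonsingular, the (true or virtual) equilibrium is `np.linalg.solve(J1, -h)`, that is, x* = −J⁻¹h; in D3 (both neurons inactive) the Jacobian reduces to −I and the origin-region equilibrium is a stable node, matching Table 1.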
4 Existence and Boundary of Periodic Orbits

According to the results in theorem 1, the equilibrium x* is an unstable focus or a center if it satisfies

w12 w21 < −(1/4)(w11 − w22)²,
(w22 − 1)h1 < w12 h2,
(w11 − 1)h2 < w21 h1,   (4.1)
w11 + w22 > 2 (unstable focus),  w11 + w22 = 2 (center).

If network 2.1 has a topological center in D1, then it always exhibits oscillations around the equilibrium in D1. However, the outermost of the periodic orbits needs further study. When x* is an unstable focus, the network may tend to be unstable, and it is nontrivial to determine whether periodic orbits exist.

Remark 1. The Poincaré-Bendixson theorem (Zhang, Ding, Huang, & Dong, 1992) holds that if there exists an annular region Ω ⊂ R², containing no equilibrium, such that both the orbits from its outer boundary γ2 and those from its inner boundary γ1 enter Ω, then there exists a (possibly nonunique) periodic orbit γ in Ω. If the vector field is analytic (can be expanded as a convergent power series), then the periodic orbits in Ω are limit cycles. Since σ is not analytic, our attention is devoted to the existence of periodic orbits
instead of making further efforts to assert that of limit cycles. Actually, they do not make a significant difference in the oscillatory behavior, though they differ in geometrical meaning.

Let L1 and L2 be the two lines that satisfy ẋ1 = 0 and ẋ2 = 0, respectively:

L1: (w11 − 1)x1 + w12 x2 + h1 = 0,
L2: w21 x1 + (w22 − 1)x2 + h2 = 0.

When w12 < 0, w11 > 1, w22 < 1, w21 > 0, equation 4.1 describes a vector field as shown in Figure 1. The trajectories that evolve in D1 and D2 are also intuitively illustrated. ΓR is the right trajectory in D1 that starts from point p and ends at point q(0, q). ΓL is the left trajectory in D2 that starts from point q and ends at point r(0, r). The arrows indicate the directions of the vector field. It can be seen that there exists a rotational vector field around the equilibrium.

Theorem 2. The two-cell LT network, equation 2.1, must have one periodic orbit if it satisfies w12 < 0, w22 < 1, and equation 4.1.

Figure 1: Vector fields described by the oscillation prerequisites. The trajectories in D1 and D2 (ΓR and ΓL, respectively) forced by the vector fields are intuitively illustrated.
Proof. The equilibrium in D2 is computed by

x* = (1/(1 − w22)) [w12 h2 + (1 − w22) h1; h2].

It is noted that x*1 > 0, x*2 > 0; hence, it is a virtual equilibrium lying in D1. Although it does not practically exist, it still drives the trajectory to approach it or diverge from it along its invariant eigenspace. When w22 ≠ 0, the Jacobian matrix in D2 has two distinct eigenvalues, λ1 = −1 and λ2 = w22 − 1, as well as two linearly independent eigenvectors, e1 = (1, 0)ᵀ and e2 = (w12/w22, 1)ᵀ. First, consider the two cases for w22 < 1, where the subspaces spanned by e1 and e2 are both stable:

i. 0 < w22 < 1. In this case, λ1 < λ2; let Ess = Span{e1} and Ews = Span{e2} denote the strong stable subspace and weakly stable subspace, respectively. Figure 2 shows the phase portrait. Each trajectory except the invariant line Ess approaches the equilibrium along a well-defined tangent line, Ews. Since (0, p) lies in Ess, each trajectory must intersect the x2-axis at some point (0, r) where r > p. Therefore, a simple closed curve can be constructed such that Γ = ΓR ∪ ΓL ∪ rp, which can be used as an outer boundary of an annular region.

Figure 2: Phase portrait for 0 < w22 < 1.

ii. w22 < 0. In this case, λ2 < λ1; let Ess = Span{e2} and Ews = Span{e1} denote the strong stable subspace and weakly stable subspace, respectively. The phase portrait is shown in Figure 3. Each trajectory except the invariant line Ess approaches the equilibrium along the well-defined tangent line, Ews. Since (0, p) lies in Ess, each trajectory that starts from (0, q) must intersect the x2-axis at some point (0, r) where r > p. Thus, a simple closed curve can be constructed such that Γ = ΓR ∪ ΓL ∪ rp.

Figure 3: Phase portrait for w22 < 0.

iii. w22 = 0. In this case, there is only a single eigenvector, e1, which corresponds to the negative eigenvalue −1. Figure 4 shows that there is only a single invariant subspace. Each trajectory approaches the equilibrium along a well-defined tangent line determined by e1. Thus, the left trajectory intersects the x2-axis at a point above (0, p).

Figure 4: Phase portrait for w22 = 0.

To summarize the above cases, an outer boundary of an annular region can be constructed such that Γ = ΓR ∪ ΓL ∪ rp. According to the Poincaré-Bendixson theorem, there must be a periodic orbit in the interior of the domain enclosed by Γ.

Remark 2. It can be shown that the two-cell network, equation 2.1, can have no periodic orbit if w22 > 1 and w12 < 0. As analyzed in theorem 3, x*1 < 0, x*2 < 0, so the equilibrium is a virtual equilibrium lying in D3. Notice that the system has two linearly independent eigenvectors, e1 = (1, 0)ᵀ and e2 = (w12/w22, 1)ᵀ; thus, it has a stable subspace Es = Span{e1} and an unstable subspace Eu = Span{e2}. All the trajectories in D2 move away from x* along the invariant space Eu. Therefore, no periodic orbit exists in this case.
The periodic orbits resulting from a center-type equilibrium are investigated below. The proof can be found in the appendix.

Theorem 3. If τ = 0, w12 < 0, w22 < 1 and equation 4.1 holds for network 2.1, then the network produces periodic orbits of center type. The amplitude of its outermost periodic orbit is confined in the closed curve Γ = ΓR ∪ ΓL ∪ rp, as shown in Figure 1, where

p = h2/(1 − w22),  and
q = −w22 (w12 h2 + (1 − w22) h1)/((1 − w22) w12) + ((w11 − 1)h2 − w21 h1)/(w12 w21 − (w11 − 1)(w22 − 1)).
From the geometrical view, the trajectory can switch only between a stable mode and an unstable mode; switching between two unstable modes or between two stable modes is not permitted. It is difficult to calculate the amplitude and period of a periodic orbit exactly. Fortunately, by virtue of the analytical results for center-type periodic orbits in theorem 3 and the trajectory equations in the temporal domain (given in the appendix), it is possible to estimate their variation trends, that is, how they change with respect to θi, τ, and hi. An approximate relationship among the amplitude, period, and external inputs can be stated as follows: (1) q ∝ hi and q ∝ τ², with q indicating the amplitude; and (2) the period increases as τ increases. The detailed analysis procedures are omitted.
5 Winner-Take-All Network

A WTA network is an interesting and meaningful application of LT networks. The study of its dynamics, fixed points, and stability has great merit. A biologically motivated WTA model with local excitation and global inhibition was established and studied by Hahnloser (1998). A generalized model of such a network is formulated as the following differential equations:

ẋi(t) = −xi(t) + θi σ(xi(t)) − L + hi,   (5.1a)
τ L̇(t) = −L(t) + Σ_{j=1}^{n} θj σ(xj(t)),   (5.1b)
where θi > 0 denotes local excitation, i = 1, . . . , n, L is a global inhibitory neuron, and τ is a time constant reflecting the delay of inhibition. It is noted that the dynamics of L has a property of nonnegativity: if L(0) ≥ 0, then L(t) ≥ 0, and if L(0) < 0, then L(t1 ) ≥ 0 after some transient time t1 > 0. In an explicit and compact form for x = (x1 , . . . , xn )T , the WTA model can be rewritten as
[ẋ(t); L̇(t)] = −[x(t); L(t)] + [Θ, −1; vᵀ, 1 − 1/τ] [σ(x(t)); L(t)] + [h; 0],   (5.2)

where Θ = diag(θ1, . . . , θn), v = (1/τ)(θ1, . . . , θn)ᵀ, and 1 is a column vector of ones. Unlike in Hahnloser (1998), the generalized WTA model, equations 5.1a and 5.1b, allows each neuron to have an individual self-excitation θi, which can differ from neuron to neuron. Hence, it is believed to be more biologically plausible than requiring the same excitation for all neurons. By generalizing the analysis of Hahnloser (1998), it can be concluded that if θi > 1 and τ < 1/(θi − 1) for all i, the network performs the WTA computation:

xk = hk,   (5.3a)
xi = hi − θk hk,  i ≠ k,   (5.3b)
L = θk hk,   (5.3c)
where k indicates the winning neuron. Nevertheless, it may not be the neuron having the largest external input. Setting all θi = 1 produces an absolute WTA network, in the sense that the winner xk is the neuron that receives the largest external input hk. Hahnloser (1998) claimed that a periodic orbit may occur by slowing the global inhibition somewhat. However, its occurrence is not known a priori. Consider a single excitatory-inhibitory pair of the dynamic system 5.1a and 5.1b, that is,

[ẋ(t); L̇(t)] = −[x(t); L(t)] + [θ, −1; θ/τ, 1 − 1/τ] [σ(x(t)); L(t)] + [h; 0].   (5.4)
Since each neuron is coupled only with the global inhibitory neuron (though this global inhibitory neuron reflects the collective activities of all the excitatory neurons), it does not interact directly with the other neurons. Therefore, the dynamics of the n-dimensional WTA network can be clarified in the light of studying a two-cell network with a single excitatory neuron. The established results for a two-dimensional network in theorem 2 allow us to put forward the theorem on the existence of periodic orbits for the two-cell WTA model.

Theorem 4. The single excitatory-inhibitory WTA network, equation 5.4, must have a periodic orbit if its local excitation and global inhibition satisfy θ > 1 and 1/(θ − 1) < τ < (√θ + 1)²/(θ − 1)².

The regimes of the generalized WTA network, equations 5.1a and 5.1b, can then be summarized as follows (proposition 1):

• θi > 1 and τ < 1/(θi − 1) for all i: the network performs the WTA computation, which is absolute if all θi = 1.
• θi > 1 and 1/(θi − 1) < τ < (√θi + 1)²/(θi − 1)² for all i: periodic orbits (undamped oscillations) occur.
• θi > 1 and τ > (√θi + 1)²/(θi − 1)² for any i, implying too slow a global inhibition, results in unstable dynamics due to a very strong unstable mode (in two dimensions, it is exactly an unstable focus) that gives birth to divergent activities.
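The three regimes can be checked numerically from the Jacobian of the pair (5.4) in the region x > 0. The following is an illustrative sketch (function names are ours), using θ = 2 so that the oscillation window is 1 < τ < (√2 + 1)² ≈ 5.83:

```python
import numpy as np

def pair_eigenvalues(theta, tau):
    """Eigenvalues of the Jacobian of the pair (5.4) in the region x > 0:
    J = [[theta - 1, -1], [theta/tau, -1/tau]]."""
    J = np.array([[theta - 1.0, -1.0],
                  [theta / tau, -1.0 / tau]])
    return np.linalg.eigvals(J)

def in_oscillation_window(theta, tau):
    """Unstable focus: complex eigenvalues with positive real part."""
    lam = pair_eigenvalues(theta, tau)
    return bool(abs(np.imag(lam[0])) > 1e-12 and np.real(lam).max() > 0.0)
```

With θ = 2: τ = 2 gives an unstable focus (oscillation window); τ = 0.5 gives a stable focus, so the inhibition is fast enough for the WTA computation; τ = 7 gives a real positive eigenvalue, the divergent regime of the third bullet.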
The study of the cyclic dynamics of WTA networks is of special interest for cortical processing and short-term memory, and efforts have been made in this direction (Ellias & Grossberg, 1975; Ermentrout, 1992). A different WTA model using global inhibition was analyzed by Ermentrout (1992), and conditions for the existence of periodic orbits were proven for a single excitatory-inhibitory pair by virtue of the Poincaré-Bendixson theorem. Unlike model 5.4, the neuronal activities there were saturated; thus, the outer boundary of an annular region is straightforward. In contrast, it is nontrivial to construct such an outer boundary for network 5.4, since nonsaturating transfer functions may give rise to unbounded activities. A different treatment in applying the Poincaré-Bendixson theorem is also required by the fact that the WTA network of LT neurons differs from the various models studied in Ermentrout (1992) due
to the nondifferentiability of LT transfer functions, which imposes another difficulty on the analysis of cyclic dynamics. Most interesting, observations similar to our analysis were made: the topology of the connections is the main determining factor for such WTA dynamics, and oscillatory behavior begins when the inhibition slows.

6 Examples and Discussions

In this section, examples of computer simulations are provided to illustrate and verify the theories developed. Consider an example of WTA network 5.1a and 5.1b with six neurons. When τ = 2 and θi = 2 for all i, the network oscillates periodically (see Figures 5 and 6). This is verified by proposition 1, since τ lies in the region 1 < τ < (√2 + 1)² where a periodic orbit (undamped oscillation) occurs. For τ < 1, the network is asymptotically stable and performs the WTA computation, whereas it becomes unstable when τ > (√2 + 1)².
Figure 5: Cyclic dynamics of the WTA network with six excitatory neurons. τ = 2, θi = 2 for all i, h = (1, 1.5, 2, 2.5, 3, 3.5)ᵀ. The trajectory of each neural state is illustrated over 50 seconds. The dashed curve shows the state of the global inhibitory neuron L.
Figure 6: A periodic orbit constructed in the x6 − L plane. It shows that the trajectories starting from five random points eventually approach the periodic orbit.
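The six-neuron example can be reproduced with a simple forward-Euler integration of equations 5.1a and 5.1b. The sketch below is an illustrative reimplementation, not the authors' simulation code; the step size, horizon, and function name are our choices. With fast inhibition, τ = 0.5 < 1/(θ − 1), the same loop settles to the WTA fixed point of equation 5.3 instead of oscillating:

```python
import numpy as np

def simulate_wta(h, theta, tau, T=100.0, dt=1e-3):
    """Forward-Euler integration of the generalized WTA model 5.1a-5.1b
    with identical self-excitation theta_i = theta for all neurons."""
    x = np.zeros(len(h))
    L = 0.0
    for _ in range(int(T / dt)):
        s = np.maximum(x, 0.0)                      # LT transfer sigma(x)
        x = x + dt * (-x + theta * s - L + h)       # equation 5.1a
        L = L + dt * (-L + theta * s.sum()) / tau   # equation 5.1b
    return x, L

h = np.array([1.0, 1.5, 2.0, 2.5, 3.0, 3.5])
x, L = simulate_wta(h, theta=2.0, tau=0.5)   # tau < 1/(theta - 1): stable WTA
```

Starting from the zero state, the neuron with the largest input wins: x settles to xk = hk = 3.5 and L to θhk = 7, with all other neurons pushed below threshold, as in equation 5.3. Setting tau=2.0 instead places each excitatory-inhibitory pair inside the oscillation window of theorem 4 and produces undamped oscillations like those in Figure 5.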
To show the periodic orbits of center type, a simple network is taken as an example:

dx/dt = −x + [2, −1; 2, 0] [σ(x1(t)); σ(x2(t))] + [2; 1].
Figure 7 gives an illustration of the trajectories starting from three different points: (0, 0.5), (0, 1), and (0, 1.2). As can be seen, each finally approaches a periodic orbit, where the inner closed curve shows a center. It is verified by theorem 3, as a computer illustration of Figure 1, where p = 1 and q = 3. The value of q can also be determined by solving the temporal equations of neural states (see the appendix). Further evaluating the temporal equations, A.10 to A.12, shows exactly 1 < r < 2. The established analytical results imply that slowing global inhibition in a multistable WTA network (Hahnloser, 1998) is equivalent to giving rise to an unstable focus, which would result in periodic orbits. Another important issue about LT networks is continuous attractors, for example, line attractors
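The values p = 1 and q = 3 quoted above follow directly from the formulas of theorem 3. The helper below is a hypothetical illustration (the function name is ours), evaluated on the example parameters w11 = 2, w12 = −1, w21 = 2, w22 = 0, h1 = 2, h2 = 1:

```python
def center_boundary(w11, w12, w21, w22, h1, h2):
    """p and q of theorem 3: the intersections of the outer boundary of
    the center-type periodic orbits with the x2-axis."""
    p = h2 / (1.0 - w22)
    q = (-w22 * (w12 * h2 + (1.0 - w22) * h1) / ((1.0 - w22) * w12)
         + ((w11 - 1.0) * h2 - w21 * h1)
         / (w12 * w21 - (w11 - 1.0) * (w22 - 1.0)))
    return p, q

p, q = center_boundary(2.0, -1.0, 2.0, 0.0, 2.0, 1.0)  # the example network
```

Here the first term of q vanishes because w22 = 0, and the second term is the x2-coordinate of the D1 equilibrium, giving (p, q) = (1, 3), which matches Figure 7.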
Figure 7: Periodic orbits of the center type. The trajectories with three different initial points (0, 0.5), (0, 1), and (0, 1.2) approach a periodic orbit that crosses the boundary of D1. The trajectory starting from (0, 1) constructs an outer boundary Γ = Γpqr ∪ rp such that all periodic orbits lie in its interior.
as discussed in Seung (1998) and Hahnloser et al. (2003). On the other hand, as shown in this article, a stable periodic orbit is another type of continuous attractor, one that encircles an unstable stationary state.

7 Conclusion

By concentrating on the analysis of a general parameterized two-cell network, this article studied the geometrical properties of equilibria and the oscillatory behaviors of LT networks. The conditions for the existence of periodic orbits were established, and the factors affecting their amplitude and period were revealed. Generally, the theoretical results are applicable only to a two-cell network; however, they can be extended to a special class of biologically motivated networks of more than two neurons: a WTA network using local excitation and global inhibition. The theory for the cyclic dynamics of such a WTA network was presented. Finally, simulation results illustrated the theory developed.
Appendix

A.1 Proof of Theorem 3. Based on vector field analysis for D1 and D2, it is known that the trajectory starting from a point (0, q) with q > s will intersect the x2-axis again at a point (0, r) above (0, p). The trajectory starting from (0, r) must approach a periodic orbit as t → ∞. Thus, the right trajectory $\Gamma_R$, which starts from (0, p) and ends at (0, q), and its continuation, the left trajectory $\Gamma_L$, prescribe an outermost boundary $\Gamma$ such that every periodic orbit lies in the interior of the closed curve $\Gamma$. Define $\tilde{x}_1(t) = x_1(t) - x_1^*$, $\tilde{x}_2(t) = x_2(t) - x_2^*$ for $t \geq 0$. Applying the trajectory equation in D1, it is obtained that

\[
\frac{w_{12}^2}{\omega^2}\,\tilde{x}_2(t)^2 + \frac{2(w_{11}-1)w_{12}}{\omega^2}\,\tilde{x}_1(t)\,\tilde{x}_2(t) + \Big(1 + \frac{(w_{11}-1)^2}{\omega^2}\Big)\tilde{x}_1(t)^2 - \Big(\frac{w_{11}-1}{\omega}\,\tilde{x}_1(0) + \frac{w_{12}}{\omega}\,\tilde{x}_2(0)\Big)^2 - \tilde{x}_1(0)^2 = 0. \quad (A.1)
\]

Suppose the trajectory intersects the $x_2$-axis again at $(0, q)$, where it exits after time $t_R$. Let $\tilde{x}_1(t_R) = \tilde{x}_1(0) = 0 - x_1^* = -x_1^*$. Then

\[
\frac{w_{12}^2}{\omega^2}\,\tilde{x}_2(t_R)^2 + \frac{2(w_{11}-1)w_{12}}{\omega^2}\,\tilde{x}_1(0)\,\tilde{x}_2(t_R) + \Big(1 + \frac{(w_{11}-1)^2}{\omega^2}\Big)\tilde{x}_1(0)^2 - \frac{w_{12}^2}{\omega^2}\,\tilde{x}_2(0)^2 - \frac{(w_{11}-1)^2}{\omega^2}\,\tilde{x}_1(0)^2 - \frac{2w_{12}(w_{11}-1)}{\omega^2}\,\tilde{x}_1(0)\,\tilde{x}_2(0) - \tilde{x}_1(0)^2 = 0, \quad (A.2)
\]

that is,

\[
w_{12}^2\,\tilde{x}_2(t_R)^2 + 2(w_{11}-1)w_{12}\,\tilde{x}_1(0)\,\tilde{x}_2(t_R) - w_{12}^2\,\tilde{x}_2(0)^2 - 2w_{12}(w_{11}-1)\,\tilde{x}_1(0)\,\tilde{x}_2(0) = 0. \quad (A.3)
\]
To solve the above quadratic equation for $\tilde{x}_2(t_R)$, let

\[
\phi = 4(w_{11}-1)^2 w_{12}^2\,\tilde{x}_1(0)^2 + 4w_{12}^2\big(w_{12}^2\,\tilde{x}_2(0)^2 + 2w_{12}(w_{11}-1)\,\tilde{x}_1(0)\,\tilde{x}_2(0)\big) = 4w_{12}^2\big((w_{11}-1)\,\tilde{x}_1(0) + w_{12}\,\tilde{x}_2(0)\big)^2,
\]

and it is obtained that

\[
\tilde{x}_2(t_R) = \frac{-(w_{11}-1)\,\tilde{x}_1(0) \mp \big|(w_{11}-1)\,\tilde{x}_1(0) + w_{12}\,\tilde{x}_2(0)\big|}{w_{12}}. \quad (A.4)
\]
Since

\[
(w_{11}-1)\,\tilde{x}_1(0) + w_{12}\,\tilde{x}_2(0) = (w_{11}-1)x_1(0) + w_{12}x_2(0) - \big((w_{11}-1)x_1^* + w_{12}x_2^*\big), \quad (A.5)
\]
H. Tang, K. Tan, and W. Zhang
and $x_1(0) = 0$, $x_2(0) = p = -\frac{h_2}{w_{22}-1}$, $(w_{11}-1)x_1^* + w_{12}x_2^* + h_1 = 0$, it holds that

\[
(w_{11}-1)\,\tilde{x}_1(0) + w_{12}\,\tilde{x}_2(0) = 0 + w_{12}\Big(-\frac{h_2}{w_{22}-1}\Big) - (-h_1) > 0. \quad (A.6)
\]

Therefore,

\[
\tilde{x}_2(t_R) = \frac{-(w_{11}-1)\,\tilde{x}_1(0) \mp \big((w_{11}-1)\,\tilde{x}_1(0) + w_{12}\,\tilde{x}_2(0)\big)}{w_{12}}. \quad (A.7)
\]
It is noted that one of the roots is $\tilde{x}_2(t_R) = \tilde{x}_2(0)$, attained when $t_R = 0$; since $t_R > 0$,

\[
\tilde{x}_2(t_R) = \frac{-(w_{11}-1)(0 - x_1^*) - \dfrac{w_{12}h_2 + (1-w_{22})h_1}{1-w_{22}}}{w_{12}} = -\frac{w_{22}h_2}{1-w_{22}} - \frac{w_{22}h_1}{w_{12}} = -w_{22}\,\frac{w_{12}h_2 + (1-w_{22})h_1}{(1-w_{22})\,w_{12}}. \quad (A.8)
\]

Hence,

\[
q = \tilde{x}_2(t_R) + x_2^* = -w_{22}\,\frac{w_{12}h_2 + (1-w_{22})h_1}{(1-w_{22})\,w_{12}} + \frac{(w_{11}-1)h_2 - w_{21}h_1}{w_{12}w_{21} - (w_{11}-1)(w_{22}-1)}. \quad (A.9)
\]
A.2 Neural States Computed in the Temporal Domain. The orbit of the system in D1 is computed by

\[
x(t) - x^e = e^{\frac{\mu}{2}t}
\begin{pmatrix}
\dfrac{w_{11}-w_{22}}{2\omega}\sin\omega t + \cos\omega t & \dfrac{w_{12}}{\omega}\sin\omega t \\[4pt]
\dfrac{w_{21}}{\omega}\sin\omega t & \dfrac{w_{22}-w_{11}}{2\omega}\sin\omega t + \cos\omega t
\end{pmatrix}
\big(x(0) - x^e\big), \quad (A.10)
\]

where $\omega = \frac{1}{2}\sqrt{\big|(w_{11}-w_{22})^2 + 4w_{12}w_{21}\big|}$. The orbit of the system in D2 is computed by

\[
x(t) - x^e =
\begin{pmatrix}
e^{-t} & \dfrac{w_{12}}{w_{22}}\big(e^{(w_{22}-1)t} - e^{-t}\big) \\[4pt]
0 & e^{(w_{22}-1)t}
\end{pmatrix}
\big(x(0) - x^e\big), \quad w_{22} \neq 0, \quad (A.11)
\]

or else

\[
x(t) - x^e =
\begin{pmatrix}
e^{-t} & w_{12}\,t\,e^{-t} \\
0 & e^{-t}
\end{pmatrix}
\big(x(0) - x^e\big), \quad w_{22} = 0. \quad (A.12)
\]
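As a numerical sanity check on equations A.11 and A.12, the closed-form transition matrix can be compared against a Runge-Kutta integration of the linear system $\dot{\tilde{x}} = \begin{pmatrix} -1 & w_{12} \\ 0 & w_{22}-1 \end{pmatrix}\tilde{x}$ that generates it. The sketch below uses arbitrarily chosen weights (w12 = 1.0, w22 = 0.5), not values from the article:

```python
import numpy as np

def d2_transition(t, w12, w22):
    """Closed-form state-transition matrix of equations A.11/A.12."""
    if abs(w22) > 1e-12:  # w22 != 0: equation A.11
        m01 = (w12 / w22) * (np.exp((w22 - 1.0) * t) - np.exp(-t))
        m11 = np.exp((w22 - 1.0) * t)
    else:                 # w22 == 0: equation A.12
        m01 = w12 * t * np.exp(-t)
        m11 = np.exp(-t)
    return np.array([[np.exp(-t), m01], [0.0, m11]])

def rk4_flow(A, z0, t, steps=2000):
    """Integrate dz/dt = A z with the classical fourth-order Runge-Kutta scheme."""
    h, z = t / steps, z0.astype(float)
    for _ in range(steps):
        k1 = A @ z
        k2 = A @ (z + 0.5 * h * k1)
        k3 = A @ (z + 0.5 * h * k2)
        k4 = A @ (z + h * k3)
        z = z + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
    return z

w12, w22, t = 1.0, 0.5, 0.7                      # hypothetical weights
A = np.array([[-1.0, w12], [0.0, w22 - 1.0]])    # Jacobian of the D2 dynamics
z0 = np.array([0.3, -0.2])                       # x(0) - x^e
assert np.allclose(d2_transition(t, w12, w22) @ z0, rk4_flow(A, z0, t), atol=1e-8)
```

The upper-triangular structure of the D2 Jacobian is what makes the closed form elementary: the second component decouples, and the first is driven by it through w12.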
References

Ben-Yishai, R., Bar-Or, R. L., & Sompolinsky, H. (1995). Theory of orientation tuning in visual cortex. Proc. Natl. Acad. Sci. USA, 92, 3844–3848.
Chua, L. O., & Yang, L. (1988). Cellular neural networks: Theory. IEEE Trans. Circuits Systems, 35, 1257–1272.
Douglas, R., Koch, C., Mahowald, M., Martin, K., & Suarez, H. (1995). Recurrent excitation in neocortical circuits. Science, 269, 981–985.
Ellias, S. A., & Grossberg, S. (1975). Pattern formation, contrast control, and oscillations in the short term memory of shunting on-center off-surround networks. Biol. Cybernetics, 20, 69–98.
Ermentrout, B. (1992). Complex dynamics in winner-take-all neural nets with slow inhibition. Neural Networks, 5, 415–431.
Feng, J., & Hadeler, K. P. (1996). Qualitative behavior of some simple networks. Journal of Physics A, 29, 5019–5033.
Forti, M., & Tesi, A. (1995). New conditions for global stability of neural networks with application to linear and quadratic programming problems. IEEE Trans. Circuits Systems, 42(7), 354–366.
Hadeler, K. P., & Kuhn, D. (1987). Stationary states of the Hartline-Ratliff model. Biological Cybernetics, 56, 411–417.
Hahnloser, R. H. R. (1998). On the piecewise analysis of networks of linear threshold neurons. Neural Networks, 11, 691–697.
Hahnloser, R. H. R., Sarpeshkar, R., Mahowald, M. A., Douglas, R. J., & Seung, H. S. (2000). Digital selection and analog amplification coexist in a cortex-inspired silicon circuit. Nature, 405, 947–951.
Hahnloser, R. H. R., Seung, H. S., & Slotine, J. J. (2003). Permitted and forbidden sets in symmetric threshold-linear networks. Neural Computation, 15(3), 621–638.
Hopfield, J. J. (1984). Neurons with graded response have collective computational properties like those of two-state neurons. Proc. Natl. Acad. Sci. USA, 81, 3088–3092.
Perko, L. (2001). Differential equations and dynamical systems. New York: Springer-Verlag.
Seung, H. S. (1998). Continuous attractors and oculomotor control. Neural Networks, 11, 1253–1258.
Wersing, H., Beyn, W. J., & Ritter, H. (2001). Dynamical stability conditions for recurrent neural networks with unsaturating piecewise linear transfer functions. Neural Computation, 13(8), 1811–1825.
Wersing, H., Steil, J. J., & Ritter, H. (2001). A competitive layer model for feature binding and sensory segmentation. Neural Computation, 13(2), 357–387.
Xie, X., Hahnloser, R. H. R., & Seung, H. S. (2002). Selectively grouping neurons in recurrent networks of lateral inhibition. Neural Computation, 14(11), 2627–2646.
Yi, Z., Tan, K. K., & Lee, T. H. (2003). Multistability analysis for recurrent neural networks with unsaturating piecewise linear transfer functions. Neural Computation, 15(3), 639–662.
Zhang, Z., Ding, T., Huang, W., & Dong, Z. (1992). Qualitative theory of differential equations (W. K. Leung, Trans.). Providence, RI: American Mathematical Society.
Received October 14, 2003; accepted June 2, 2004.
LETTER
Communicated by Dennis Barbour
Nonlinear and Noisy Extension of Independent Component Analysis: Theory and Its Application to a Pitch Sensation Model Shin-ichi Maeda
[email protected] Nara Institute of Science and Technology, Ikoma, Nara 630–0192, Japan
Wen-Jie Song
[email protected] Osaka University, Suita, Osaka 565–0871, Japan
Shin Ishii
[email protected] Nara Institute of Science and Technology, Ikoma, Nara, 630-0192 Japan
In this letter, we propose a noisy nonlinear version of independent component analysis (ICA). Assuming that the probability density function (p.d.f.) of the sources is known, a learning rule is derived based on maximum likelihood estimation (MLE). Our model includes some algorithms of noisy linear ICA (e.g., Bermond & Cardoso, 1999) and noise-free nonlinear ICA (e.g., Lee, Koehler, & Orglmeister, 1997) as special cases. In particular, when the nonlinear function is linear, the learning rule, derived as a generalized expectation-maximization algorithm, has a form similar to the noisy ICA algorithm previously presented by Douglas, Cichocki, and Amari (1998). Moreover, our learning rule becomes identical to the standard noise-free linear ICA algorithm in the noiseless limit, while existing MLE-based noisy ICA algorithms do not rigorously include the noise-free ICA. We trained our noisy nonlinear ICA on acoustic signals such as speech and music. The model after learning successfully simulates virtual pitch phenomena, and the existence region of virtual pitch is qualitatively similar to that observed in a psychoacoustic experiment. Although a linear transformation hypothesized in the central auditory system can account for pitch sensation, our model suggests that this linear transformation can be acquired through learning from actual acoustic signals. Since our model includes cepstrum analysis as a special case, it is also expected to provide the kind of useful feature extraction that has often been supplied by cepstrum analysis.
Neural Computation 17, 115–144 (2005)
© 2004 Massachusetts Institute of Technology
1 Introduction

1.1 Review of Independent Component Analysis. Blind source separation (BSS) is the problem of recovering original sources, assumed to be independent of each other, from observations that are linear mixtures of those sources. Let s = [s1, . . . , sm]T denote an m-dimensional original source variable, A an n × m linear mixing matrix, and x = [x1, . . . , xn]T an n-dimensional observation variable given as x = As.
(1.1)
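Equation 1.1 fixes the observations only up to permutation and scaling of the sources: any W = PDA^{-1}, with P a permutation matrix and D diagonal, recovers each source up to order and scale. The toy example below illustrates this; all numerical values are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
s = rng.laplace(size=(2, 1000))          # two independent (supergaussian) sources
A = np.array([[1.0, 0.6], [0.4, 1.0]])   # hypothetical mixing matrix
x = A @ s                                # observations x = As

P = np.array([[0.0, 1.0], [1.0, 0.0]])   # permutation of the outputs
D = np.diag([2.0, -3.0])                 # rescaling (including a sign flip)
W = P @ D @ np.linalg.inv(A)             # W = P D A^{-1}
y = W @ x

# y recovers the sources exactly, but permuted and rescaled:
# y[0] equals -3 * s[1], and y[1] equals 2 * s[0]
assert np.allclose(y[0], -3.0 * s[1])
assert np.allclose(y[1], 2.0 * s[0])
```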
The probability density function (p.d.f.) of the sources, p(s), is unknown, except that the mean of s is fixed at zero. When the mixing matrix A is square (n = m), a demixing matrix W for recovering the sources can be estimated such that y = Wx is component-wise independent. Due to the insufficiency of constraints in the BSS, however, the demixing matrix W is not uniquely determined and contains ambiguities with respect to the permutation and scale of the sources, such that W = PDA^{-1}, where P and D are any n × n permutation and diagonal matrices, respectively. In the following, we ignore these indeterminacies by identifying a demixing matrix W with PDW or a mixing matrix A with ADP. Various algorithms have been proposed to obtain an n × n square mixing or, equivalently, demixing matrix by defining a variety of cost functions, each based on a corresponding independence measure. Some of the cost functions are based on the Kullback-Leibler (KL) divergence between p(y) and $\tilde{p}(y) \equiv \prod_{i=1}^n p_i(y_i)$,

\[
\mathrm{KL}[\,p(y)\,\|\,\tilde{p}(y)\,] = \int p(y) \log\frac{p(y)}{\tilde{p}(y)}\,dy,
\]

where y = Wx and $p_i(y_i) = \int p(y)\,d\bar{y}_i$ (Amari, Cichocki, & Yang, 1996). Here, $\bar{y}_i$ is the marginalized random vector obtained by removing the ith component $y_i$ from the vector y, and $\tilde{p}(y)$ is a distribution in which the components of y are independent of each other. Since the KL divergence is always nonnegative, and zero if and only if $\tilde{p}(y)$ is equivalent to p(y), it is appropriate to minimize the KL divergence in order to recover the independent sources. This approach, however, needs to know the marginal distribution $p_i(y_i)$. Another approach for obtaining the mixing matrix A is to maximize the entropy of z = f(y), where y = Wx and f is a component-wise monotonically increasing nonlinear function (Bell & Sejnowski, 1995). When the nonlinear function f is the cumulative distribution function of y, the entropy maximization of z is equivalent to the minimization of the KL divergence. This approach, however, also needs to know $p_i(y_i)$.
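For a discrete toy distribution, the KL divergence between p(y) and the product of its marginals (which equals the mutual information between the components) can be computed directly; it vanishes exactly when the components are independent. The probabilities below are arbitrary illustrative values:

```python
import numpy as np

def kl_to_marginal_product(P):
    """KL[p(y) || prod_i p_i(y_i)] for a 2-D discrete joint distribution P (in nats)."""
    p1 = P.sum(axis=1)                   # marginal of y1
    p2 = P.sum(axis=0)                   # marginal of y2
    Q = np.outer(p1, p2)                 # fully independent (product) distribution
    mask = P > 0                         # 0 * log 0 contributes nothing
    return float(np.sum(P[mask] * np.log(P[mask] / Q[mask])))

P_dep = np.array([[0.3, 0.1], [0.1, 0.5]])   # correlated components
P_ind = np.outer([0.4, 0.6], [0.7, 0.3])     # independent by construction

assert kl_to_marginal_product(P_dep) > 0.1   # strictly positive: dependence remains
assert abs(kl_to_marginal_product(P_ind)) < 1e-12
```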
The KL divergence between p(y) and $\tilde{p}(y)$ is related to "negentropy" $J(y) \equiv H(y_g) - H(y)$, namely, $\mathrm{KL}[\,p(y)\,\|\,\tilde{p}(y)\,] = J(y) - \sum_{i=1}^n J(y_i)$ (Hyvärinen, 1998b). Here, H(y) is the entropy of the random variable y, and $y_g$ is a gaussian random variable whose covariance matrix is the same as that of y. Since the negentropy J(y) is invariant with respect to the application of any invertible linear transformation to y, the minimization of the KL divergence is equivalent to the maximization of the sum of marginal negentropies $\sum_{i=1}^n J(y_i)$. This optimization problem also requires knowledge of $p_i(y_i)$. Accordingly, all of these approaches based on information-theoretical cost functions require knowledge of $p_i(y_i)$, though it is computationally difficult to estimate it. If we use kernel density estimation to approximate $p_i(y_i)$ from sample points, then, for example, each evaluation of $p_i(y_i)$ requires computation proportional to the number of samples. Moreover, reestimation of $p_i(y_i)$ is necessary after every update of the demixing matrix W, which may occur after every data observation. To reduce the computational cost, therefore, the marginal distribution $p_i(y_i)$ is often restricted, and the estimation is performed in an on-line manner. For example, if the source's p.d.f. is close to a gaussian distribution, it is well approximated by the first several terms of a Gram-Charlier or Edgeworth expansion (Gaeta & Lacoume, 1990; Amari et al., 1996), and those moments can be estimated in an on-line manner. Alternatively, $p_i(y_i)$ is parameterized or fixed, hence restricted. A supposition that the source distribution $p(s) = \prod_{i=1}^n p_i(s_i)$ is parameterized or fixed corresponds to p(y) being parameterized or fixed, respectively. When p(s) is parameterized or fixed, the above approaches based on information-theoretical cost functions are similar to maximum likelihood estimation (MLE) of p(x|A) for a given observation x. However, MLE does not guarantee that the obtained parameter is the true one when the assumed parametric family of p.d.f.s (the model) does not include the true distribution. If the model incorporates not only an unknown parameter but also an unknown function like $p_i(y_i)$, so that an infinite number of parameters would be required to be free of the unknown function, then such an estimation problem is said to be semiparametric. Therefore, BSS is a semiparametric problem.
One way to deal with a semiparametric problem is to define an estimation function, a function of the parameter that is zero if the parameter is identical to the true one and nonzero otherwise, whatever the unknown function is. Although an approach based on an estimation function does not suppose any cost function, making the estimation function zero is a type of MLE in the case of BSS, because it is known that when the source distribution of BSS is parameterized or fixed, the derivative of the likelihood function with respect to the parameter can be regarded as a proper estimation function (Amari & Cardoso, 1997).1 Some other BSS algorithms based on cost functions such as higher-order cumulants can also be regarded as the optimization of certain estimation functions (Cardoso & Souloumiac, 1993). Although all of the above approaches assume a linear, noiseless, square mixture, several extensions have been studied. One is the noisy mixture case,
1 Although a proper estimation function guarantees that the optimized parameter is the true one, it describes neither the convergence nor convergence speed. See Amari, Chen, and Cichocki (1997) for the stability analysis.
in which the observation process is modeled as x = As + n,
(1.2)
where n is a noise term, usually obeying a gaussian distribution. To deal with the noisy mixture case, not only MLE approaches with a certain parameterization of the source distribution (Belouchrani & Cardoso, 1995; Hyvärinen, 1998a; Bermond & Cardoso, 1999; Lewicki & Sejnowski, 2000; Zibulevsky & Pearlmutter, 2001; Girolami, 2001) but also approaches based on an estimation function that is free of assumptions on the source distribution (Hyvärinen, 1999; Kawanabe & Murata, 2000) have been proposed. The MLE approach treats the source signal s as a hidden variable. In this case, the likelihood function includes an integration over the hidden variable. Although the expectation-maximization (EM) algorithm (Dempster, Laird, & Rubin, 1977) is often used for MLE when the model includes hidden variables, the algorithm also requires an integration over the hidden variable to calculate the expected likelihood. Since such integrations are often intractable, they are approximated by point approximation with a delta function (Belouchrani & Cardoso, 1995; Olshausen & Field, 1996; Hyvärinen, 1998a; Rao & Ballard, 1999), Laplace approximation (Bermond & Cardoso, 1999; Lewicki & Sejnowski, 2000; Girolami, 2001), or mean-field approximation (Højen-Sørensen, Winther, & Hansen, 2002). Another extension is the nonlinear mixture case. Because of indeterminacies in general nonlinear BSS, an appropriate nonlinearity should be introduced. One such nonlinearity has been introduced in the postnonlinear mixture model (Jutten & Karhunen, 2003): x = f(As),
(1.3)
where the nonlinear function f is a component-wise monotonic function and the mixing matrix A is square (n = m). In this case, the nonlinear function and the mixing matrix are both uniquely determined, because the KL divergence is invariant under transformation by a component-wise monotonic function. In the nonlinear noise-free case, therefore, either minimization of the KL divergence or entropy maximization is used to estimate the demixing matrix W (Lee, Koehler, & Orglmeister, 1997; Taleb & Jutten, 1999; Almeida, 2004). However, such an information-theoretical approach suffers from ill-conditioning, because the independent sources cannot be obtained by a deterministic transformation like A^{-1} f^{-1} by itself when the observation process involves noise or the transformation is overcomplete (n < m). Estimation functions applicable to noise-free nonlinear mixture cases are currently unknown. In this letter, we deal with a situation in which the mixture is nonlinear and noisy: x = f(As) + n.
(1.4)
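A generative sketch of equation 1.4 follows; the mixing matrix, the exponential choice of f, and the noise level are all illustrative assumptions, not values from the letter:

```python
import numpy as np

rng = np.random.default_rng(1)

m, n_dim, T = 2, 3, 500                   # sources, observations (n >= m), samples
s = rng.laplace(size=(m, T))              # independent supergaussian sources
A = rng.normal(size=(n_dim, m))           # hypothetical mixing matrix
f = np.exp                                # component-wise invertible nonlinearity
sigma = 0.05                              # noise standard deviation
x = f(A @ s) + sigma * rng.normal(size=(n_dim, T))   # x = f(As) + n

assert x.shape == (n_dim, T)
# Unlike the noise-free postnonlinear model, f^{-1}(x) = log(x) no longer equals As
# exactly (and is not even defined where the noise drives x below zero), which is
# why the deterministic inversion A^{-1} f^{-1} fails in the noisy case.
```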
There have been few studies dealing with noisy nonlinear mixture cases, except for the case in which the source distribution is gaussian (Lappalainen & Honkela, 2000).2 Estimation functions that can be applied to noisy nonlinear mixture cases are unknown. For simplicity, the source distribution is fixed in this letter, though an extension to the case in which the source distribution is a mixture of gaussians (Attias, 1999) should be possible. As a theoretical discussion, we present the relation between MLE and entropy maximization based on a deterministic transformation (deterministic entropy maximization), by means of the notion of probabilistic entropy maximization, an extension of deterministic entropy maximization that allows probabilistic sampling instead of a deterministic transformation. Our discussion generalizes the equivalence between MLE and (deterministic) entropy maximization presented by Cardoso (1997). To evaluate the likelihood function in the noisy nonlinear ICA, an integral over the hidden variable s is again necessary. For this reason, we employ an EM algorithm to maximize the likelihood, and the integral is performed by Laplace approximation with a Taylor series expansion. In the special case that the nonlinear function is linear, an analytical solution for the M-step condition is obtained, and the derived algorithm becomes identical to the noisy linear ICA presented by Bermond and Cardoso (1999). In the nonlinear case, on the other hand, an analytical solution is not obtained, though the algorithm is still derived as a generalized EM (GEM) algorithm. When the nonlinear function is linear and the noise is sufficiently small, the GEM algorithm becomes similar to the noisy linear ICA using a "bias removal" technique (Douglas, Cichocki, & Amari, 1998).
Moreover, the GEM algorithm under the assumption of a linear mixture becomes identical in the noiseless limit to the standard ICA algorithm, which assumes a square noise-free linear mixture. The important difference is that our MLE-based noisy nonlinear ICA algorithm includes the noise-free linear ICA algorithm, while the other MLE-based noisy linear ICA algorithms do not include the noise-free linear ICA algorithm even as special cases. We will later discuss why this discrepancy arises.

2 Because the standard BSS assumes the source distribution is nongaussian, this existing algorithm was called nonlinear factor analysis.

1.2 Pitch Sensation Model. A possible implication of ICA in the sensory cortex is feature extraction. Since the major function of the sensory cortex is to process its input "information," it is natural that information theory provides some insight into the processing (e.g., Barlow, 1961; Atick, 1992). As an algorithm grounded in information theory, ICA can provide a principle underlying the feature extraction done by the sensory cortex. Indeed, ICA and a sparse coding scheme, which assumes a sparse distribution for the sources in ICA, successfully demonstrated the emergence of orientation-selective neurons (e.g., Olshausen & Field, 1996; Bell & Sejnowski, 1997; Rao & Ballard, 1999) and extraclassical receptive-field phenomena (Rao & Ballard, 1999) in the visual cortex. In the auditory system, Lewicki (2002) showed that the cochlear filter can be realized by bases trained from natural sounds within the formulation of ICA. In this article, we apply our noisy nonlinear ICA to reproducing some aspects of pitch sensation. Here, we briefly introduce the history of pitch sensation modeling and why ICA bears a relationship to this psychoacoustic topic. Pitch is a psychological quantity that represents how high or low a sound is perceived, relative to the perception of a pure tone composed of a single sinusoidal wave. A sound wave, an air wave of condensation and rarefaction, is essentially a one-dimensional signal. Although there should be arbitrariness in detecting individual auditory sources (auditory streams) from such a one-dimensional signal, animals are able to detect several or more auditory streams. This grouping of auditory streams is often called auditory stream segregation (Bregman & Campbell, 1971). Perceiving a pitch is an instance of such auditory grouping. From its definition, it is easily understood that the perceived pitch depends significantly on the frequency components of heard sound signals. However, various other factors are also involved. For example, one perceives a "middle C3" tone on a piano as a simple pitch, but the tone consists of not only the pure tone corresponding to the perceived pitch but also its harmonic components (Lindsay & Norman, 1977). Seebeck (1843) showed that even if the fundamental frequency of a harmonic combination tone is removed from the combination, one still perceives the same pitch as the tone frequency of the "missing fundamental." This perception is called "virtual pitch," "residue pitch," or "low pitch." We term this phenomenon "virtual pitch" in the rest of this article.
Several hypotheses have been presented to account for pitch perception. Helmholtz (1863) proposed a place theory in which the perceived pitch corresponds to the place that vibrates most on the basilar membrane of the cochlea, and virtual pitch occurs because the difference between components of a combination tone, acting nonlinearly, makes a place vibrate similar to the one most vibrated by the missing fundamental tone. However, later studies found that place theory cannot explain a pitch-shift phenomenon: one perceives an upward (or downward) shift of virtual pitch proportional to the upward (or downward) shift of a harmonic combination tone (de Boer, 1956; Schouten, Ritsma, & Cardozo, 1962). Because, according to place theory, virtual pitch is due to the frequency difference of neighboring components in a harmonic combination, the pitch is not allowed to change when the harmonic combination is shifted while maintaining the combination structure. Place theory was, moreover, unsuccessful in explaining the observation in a masking experiment in which band-limited noise is imposed on the fundamental frequency. In the experiment, a strong noise was applied so as to disturb the nonlinear influence around the part of the cochlear membrane vibrated
most by the fundamental frequency. Although virtual pitch should disappear according to place theory, it was still perceived in the experiment. On the other hand, Schouten (1940) proposed a temporal theory, which explained pitch perception in terms of temporal relationships among components in a combination tone. This theory supposed that one detects peaks of the sound wave and perceives the interval between the peaks as pitch. However, later studies found that pitch perception is not actually very sensitive to phase shifts of a combination tone, although such shifts greatly change the temporal pattern (Patterson, 1973), and that pitch perception is not impaired when a combination of even-numbered multiple tones is presented to one ear and that of odd-numbered tones to the other (Houtsma & Goldstein, 1972). To explain these pitch perception phenomena, various central pattern transformation models were also proposed (Wightman, 1973; Goldstein, 1973; Terhardt, 1974), which fundamentally supposed a linear transformation in the central auditory system after a peripheral spectral analysis. For example, in Wightman's model, the transformation matrix is component-wise represented as

\[
W_{ij} = \frac{2}{n}\cos\Big(\pi\,\frac{(i-1)(2j-1)}{2n}\Big) \quad \text{for } i \neq 0, \qquad W_{ij} = \frac{1}{n} \quad \text{for } i = 0,
\]

and an internal representation for pitch is given by s = Wx, where x denotes the power spectrum of input sounds. When x is obtained in the continuous frequency domain, the above transformation matrix is simply represented by an interval-dependent function

\[
s(\tau) = \frac{1}{2\pi}\int_{-\infty}^{\infty} x(\omega)\cos(\omega\tau)\,d\omega,
\]

or, equivalently, an autocorrelation function

\[
s(\tau) = \lim_{T\to\infty}\frac{1}{2T}\int_{-T}^{T} v(t)\,v(t-\tau)\,dt,
\]

where v(t) denotes the instantaneous sound pressure at time t and T the sound duration. As can be seen from this expression, Wightman's pattern transformation model is not influenced by a phase shift of the sounds, which is a desirable property of a pitch perception model.
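The autocorrelation form of Wightman's model already predicts the virtual pitch of a missing-fundamental tone. The sketch below (the sampling rate, component frequencies, window, and lag range are illustrative choices) autocorrelates a tone built from the 800, 1000, and 1200 Hz harmonics of an absent 200 Hz fundamental; the dominant autocorrelation lag corresponds to 200 Hz, not to any component actually present:

```python
import numpy as np

fs = 16000                                  # sampling rate (Hz)
t = np.arange(0, 0.2, 1.0 / fs)             # 200 ms of signal
# harmonics 4, 5, 6 of a 200 Hz fundamental; the fundamental itself is absent
v = sum(np.cos(2 * np.pi * f * t) for f in (800.0, 1000.0, 1200.0))

# short-time autocorrelation r(k) = sum_t v[t] v[t+k] over a fixed window
win = 2000
lags = np.arange(32, 121)                   # candidate lags spanning 133-500 Hz
r = np.array([np.dot(v[:win], v[k:k + win]) for k in lags])

pitch = fs / lags[np.argmax(r)]             # perceived (virtual) pitch estimate
assert abs(pitch - 200.0) < 5.0             # ~200 Hz despite the missing fundamental
```

Because the waveform repeats every 5 ms (the period shared by all three components), the autocorrelation peaks at an 80-sample lag, matching the missing fundamental.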
Existing central pattern transformation models usually have this property, because they often involve a peripheral spectrum analysis. Later studies revealed that various central pattern transformation models are equivalent to each other under specific conditions (de Boer, 1977; Yoshioka, 1980). Although the central pattern transformation models successfully accounted for several aspects of the virtual pitch and pitch-shift phenomena, the transformation matrix, which plays an essential role in these models, must be determined beforehand by the model's designer. It is more likely that such a transformation matrix is acquired through interaction with the environment, that is, through learning from external inputs. The aim of this study is to show that the transformation matrix can be obtained as a stable point through learning from natural speech and music signals, according to our noisy nonlinear ICA. Although the existing studies on central pattern transformation models assume (noisy) linear transformation matrices, nonlinearity is important in our learning-based approach to cover the dynamic range of natural sounds, which tend to be distributed in a log-normal
fashion. Due to this feature of natural sounds, sound intensity has often been measured in decibels to represent its psychological effect. The noisy setting is advantageous in relaxing the assumptions of ICA; for example, it allows a dimensional disagreement between the input sound data and the internal representation. After learning, virtual pitch phenomena were simulated, and the existence region of virtual pitch in the model was examined; the results are consistent with those of psychophysical experiments. Accordingly, our study implies that animals may have a mechanism to perform noisy nonlinear ICA instead of an inherent transformation matrix. This letter is organized as follows. In section 2, we show the equivalence between MLE and probabilistic entropy maximization. In section 3, we formulate our noisy nonlinear ICA and derive its learning rules, then discuss the relationship with existing ICA models. In section 4, we show pitch sensation simulations. Section 5 presents our conclusion.

2 Probabilistic Entropy Maximization

First, we show that entropy maximization is equivalent to MLE under a more general condition than that assumed in the existing study (Cardoso, 1997). MLE is defined as

\[
\max_{\theta}\; E[\log p(x|\theta)], \quad (2.1)
\]

where x = [x1, . . . , xn]T is an n-dimensional random variable representing an input. E[·] denotes the expectation with respect to the true p.d.f. of x, q(x). We assume that the probabilistic model p(x|θ) has an m-dimensional hidden variable y,

\[
p(x|\theta) = \int p(y)\,p(x|y,\theta)\,dy, \quad (2.2)
\]

where the prior distribution for the hidden variable, p(y), is assumed to be uniform within a certain domain, for example, an m-dimensional hypercube, (0, 1)^m. We further assume that x is generated by a parametric transformation φ from y plus a gaussian noise; namely, p(x|y, θ) is given by

\[
p(x|y,\theta) = \frac{1}{Z(\theta)}\exp\!\big(-\beta\,(x - \phi(y;\theta))^T \Sigma^{-1} (x - \phi(y;\theta))\big), \quad (2.3)
\]

where Z(θ) is a normalization term such that the integral $\int p(x|y,\theta)\,dx$ is 1, β > 0 is a constant hyperparameter, and $\Sigma/(2\beta)$ is the covariance matrix of the gaussian noise. φ(y; θ) denotes a basis function parameterized by the parameter θ. The superscript T denotes the transpose. Probabilistic entropy maximization is defined by means of probabilistic sampling as follows. If y is
sampled according to the sampling distribution $p(y)p(x|y,\theta)/p(x|\theta)$ and the model distribution p(x|θ) is sufficiently close to the true distribution q(x), the distribution of y, q(y), is approximated as

\[
q(y) \equiv \int q(x)\,p(y|x,\theta)\,dx = \int q(x)\,\frac{p(x|y,\theta)\,p(y)}{p(x|\theta)}\,dx \approx p(y).
\]

Since we have assumed that p(y) is uniform, q(y) also becomes uniform, implying that the entropy of q(y) is maximized under the condition that y's support is fixed. Probabilistic entropy maximization maximizes the entropy of q(y), where y is probabilistically sampled as above. Since the above process is mathematically equivalent to the approximation of q(x) by p(x|θ), MLE is a method that performs probabilistic entropy maximization. However, probabilistic entropy maximization does not necessarily employ the same cost function as MLE; the solution of probabilistic entropy maximization corresponds to the MLE solution only when the model p(x|θ) is able to represent the true distribution q(x). Therefore, an MLE employing equations 2.2 and 2.3 is a special case of probabilistic entropy maximization. Now we show that probabilistic entropy maximization reduces to deterministic entropy maximization when the basis function φ(y; θ) is invertible and β approaches infinity. By defining $u \equiv \phi(y)$ and $\psi \equiv \phi^{-1}$, the following relations between u and y hold:

\[
dy = |J(u)|\,du, \quad (2.4)
\]
\[
y = \psi(u), \quad (2.5)
\]

where |J(u)| is the absolute determinant of the Jacobian matrix $J(u) \equiv \partial y/\partial u$. Then equation 2.3 becomes

\[
p(x|y) = \frac{1}{Z(\theta)}\exp\!\big(-\beta\,(x-u)^T \Sigma^{-1} (x-u)\big) \equiv p(u|x). \quad (2.6)
\]

$p(u|x) \equiv p(x|y)$ is regarded as a p.d.f. of u because $\int p(u|x)\,du = 1$ and $p(u|x) > 0$. p(u|x) approaches a Dirac delta function, $p(u|x) \approx \delta(u - x)$, as β goes to infinity. This means that sampling by $p(y|x,\theta) \propto p(x|y,\theta)\,p(y)$ approaches the deterministic transformation y = ψ(x). By substituting equations 2.4, 2.5, and 2.6 into equation 2.2, we derive

\[
p(x|\theta) = |J(x)|, \quad (2.7)
\]

where $J(x) \equiv \partial y/\partial x$ and y = ψ(x). Since the entropy of the transformed variable y = ψ(x; θ) is given by $H(y) = H(x) + \int q(x)\log|J(x;\theta)|\,dx$, maximization of equation 2.7 is identical to deterministic entropy maximization with respect to y, indicating that probabilistic entropy maximization reduces to deterministic entropy maximization when the basis function φ(y; θ) is invertible and β approaches infinity. Also, an MLE of
equations 2.2 and 2.3 is identical to the maximization of equation 2.7 when the basis function φ(y; θ) is invertible and β approaches infinity. The above discussion suggests two points: (1) probabilistic entropy maximization is a generalization of deterministic entropy maximization, and (2) deterministic entropy maximization can be solved by MLE. This generalization is useful for understanding the relationship between our model and existing ICA models based on deterministic entropy maximization, as described later. Since the only assumption we have used is that the basis function φ(y; θ) is invertible, our result on the above equivalence between entropy maximization and MLE is more general than that obtained by Cardoso (1997), who showed the equivalence when ψ(x) is a deterministic function such that ψ(x) = f(Wx), where W is a square and regular matrix and f(x) is a component-wise monotonically increasing nonlinear function.

3 Model and Algorithm

3.1 Probabilistic Model. Here, we describe the details of the probabilistic model we use. Suppose x is an n-dimensional input signal and s is an m-dimensional hidden variable that obeys a component-wise independent distribution $p(s) = \prod_{i=1}^m p_i(s_i)$. We consider a square or undercomplete situation; that is, n is not smaller than m (n ≥ m). Our probabilistic model is then given more specifically by

\[
p(x|A,\theta) = \int p(s)\,p(x|s,A,\theta)\,ds, \quad (3.1)
\]
\[
p(x|s,A,\theta) = \frac{1}{Z(A,\theta)}\exp\!\big(-\beta\,(g(x;\theta) - As)^T \Sigma^{-1}(x;\theta)\,(g(x;\theta) - As)\big), \quad (3.2)
\]
\[
p(s) = \prod_{i=1}^m \frac{1}{2}\big(1 - \tanh^2(s_i)\big), \quad (3.3)
\]
where A is an n×m matrix, g(x; θ) is a component-wise monotonic function, Λ(x; θ) is a diagonal matrix defined below, and Z(A, θ) is a normalization term such that the integral ∫ p(x|s, θ, A) dx is 1. The stochastic generative process p(x|s, θ, A) is based on a post-nonlinear ICA model, namely, x = f(As) + n, where f(·) is the inverse function of g(x; θ) and n is gaussian noise whose mean and covariance are 0 and the diagonal matrix (2β)^{-1} I, respectively, where I denotes the identity matrix. When (x − f(As)) is close to zero, it is approximated as x − f(As) ≈ f′(g(x; θ); θ)(g(x; θ) − As), where f′(x; θ) is a diagonal matrix whose (i, i) element is [f′(x; θ)]_{ii} = ∂f(x_i; θ_i)/∂x_i. Using this approximation, g(x; θ) − As obeys a gaussian distribution whose mean is 0 and whose covariance is the diagonal matrix (2β)^{-1} Λ(x; θ)^{-1}, where the (i, i) element of Λ(x; θ) is [Λ(x; θ)]_{ii} = {f′(g(x_i; θ); θ)}². Note that our model cannot be described as a noisy linear ICA applied after a nonlinear transformation, y = As + n with y = g(x; θ), because the covariance of the noise n depends on x in our formulation. In this study, we use g(x_i; θ) = log(x_i) + c_i, with which f(y_i; θ) = exp(y_i − c_i) and [Λ(x; θ)]_{ii} = {f′(g(x_i; θ); θ)}² = x_i². The parameters of our probabilistic model are A and θ ≡ {c_i | i = 1, …, n}; β is a hyperparameter. The prior defined by equation 3.3 corresponds to the application of the nonlinear transformation tanh(y) when the model is reduced to a deterministic entropy maximization model. This prior is a gaussian-like unimodal distribution, but more kurtotic than the gaussian distribution.

Our model is identical to cepstrum analysis in a special parameter setting. Cepstrum analysis first transforms a given sound into an amplitude spectrum by applying a Fourier transformation. The amplitude spectrum is then transformed by a logarithm function, and another Fourier transformation is applied. The first step is also performed as preprocessing in our model. The second step is also realized in our model if the component-wise nonlinear function is given as g(x; θ) = log(x), the mixing matrix A is fixed at a square Fourier transformation matrix, and the hyperparameter β approaches infinity. Our model becomes a deterministic transformation in the limit of β going to infinity; in this case, it becomes equivalent to cepstrum analysis under the above-mentioned setting.

3.2 Generalized EM Algorithm. Because there is a hidden variable s, an EM algorithm (Dempster et al., 1977) is used for the maximization of the log likelihood of equation 3.1 with respect to the parameters A and θ.

• E-step: Using the current parameter estimates, Ā and θ̄, we obtain the expected log likelihood,
Q(A, θ|Ā, θ̄) = (1/T) Σ_{t=1}^T ∫ p(s_t|x_t, Ā, θ̄) log p(x_t, s_t|A, θ) ds_t,  (3.4)
where {x_t} is the set of T input samples and {s_t} is the set of corresponding hidden variables.

• M-step: Maximize Q(A, θ|Ā, θ̄) with respect to the parameters A and θ.

3.2.1 E-Step. Since the parameter-dependent terms in log p(x, s|A, θ) = −β(g(x; θ) − As)^T Λ(x; θ)(g(x; θ) − As) − log Z(A, θ) + log p(s) include only the first- and second-order statistics of the hidden variable s, the integral in equation 3.4 is calculated by the Laplace approximation. According to the Laplace approximation, p(s|x, Ā, θ̄) appearing in equation 3.4 is approximated as a gaussian distribution peaked at the MAP estimator, ŝ ≡ arg max_s p(s|x, Ā, θ̄) = arg max_s {log p(x|s, Ā, θ̄) + log p(s)}:

p(s|x, Ā, θ̄) ∝ exp(−(1/2)(s − ŝ)^T H(ŝ)(s − ŝ)),  (3.5)
¯ θ). ¯ where H(ˆs) ≡ −∇∇ log p(ˆs|x, A, To obtain the MAP estimator sˆ , a gradient-ascent method is used: ¯ + ϕ(s), ¯ ¯ − As) s ∝ 2β A¯ T (g(x; θ)
(3.6)
where φ(s) ≡ ∂ log p(s)/∂s = −2 tanh(s). The first and second moments are then calculated as

⟨s⟩ ≡ ∫ p(s|x, Ā, θ̄) s ds = ŝ,  (3.7)

⟨ss^T⟩ ≡ ∫ p(s|x, Ā, θ̄) ss^T ds = ŝŝ^T + H(ŝ)^{-1},  (3.8)
where H(ŝ)^{-1} is estimated under the assumption of low noise, that is, that β is fairly large:

H(ŝ)^{-1} = (2β A^T Λ(x; θ) A − ∇∇ log p(ŝ))^{-1}
≈ (1/(2β)) (A^T Λ(x; θ) A)^{-1} + (1/(4β²)) (A^T Λ(x; θ) A)^{-1} ∇∇ log p(ŝ) (A^T Λ(x; θ) A)^{-1}.  (3.9)
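As a concrete illustration, the E-step above can be sketched in Python. This is a minimal sketch under our own naming, assuming g(x; θ) = log(x) + c as in section 3.1 (so Λ(x; θ) = diag(x²)) and using plain fixed-step gradient ascent for equation 3.6; the learning rate and iteration count are assumptions, not values from the original experiments:

```python
import numpy as np

def e_step(x, A, c, beta, n_iter=500, lr=1e-2):
    """Laplace-approximation E-step (equations 3.5-3.9).

    Uses g(x; theta) = log(x) + c, for which [Lambda(x)]_ii = x_i**2.
    Returns <s> = s_hat (eq. 3.7) and <s s^T> = s_hat s_hat^T + H^{-1}
    (eq. 3.8), with H^{-1} from the low-noise expansion (eq. 3.9)."""
    g = np.log(x) + c
    Lam = np.diag(x ** 2)                 # [Lambda]_ii = x_i^2
    s = np.zeros(A.shape[1])
    for _ in range(n_iter):               # gradient ascent, eq. 3.6
        s = s + lr * (2.0 * beta * A.T @ Lam @ (g - A @ s) - 2.0 * np.tanh(s))
    ALA = A.T @ Lam @ A
    ALA_inv = np.linalg.inv(ALA)
    prior_hess = np.diag(-2.0 / np.cosh(s) ** 2)   # grad grad log p(s)
    H_inv = ALA_inv / (2 * beta) + ALA_inv @ prior_hess @ ALA_inv / (4 * beta ** 2)
    return s, np.outer(s, s) + H_inv
```

The fixed step size must be small enough for the ascent to converge; the objective is concave, so any sufficiently small rate works.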
3.2.2 M-Step. Using the statistics of equations 3.7 and 3.8, and neglecting the normalization term (− log Z(A, θ)) under the assumption of low noise, the derivatives of the expected log likelihood Q(A, θ|Ā, θ̄) are calculated as

∂Q/∂A ≈ (2β/T) Σ_{t=1}^T [Λ_t e_t ŝ_t^T − Λ_t A H(ŝ_t)^{-1}],  (3.10)

∂Q/∂c_i ≈ −(2β/T) Σ_{t=1}^T e_{t,i} {f′(g(x_{t,i}; θ_i); θ_i)}²,  (3.11)

where Λ_t ≡ Λ(x_t; θ), and e_{t,i} and x_{t,i} denote the ith elements of e_t ≡ (g(x_t; θ) − Aŝ_t) and x_t, respectively. The parameter c is analytically obtained by setting equation 3.11 to zero:

c_i = − [Σ_{t=1}^T x_{t,i}² (log x_{t,i} − Σ_{j=1}^m a_{ij} ŝ_{j,t})] / [Σ_{t=1}^T x_{t,i}²],  (3.12)
while some constraint is needed to obtain the matrix A analytically. When the function g(x_t; θ) is linear, the M-step equation ∂Q/∂A = 0 from equation 3.10 is analytically solved as

A = [(1/T) Σ_{t=1}^T g(x_t; θ) ŝ_t^T] [(1/T) Σ_{t=1}^T (ŝ_t ŝ_t^T + H(ŝ_t)^{-1})]^{-1},  (3.13)
which recovers the noisy ICA presented previously (Bermond & Cardoso, 1999; Girolami, 2001). In the nonlinear case, however, the analytical solution is not directly obtained, and a gradient-based optimization method is used in our algorithm. Since the M-step is then based on a gradient-ascent method, our EM procedure is an instance of the generalized EM (GEM) algorithm.
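For the linear-g case, the closed-form M-step of equation 3.13 is direct to implement. The sketch below (function and variable names are ours) takes the per-sample E-step statistics as inputs:

```python
import numpy as np

def m_step_A(G, S_hat, H_invs):
    """Closed-form M-step update of A for linear g (equation 3.13).

    G: (T, n) rows g(x_t; theta); S_hat: (T, m) rows <s_t>;
    H_invs: (T, m, m) posterior covariances H(s_hat_t)^{-1}."""
    T = G.shape[0]
    num = G.T @ S_hat / T                             # (1/T) sum g(x_t) <s_t>^T
    den = (S_hat.T @ S_hat + H_invs.sum(axis=0)) / T  # (1/T) sum (<s><s>^T + H^-1)
    return num @ np.linalg.inv(den)
```

In the zero-noise, zero-covariance limit this reduces to an ordinary least-squares fit of A to the (g-transformed) data.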
Using the natural gradient AA^T ∂Q/∂A instead of the gradient ∂Q/∂A accelerates the learning considerably in practice (Amari et al., 1996; Lewicki & Sejnowski, 2000):

AA^T ∂Q/∂A ≈ (2β/T) Σ_{t=1}^T [AA^T Λ_t (g(x_t; θ) − Aŝ_t) ŝ_t^T − AA^T Λ_t A H(ŝ_t)^{-1}].

Since the MAP estimator ŝ satisfies Δs = 0 or, equivalently, 2β A^T Λ(x; θ)(g(x; θ) − As) + φ(s) = 0, the following equation holds:

AA^T ∂Q/∂A ≈ −(1/T) Σ_{t=1}^T [A φ(ŝ_t) ŝ_t^T + 2β AA^T Λ_t A H(ŝ_t)^{-1}].  (3.14)

When the noise is low, that is, β is large enough, H(ŝ_t)^{-1} is approximated as H(ŝ_t)^{-1} ≈ (1/(2β))(A^T Λ_t A)^{-1} + (1/(4β²))(A^T Λ_t A)^{-1} ∇∇ log p(ŝ_t)(A^T Λ_t A)^{-1}, and this leads to

AA^T ∂Q/∂A ≈ −(1/T) Σ_{t=1}^T A [I + φ(ŝ_t) ŝ_t^T + (1/(2β)) ∇∇ log p(ŝ_t)(A^T Λ_t A)^{-1}].  (3.15)

In particular, when A has an inverse matrix W ≡ A^{-1}, noting the fact that ∂Q/∂W = −A^T (∂Q/∂A) A^T, the natural gradient corresponding to equation 3.15 becomes

(∂Q/∂W) W^T W = (1/T) Σ_{t=1}^T [I + φ(ŝ_t) ŝ_t^T + (1/(2β)) ∇∇ log p(ŝ_t) W Λ_t^{-1} W^T] W,  (3.16)
which is closely related to the noisy ICA with the bias removal technique presented previously by Douglas et al. (1998). In their noisy ICA,
however, the sign of the third right-hand-side term in equation 3.16 was opposite, and ŝ was given as ŝ = Wx instead of the MAP estimator we use. Note that local convergence is assured for our generalized EM algorithm, while the convergence of the above bias removal technique remains unclear. If, in addition, the term ((1/(2β)) ∇∇ log p(ŝ_t) W Λ_t^{-1} W^T) is ignored in equation 3.16, it becomes the learning rule of noise-free ICA, which has been proposed from various perspectives as described in section 1. Although noise-free ICA is a special case of noisy ICA, existing noisy ICA learning rules based on MLE (e.g., Belouchrani & Cardoso, 1995; Hyvärinen, 1998a; Bermond & Cardoso, 1999; Højen-Sørensen et al., 2002; Girolami, 2001) do not include the noise-free ICA learning rule, except for the one by Lewicki and Sejnowski (2000). Lewicki and Sejnowski derived a noise-free ICA learning rule by omitting a term that is unrelated to the likelihood ∫ p(s) p(x|s, θ) ds, whereas our learning rule does not include such a term; instead, the term ((1/(2β)) ∇∇ log p(ŝ_t) W Λ_t^{-1} W^T), which appears in our learning rule, automatically vanishes in the noiseless limit. They reported that their additional term makes the learning unstable, whereas ours does not. This difference probably arises because Lewicki and Sejnowski directly evaluated the likelihood function ∫ p(s) p(x|s) ds by approximating p(s) p(x|s) as a gaussian distribution centered at the MAP estimator ŝ, which is dependent on the estimation of A, whereas our EM formulation introduced the expected log likelihood Q(A, θ|Ā, θ̄), in which the MAP estimator ŝ is independent of the estimation of the parameters A and θ. Although their approximation incorporated how the MAP estimator ŝ is affected by changing the parameter A, that effect was difficult to estimate and hence needed approximation, which might have caused the instability.
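In the square, invertible case, equation 3.16 can be implemented directly. The following sketch (function and variable names are ours; Lams holds the diagonal entries of Λ_t) also makes the noiseless limit explicit: as β → ∞ the 1/(2β) term vanishes and the familiar noise-free natural-gradient rule (I + φ(ŝ)ŝ^T)W remains:

```python
import numpy as np

def natural_grad_W(W, S_hat, Lams, beta):
    """Natural-gradient direction for W = A^{-1} (equation 3.16).

    W: (n, n); S_hat: (T, n) MAP estimates; Lams: (T, n) diagonals of
    Lambda_t. The prior-dependent term is scaled by 1/(2 beta) and so
    vanishes automatically in the noiseless limit."""
    T, n = S_hat.shape
    G = np.zeros((n, n))
    for s, lam in zip(S_hat, Lams):
        phi = -2.0 * np.tanh(s)                  # phi(s) = d log p(s) / ds
        hess = np.diag(-2.0 / np.cosh(s) ** 2)   # grad grad log p(s)
        G += np.eye(n) + np.outer(phi, s) \
             + hess @ W @ np.diag(1.0 / lam) @ W.T / (2.0 * beta)
    return (G / T) @ W
```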
Many of the conventional noisy ICA algorithms based on MLE calculate the integral over the hidden variable s by using a point approximation with Dirac's delta function (Belouchrani & Cardoso, 1995; Olshausen & Field, 1996; Hyvärinen, 1998a; Rao & Ballard, 1999). For example, if the term Σ_t Λ_t A H(ŝ_t)^{-1} is removed from equation 3.10, which corresponds to the case in which the posterior is approximated by a delta function at the MAP estimator ŝ, the updating rule becomes equivalent to the original sparse coding method (Olshausen & Field, 1996):

ΔA ∝ (2β/T) Σ_{t=1}^T Λ_t (g(x_t; θ) − Aŝ_t) ŝ_t^T.  (3.17)
It should be noted that if g(x_t; θ) is linear, Λ_t becomes the identity matrix in equation 3.17. In the noiseless limit, equation 3.10 reduces to noise-free ICA while equation 3.17 does not; this difference stems from the approximation accuracy of the posterior. Accordingly, our new learning rule not only improves existing noisy ICA algorithms but also gives an appropriate nonlinear extension of those algorithms.
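For comparison, the delta-function (sparse-coding) update of equation 3.17 can be written as follows (a sketch with our own naming):

```python
import numpy as np

def sparse_coding_dA(G, S_hat, A, Lams, beta):
    """Gradient direction of equation 3.17: the M-step gradient with the
    posterior-covariance term dropped, i.e. a delta-function posterior
    at the MAP estimate. G holds rows g(x_t; theta), Lams holds the
    diagonals of Lambda_t."""
    T = G.shape[0]
    dA = np.zeros_like(A)
    for g, s, lam in zip(G, S_hat, Lams):
        dA += np.outer(lam * (g - A @ s), s)   # Lambda_t (g - A s_hat) s_hat^T
    return 2.0 * beta * dA / T
```

At a perfect reconstruction (g(x_t; θ) = Aŝ_t for every t), the direction is zero, as expected of a stationary point.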
Nonlinear and Noisy Extension of Independent Component Analysis
Our learning rule also generalizes the noise-free nonlinear ICA derived by Lee et al. (1997) and Almeida (2004), because those algorithms are based on deterministic entropy maximization, which is equivalent to MLE in the noiseless limit, as mentioned in section 2.

3.3 Step Size Determination. According to our GEM algorithm, the parameter A is obtained by a gradient-based optimization method, A := A + εΔA. Here, we discuss how to determine an appropriate value for the step size ε. In the GEM algorithm, the step size should be determined so that the expected log likelihood Q(A, θ|Ā, θ̄) is maximized; the optimization of the parameter A in the M-step should achieve Q(A, θ|Ā, θ̄) ≥ Q(Ā, θ|Ā, θ̄). We then define a cost function,

F(ε) ≡ Q(A, θ|Ā, θ̄) − Q(Ā, θ|Ā, θ̄),  (3.18)

where A = Ā + εΔA, and ε is determined such that F(ε) is maximized. F(ε) has a simple quadratic form if either the noise is fairly low or the nonlinear function g(x_t; θ) is linear,

F(ε) ≈ −a (ε − b/(2a))² + b²/(4a),  (3.19)

where a = (β/T) Σ_{t=1}^T Tr(⟨s_t s_t^T⟩ ΔA^T Λ_t ΔA), b = (2β/T) Σ_{t=1}^T [g(x_t; θ)^T Λ_t ΔA ⟨s_t⟩ − Tr(⟨s_t s_t^T⟩ Ā^T Λ_t ΔA)], and Tr(·) denotes the matrix trace. The derivation of equation 3.19 is described in section A.1. Since a and b are positive (see section A.2), the optimal step size is determined as

ε = b/(2a),  (3.20)

with which the GEM algorithm achieves a line search for the parameter A. The gain of the expected log likelihood Q(A, θ|Ā, θ̄) decreases as the noise decreases, and the optimal step size approaches zero in the noiseless limit, because the parameter a diverges while the parameter b converges in that limit. This indicates that the EM algorithm is no longer effective in the noiseless limit. Interestingly, although the update of A according to the GEM algorithm does not increase the expected log likelihood in this case, the update is still meaningful because it increases the likelihood itself. This implies that when the noise is small enough, the step size ε that is optimal for maximizing the likelihood is larger than the one optimal for maximizing the expected log likelihood.

4 Virtual Pitch Simulation

4.1 Condition. Human speech and music were used for training our model. The speech was uttered by four American, four Spanish, and two Japanese speakers; five of them were male and the other five female. The total
duration of speech was 232 seconds. The music was classified into two categories, singing voice and no singing voice. The former consisted of pop music and relaxation music and was extracted from 25 songs so that the total duration was 116 seconds. The latter consisted of a piano concerto, a violin concerto, and an orchestral piece and was extracted from 24 tunes so that the total duration was 116 seconds. All of the sound data were obtained randomly from commercial compact discs, and no periods of silence were included. The data set consisted of 10,000 segments, each of which contained 512 points. The data were quantized into 16 bits and converted to a sampling frequency of 11,025 Hz to cover the existence region of virtual pitch (Ritsma, 1962). To obtain spectral information, a 512-point discrete Fourier transformation (DFT) of the acoustic data was employed, which transformed the acoustic signals into 10,000 256-dimensional spectral patterns. Before performing the DFT, we applied a Hanning window to the data set to remove wideband artifactual energy. We used Fourier bases to mimic peripheral spectral analysis for the following reason. Although bandpass filters whose center frequencies are arranged on a logarithmic scale are more natural than Fourier filters, distinguishing high-frequency components with a resolution sufficient for reproducing virtual pitch phenomena would require many more filters in the logarithmic arrangement than in the linear Fourier arrangement. We adopted the definition of pitch matching used in an existing study (Wightman, 1973). We assume that the peak of the MAP estimator ŝ represents the internal pitch: we suppose that two inputs are perceived as having the identical pitch if the same component of ŝ is most activated in response to both inputs. We set the size of the mixing matrix A at 256×30.
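The preprocessing and the pitch-matching criterion just described can be summarized in code (a hedged sketch: function names are ours, and the exact windowing and bin selection of the original simulation may differ):

```python
import numpy as np

def amplitude_spectrum(segment):
    """Hanning-windowed 512-point DFT amplitude spectrum, keeping the
    first 256 bins as the spectral pattern."""
    w = np.hanning(len(segment))
    return np.abs(np.fft.rfft(segment * w))[: len(segment) // 2]

def same_pitch(s_hat_a, s_hat_b):
    """Pitch-matching criterion: two inputs are judged to have the same
    pitch when the same component of the internal representation s_hat
    is the most activated for both."""
    return int(np.argmax(s_hat_a)) == int(np.argmax(s_hat_b))
```

For a pure tone at frequency f sampled at 11,025 Hz, the spectral peak falls near bin f·512/11025.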
The learning results depended on the hyperparameter β and on the size and initial value of the matrix A. Since the hyperparameter β balances the prior p(s), which makes the internal representation s small, against the conditional probability p(x|s), which makes the representation error (g(x; θ) − As)^T Λ(x; θ)(g(x; θ) − As) small, β should be set appropriately. In the simulation, β was set to 500, the initial value of A was set close to the one used in Wightman's model (1973), and the parameter c was constrained so that the mean E[g(x; θ)] was zero.

4.2 Results. Figure 1 shows the mixing matrix A after learning, in which the magnitude of every element of A is depicted. The jth column vector of the matrix A, a_j, can be interpreted as an n-dimensional spectral basis pattern due to the jth internal representation ŝ_j, and an input x is approximated as the exponential of a linear summation of the column vectors, x ≈ exp(Σ_{j=1}^m a_j ŝ_j + c). We therefore call each a_j a basis vector and regard ŝ as an internal representation of the input x. Figure 2 shows six basis vectors out of the matrix A after learning, which correspond to the column vectors surrounded by rectangles in Figure 1. Many harmonic peaks can be observed in those basis vectors. A circle
Figure 1: The mixing matrix A after learning, in which the magnitude of all elements of matrix A is shown. The jth column vector of matrix A, a_j, can be interpreted as an n-dimensional spectral pattern due to the jth internal representation ŝ_j, and an input x is approximated as the exponential of the linear summation of those column vectors, x ≈ exp(Σ_{j=1}^m a_j ŝ_j + c). Note that both the components of a column vector a_j and an internal representation ŝ_j can be negative, because we used an exponential function as the nonlinear function in this simulation. The column vectors surrounded by rectangles correspond to the basis vectors shown in Figure 2.
indicates a harmonic peak, and the title of each panel denotes the fundamental frequency calculated from the circled harmonic peaks so as to minimize the mean square error between multiples of the fundamental frequency and the circled harmonic frequencies. The harmonic structure is especially clear for low fundamental frequencies (see Figure 2A), while it is sometimes less clear for high fundamental frequencies (see Figure 2D). For high fundamental frequencies, the basis vectors sometimes contain large values (see Figure 2C). Such basis collapses for high fundamental frequencies occur presumably because the training data are not well represented by harmonic bases whose fundamental frequencies are high. In other words, the harmonic structure is supposed to arise especially for low fundamental frequencies because the training data are well represented by the acquired harmonic bases. A few basis vectors (see Figure 2F) exhibit no prominent harmonic structure.
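The least-squares fit of the fundamental frequency to the circled peaks amounts to the following one-line estimator (a sketch; the pairing of peaks with harmonic numbers is assumed to be given):

```python
import numpy as np

def fit_f0(peak_freqs, harmonic_numbers):
    """Least-squares fundamental frequency from harmonic peaks:
    minimizing sum_k (n_k * f0 - p_k)^2 over f0 gives
    f0 = sum(n_k * p_k) / sum(n_k ** 2)."""
    n = np.asarray(harmonic_numbers, dtype=float)
    p = np.asarray(peak_freqs, dtype=float)
    return float((n * p).sum() / (n * n).sum())
```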
[Figure 2 appears here: six panels (A–F) plotting the weights of a basis against frequency (Hz).]
Figure 2: Six column vectors out of matrix A after learning, each of which represents a basis vector for an internal representation sj . Each panel corresponds to a rectangle in Figure 1. A circle indicates a harmonic peak, and the title denotes a fundamental frequency calculated from the circled harmonic peaks. Most of the column vectors exhibit harmonic structures. We therefore suppose that each basis vector represents a spectral pattern of a certain pitch, which tends to have a harmonic structure. Especially for high fundamental frequencies, basis vectors sometimes contain components with large values relative to the other components (e.g., C), and a few basis vectors do not exhibit prominent harmonic structures (e.g., F).
Ritsma (1962) examined whether two stimuli, one consisting of a fundamental frequency g together with its double 2g and triple 3g, and the other consisting of three higher-order harmonics of the fundamental frequency, (f − g), f, and (f + g), provide the same pitch sensation. More precisely, the higher-order-harmonics tone v(t) in that experiment had a modulation parameter m that controlled the mixture ratio: v(t) = m sin 2π(f − g)t + sin 2πf t + m sin 2π(f + g)t. We call the former and the latter tones a fundamental tone (FT) and a higher harmonics tone (HHT), respectively (the spectrum of each tone is shown in Figure 3), and we call the frequencies g and f a spacing frequency and a center frequency, respectively. Figure 4 shows the region in which an FT and an HHT provide the identical pitch (Ritsma, 1962). A pair of an FT and an HHT is specified by the parameters g and f in Figure 4. This figure shows several aspects of virtual pitch phenomena. When the center frequency f is too large, virtual pitch does not occur. When the spacing frequency g is too small or too large, virtual pitch does not occur either. Therefore, virtual pitch can be observed for moderate values of f and g. Here r is a reference variable denoting the ratio f/g. Note that there are no common frequency components between the FT
Figure 3: Spectra of a fundamental tone (FT) and a higher harmonics tone (HHT) for m = 100 and m = 50. An FT consists of a fundamental frequency g, its twice frequency 2g, and three-times frequency 3g, and an HHT consists of three higherorder harmonics of the fundamental frequency, ( f −g), f and ( f +g). Frequencies g and f are called a spacing frequency and a center frequency, respectively. A virtual pitch is judged to occur if these two stimuli, left and right, provide the same pitch sensation.
and the HHT in the region of r = f/g ≥ 5. The modulation parameter m represents the clarity of the HHT (see above). As the m value becomes large, the harmonic structure becomes prominent, and the region of virtual pitch enlarges. Figure 5 shows the activations of the internal representation ŝ in our model when presented with FTs and HHTs. We used m = 100% for all simulations in this section. In each of Figures 5A, 5B, and 5C, the upper and lower rows show the amplitude spectrum of the inputs and the activation pattern against the inputs, respectively. In each lower panel is a mark that indicates whether the most activated internal representation is the same when presented with an FT and when presented with an HHT; a circle indicates a match, a cross a mismatch, and a triangle the case in which the internal representation most activated by an FT was the second most activated one by an HHT. When the FTs and HHTs were moderate (corresponding to 160 to 660 Hz), the most activated internal representation was almost the same between the FT and HHT presentations, as can be
Figure 4: The existence region of virtual pitch examined by Ritsma (1962), in which whether an FT and an HHT provide the same pitch sensation was psychologically examined. The two stimuli are specified by a spacing frequency g and a center frequency f . This figure shows several aspects of virtual pitch. When either the center frequency f is too large or the spacing frequency g is too small or too large, virtual pitch does not occur. Therefore, virtual pitch can be observed for moderate values of f and g. r is a reference variable denoting the rate f/g. m is a modulation parameter that represents the clarity of an HHT (see the text for the exact definition). As the m value becomes large, the harmonic structure becomes prominent and then the region of virtual pitch enlarges.
seen in Figures 5A and 5B.³ This indicates that such an FT-HHT pair led to the identical pitch sensation. In contrast, the most activated component of the internal representation showed discrepancies for lower or higher fundamental frequencies, as can be seen in Figure 5C; namely, such an FT-HHT pair did not lead to the identical pitch sensation in our simulation. To examine what kind of feature of the input sound data was extracted by the internal representation s, we reconstructed input sounds from the internal representation. An example is shown in Figure 6. The original input (upper panels) and the corresponding internal representation are the same as those in Figure 5B, though the y-axis has a log scale for visibility. From this figure, we see that the reconstructed spectrum had a clear harmonic structure but that the reconstruction degraded as the center frequency became high. This result
³ The FTs and HHTs depend on the resolution of the DFT. Although the individual activations of the internal representation s were affected by the DFT resolution, the overall characteristics of the activation patterns did not vary with the resolution.
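The FT and HHT stimuli can be generated directly from their definitions above (a sketch; the segment length and sampling frequency follow section 4.1, and the function name is ours):

```python
import numpy as np

def make_ft_hht(f, g, m_percent=100.0, fs=11025, n=512):
    """FT: components at g, 2g, 3g. HHT: v(t) = m sin 2pi(f-g)t
    + sin 2pi f t + m sin 2pi(f+g)t, with m the modulation ratio."""
    t = np.arange(n) / fs
    m = m_percent / 100.0
    ft = sum(np.sin(2 * np.pi * k * g * t) for k in (1, 2, 3))
    hht = (m * np.sin(2 * np.pi * (f - g) * t)
           + np.sin(2 * np.pi * f * t)
           + m * np.sin(2 * np.pi * (f + g) * t))
    return ft, hht
```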
Figure 5: Activations of internal representation s when presented with FTs (the left-most column) or HHTs (the other columns) (m = 100%). (A–C) The upper and lower rows show the amplitude spectrum of input tones and the activation patterns against the inputs, respectively. In each lower panel is a mark that indicates whether the most activated internal representation is the same between when presented with an FT and when presented with an HHT. A circle indicates a match, a cross a mismatch, and a triangle the case that the internal representation most activated by an FT is the second most activated one by an HHT. A light gray bar indicates the component most activated by the FT. When FTs and HHTs are moderate (corresponding to from 160 to 660 Hz), the most activated internal representation was almost the same between when presented with an FT and when presented with the corresponding HHT, as can be seen in A and B; namely, such an FT-HHT pair produces the identical pitch sensation in our simulation. In contrast, the most activated component of internal representation is different for lower or higher fundamental frequencies, as can be seen in C; namely, such an FT-HHT pair does not produce the identical pitch sensation in our simulation.
indicates that the sound data used for the learning involve many harmonic structures, which are prominent especially at low frequencies. The learning by our nonlinear noisy ICA then made the reconstruction error small, especially for low-frequency components. Although we show a single example here, the same tendency can be seen in other cases. A more extensive result is summarized in Figure 8 (for details, see the figure caption). Figure 7 shows the existence region of virtual pitch in our model, that is, the region where the same component of ŝ was most activated in response
Figure 6: Comparison between an input sound spectrum and a spectrum reconstructed from internal representation sˆ . The original input (upper panels) and the corresponding internal representation are the same as those in Figure 5B, though the y-axis has a log scale for visibility. Each reconstructed spectrum has a clear harmonic structure, but the reconstruction is degraded as the center frequency is high.
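The reconstruction shown in Figure 6 follows the generative mapping of section 3.1. A minimal sketch (note that the sign convention for c here follows f(y_i; θ) = exp(y_i − c_i) from section 3.1):

```python
import numpy as np

def reconstruct_spectrum(A, s_hat, c):
    """Reconstruct an amplitude spectrum from the internal representation:
    x_rec = f(A s_hat) = exp(A s_hat - c), the inverse of
    g(x) = log(x) + c."""
    return np.exp(A @ s_hat - c)
```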
to both an FT and the corresponding HHT. We show only the region of f/g ≥ 5. The circles and triangles have the same meaning as in Figure 5; a shaded dot corresponds to a cross in Figure 5. When the center frequency f was fairly large, we did not perform simulations because the frequency resolution sufficient for producing the test sounds could not be realized under the current simulation condition. When the spacing frequency g was too small or too large, the number of shaded dots increased, suggesting that virtual pitch hardly occurred in those regions. This tendency can also be seen in the psychophysical experiment in Figure 4. The reason virtual pitch was observed in the moderate region can be considered as follows. When the spacing frequency was too small, virtual pitch did not occur, possibly because the frequency resolution was not sufficient; although this insufficient resolution is specific to our simulation, similar features may exist in the actual cochlear filter. When the spacing frequency was too large, virtual pitch did not occur because the harmonic structure of the basis vectors collapsed for high fundamental frequencies. We also compared the pitch sensation in our simulation with that of Wightman's central pattern transformation model. The result is shown in Figure 9. The overall characteristics of Wightman's model are similar to those of our simulation, but there are some discrepancies. The main difference is that Wightman's model produced virtual pitch even for high spacing frequencies (g > 800 Hz), whereas our simulation did not produce it in that region. Because the harmonic structures of the basis vectors are clearly prepared even for high spacing frequencies in Wightman's model, the model produced virtual pitch there. Note that virtual pitch does not actually occur when the spacing frequency is high (see Figure 4).
Figure 7: Existence region of virtual pitch in our model (f/g ≥ 5), plotted as spacing frequency g (Hz) against center frequency f (Hz). This figure corresponds to the situation with m = 100% in Figure 4. A circle indicates a parameter pair of g and f with which an FT and an HHT produced the same pitch, and a triangle indicates that the internal representation most activated by an FT corresponded to the second most activated internal representation by an HHT. A dot indicates a parameter pair with which the pitch sensation showed a discrepancy (examples of activation patterns ŝ are shown in Figure 5; those examples correspond to the simulation conditions surrounded by light rectangles). When the spacing frequency g is too small or too large, the number of shaded dots increases, suggesting that virtual pitch hardly occurs in those regions. This tendency can also be seen in the psychophysical experiment in Figure 4.
5 Discussion

In this letter, we proposed a noisy nonlinear ICA that not only generalizes the formerly proposed noisy linear ICA but also reduces to noise-free linear ICA in the noiseless limit. Exploring an estimation function that is free from the assumption of a source p.d.f. in the noisy nonlinear ICA is left as future work. We trained our model using speech and music and examined the pitch sensation of the trained model. We found that the existence region of virtual pitch obtained by our computer simulation is qualitatively similar to
Figure 8: Reconstruction error between an input sound spectrum and the sound spectrum reconstructed from the internal representation ŝ, for various combinations of f and g (plotted as spacing frequency g (Hz) against center frequency f (Hz)). The reconstruction error denotes the absolute difference between the amplitude spectrum of the input sounds and the reconstructed amplitude spectrum, averaged over frequencies. A dark (light) circle indicates that the reconstruction error is large (small). While most of the errors take relatively small values (error < 10), the error sometimes becomes very large; the coloring is therefore performed in a nonlinear fashion for visibility (see inset).
that found in the existing psychological experiment. Chang and Merzenich (2003) reported that continuous exposure to moderate-level noise disturbs the normal development of the topographic representational order of an infant rat's primary auditory cortex and that the primary auditory cortex still has plasticity after the exposure to the noise; namely, the topographic representation can be reorganized in a new sound environment. Their study is consistent with our hypothesis that the transformation in the central auditory system can be acquired and changed through learning, depending on the environment. In particular, the primary auditory cortex seems to be a region responsible for such learning, for several reasons: neurons in the primary auditory cortex have frequency selectivity (Merzenich, Knight, & Roth, 1975; Reale & Imig, 1980), and the primary auditory cortex is known to have neurons that are activated both when presented with a pure tone and when presented with a harmonic
Figure 9: Existence region of virtual pitch in Wightman's model (f/g ≥ 5), plotted in the same way as Figure 7 (spacing frequency g (Hz) against center frequency f (Hz)); this figure also corresponds to the situation with m = 100%. When the spacing frequency g was larger than 260 Hz, virtual pitch was observed; it was not observed when the spacing frequency g was small. This tendency of Wightman's model can also be seen in our simulation, but there are some discrepancies. The major difference is that Wightman's model produced virtual pitch even for high spacing frequencies (g > 800 Hz), whereas our simulation did not produce it in that region. We consider that this arose because the basis vectors in Wightman's model were prepared to have clear harmonic structures even for high spacing frequencies. However, such a feature is not observed in the psychoacoustic experiment (see Figure 4).
combination tone that induces a virtual pitch phenomenon (Pantev, Hoke, Lutkenhoner, & Lehnertz, 1989; Riquimaroux & Hashikawa, 1994). Since our model involves a cepstrum analysis as a special case, from an engineering viewpoint, it may be possible to obtain a pitch sensation model or a feature extraction system that can be applied to, for example, detection of the fundamental frequency, through learning from actual speech and music signals, based on our framework of nonlinear noisy ICA.
S. Maeda, W.-J. Song, and S. Ishii
Appendix ¯ θ¯ ) A.1 Derivation of Equation 3.19. Expected log likelihood Q(A, θ |A, is given by ¯ θ¯ ) = − 1 Q(A, θ|A, T
= −β
T T ¯ θ)(g(x ¯ β p(st |xt , A, t ; θ ) − Ast ) t=1
× t (g(xt ; θ) − Ast )dst + log Z(A, θ )
T 1 g(xt ; θ)T t g(xt ; θ) − 2g(xt ; θ)T t Ast T t=1 + Tr(st sTt AT t A) + log Z(A, θ ).
When the noise is low or the nonlinear function g(xt ; θ ) is linear, terms that are not multiplied by β can be neglected. Then, ¯ θ) ¯ θ|A, ¯ θ) ¯ − Q(A, ¯ F() ≡ Q(A, θ|A, T 1 −2g(xt ; θ)T t A st + Tr(st sTt AT t A) ≈ −β T t=1 T 1 ¯ . −2g(xt ; θ)T t A¯ st + Tr(st sTt A¯ T t A) +β T t=1 Substituting A for A¯ + A, it becomes = 2β
T 1 g(xt ; θ)T t (A¯ + A) st − g(xt ; θ )T t A¯ st T t=1
T 1 ¯ −Tr(st sTt (A¯ + A)T t (A¯ + A))+Tr(st sTt A¯ T t A) T t=1 T β Tr(st sTt AT t A) = − 2 T t=1
+β
+
T 2β ¯ . g(xt ; θ)T t A st − Tr(st sTt AT t A) T t=1
This is identical to equation 3.19.

A.2 Sign of Parameters a and b. The parameter $b$ is positive because the gradient descent always satisfies $F(\Delta A = 0) > 0$, which yields $b > 0$. Determining whether the parameter $a$ is positive is a bit more complicated.
Nonlinear and Noisy Extension of Independent Component Analysis
141
The term $\operatorname{Tr}\!\big(s_t s_t^{\top}\,\Delta A^{\top}\Sigma\,\Delta A\big)$ is the sum of all elements of the matrix that is the Hadamard product of the two matrices $s_t s_t^{\top}$ and $\Delta A^{\top}\Sigma\,\Delta A$. The Hadamard product of two matrices $P$ and $Q$ is written $P\circ Q$, and the $(i,j)$ element of $P\circ Q$ is $p_{ij}q_{ij}$, where $p_{ij}$ and $q_{ij}$ are the $(i,j)$ elements of $P$ and $Q$, respectively. To determine the positivity of this term, the following lemmas are used.

Lemma 1. When a matrix $U$ is (semi-) positive definite, the sum of all elements of $U$ is (semi-) positive.

Lemma 2. When two $n\times n$ matrices $P$ and $Q$ are (semi-) positive definite, their Hadamard product, $U = P\circ Q$, is also (semi-) positive definite.

Lemma 3. The matrices $s_t s_t^{\top}$ and $\Delta A^{\top}\Sigma\,\Delta A$ are (semi-) positive definite.

Lemma 1 holds because when a matrix $U$ is (semi-) positive definite, the sum of all elements of $U$, $\mathbf{1}^{\top}U\mathbf{1}$, is (semi-) positive. Next, we prove lemma 2. When the $n\times n$ matrices $P$ and $Q$ are (semi-) positive definite, they can be represented as $P = L_p^{\top}L_p$ and $Q = L_q^{\top}L_q$, where $L_p$ and $L_q$ are lower triangular factors of $P$ and $Q$, respectively. Let $a_{ij}$ and $b_{ij}$ denote the $(i,j)$ elements of $L_p$ and $L_q$, respectively. Then the $(i,j)$ element of $U = P\circ Q$ is
$$
U_{ij} = \big[(L_p^{\top}L_p)\circ(L_q^{\top}L_q)\big]_{ij}
= \sum_{k=1}^{n}\sum_{l=1}^{n} a_{ki}\,a_{kj}\,b_{li}\,b_{lj}
= \sum_{m=1}^{n^2} \alpha_{m,i}\,\alpha_{m,j},
$$
where $\alpha_{m,i} = a_{q(m)i}\,b_{r(m)i}$, $q(m) = \lceil m/n\rceil$, and $r(m) = m - (q(m)-1)n$. Let $M$ be the $n^2\times n$ matrix whose $(m,j)$ element is $\alpha_{m,j}$. Then the above relation can be written in matrix form as $U = M^{\top}M$, which proves that $U$ is (semi-) positive definite. Next, we prove lemma 3. The matrix $s_t s_t^{\top}$ is obviously (semi-) positive definite. $\Delta A^{\top}\Sigma\,\Delta A$ is also (semi-) positive definite because $\Sigma$ is positive definite, and for any nonzero vector $s$, $s^{\top}\Delta A^{\top}\Sigma\,\Delta A\,s = (\Delta A s)^{\top}\Sigma(\Delta A s) \ge 0$ is satisfied. Lemmas 1, 2, and 3 indicate that the term $\operatorname{Tr}\!\big(H(\hat s_t)^{-1}\,\Delta A^{\top}\Sigma\,\Delta A\big)$ is (semi-) positive.
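These lemmas are standard (lemma 2 is the Schur product theorem), and the chain of implications is easy to check numerically. The following sketch is our own illustration, not part of the original appendix; it builds random (semi-) positive definite matrices and verifies lemmas 1 and 2 together with the trace identity used above.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4

# Two random (semi-)positive definite matrices, P = Lp^T Lp and Q = Lq^T Lq.
Lp = rng.standard_normal((n, n))
Lq = rng.standard_normal((n, n))
P = Lp.T @ Lp
Q = Lq.T @ Lq

# Lemma 2 (Schur product theorem): the Hadamard product of PSD matrices is PSD.
U = P * Q                       # element-wise (Hadamard) product
eigs = np.linalg.eigvalsh(U)
assert eigs.min() > -1e-10

# Lemma 1: the sum of all elements of a PSD matrix, 1^T U 1, is nonnegative.
assert np.sum(U) >= -1e-10

# For symmetric P and Q, the sum of all elements of P∘Q equals Tr(P Q).
assert np.isclose(np.sum(U), np.trace(P @ Q))
```

Together, these checks mirror the argument that the trace term above is (semi-) positive.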
References

Almeida, L. B. (2004). MISEP—linear and nonlinear ICA based on mutual information. Signal Processing, 84(2), 231–245.
Amari, S., & Cardoso, J. F. (1997). Blind source separation—semiparametric statistical approach. IEEE Transactions on Signal Processing, 45(11), 2692–2700.
Amari, S., Chen, T.-P., & Cichocki, A. (1997). Stability analysis of learning algorithms for blind source separation. Neural Networks, 10(8), 1345–1351.
Amari, S., Cichocki, A., & Yang, H. H. (1996). A new learning algorithm for blind signal separation. In D. Touretzky, M. Mozer, & M. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 752–763). Cambridge, MA: MIT Press.
Atick, J. J. (1992). Could information theory provide an ecological theory of sensory processing? Network, 3, 213–251.
Attias, H. (1999). Independent factor analysis. Neural Computation, 11(4), 803–851.
Barlow, H. B. (1961). Sensory communication. Cambridge, MA: MIT Press.
Bell, A. J., & Sejnowski, T. J. (1995). An information maximization approach to blind separation and blind deconvolution. Neural Computation, 7(6), 1129–1159.
Bell, A. J., & Sejnowski, T. J. (1997). The "independent components" of natural scenes are edge filters. Vision Research, 37(23), 3327–3338.
Belouchrani, A., & Cardoso, J. F. (1995). Maximum likelihood source separation by the expectation-maximization technique: Deterministic and stochastic implementation. In Proceedings of NOLTA (pp. 49–53). Las Vegas, NV.
Bermond, O., & Cardoso, J. F. (1999). Approximate likelihood for noisy mixtures. In ICA '99 (pp. 325–330). Aussois, France.
Bregman, A. S., & Campbell, J. (1971). Primary auditory stream segregation and perception of order in rapid sequences of tones. Journal of Experimental Psychology, 89(2), 244–249.
Cardoso, J. F. (1997). Infomax and maximum likelihood for source separation. IEEE Signal Processing Letters, 4(4), 112–114.
Cardoso, J. F., & Souloumiac, A. (1993). Blind beamforming for non-Gaussian signals. IEE Proceedings F, 140(6), 362–370.
Chang, E. F., & Merzenich, M. M. (2003). Environmental noise retards auditory cortical development. Science, 300, 498–502.
de Boer, E. (1956). Pitch of inharmonic signals. Nature, 178, 535–536.
de Boer, E. (1977). Psychophysics and physiology of hearing. San Diego, CA: Academic Press.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, 1–38.
Douglas, S. C., Cichocki, A., & Amari, S. (1998). Bias removal technique for blind source separation with noisy measurements. Electronics Letters, 34(14), 1379–1380.
Gaeta, M., & Lacoume, J. L. (1990). Source separation without a priori knowledge: The maximum likelihood solution. In EUSIPCO (Vol. 24, pp. 621–624). Berlin: Springer-Verlag.
Girolami, M. (2001). A variational method for learning sparse and overcomplete representations. Neural Computation, 13(11), 2517–2532.
Goldstein, J. L. (1973). An optimum processor theory for the central formation of the pitch of complex tones. J. Acoust. Soc. Am., 54, 1496–1516.
Helmholtz, H. L. F. (1863). Die Lehre von den Tonempfindungen als physiologische Grundlage für die Theorie der Musik. Braunschweig: Vieweg.
Højen-Sørensen, P. A., Winther, O., & Hansen, L. K. (2002). Mean-field approaches to independent component analysis. Neural Computation, 14(4), 889–918.
Houtsma, A. J. M., & Goldstein, J. L. (1972). The central origin of the pitch of complex tones: Evidence from musical interval recognition. J. Acoust. Soc. Am., 51, 520–529.
Hyvärinen, A. (1998a). Independent component analysis in the presence of gaussian noise by maximizing joint likelihood. Neurocomputing, 22, 49–67.
Hyvärinen, A. (1998b). New approximations of differential entropy for independent component analysis and projection pursuit. In M. Kearns, M. Jordan, & S. Solla (Eds.), Advances in neural information processing systems, 10 (pp. 273–279). Cambridge, MA: MIT Press.
Hyvärinen, A. (1999). Gaussian moments for noisy independent component analysis. IEEE Signal Processing Letters, 6(6), 145–147.
Jutten, C., & Karhunen, J. (2003). Advances in nonlinear blind source separation. In Proceedings of ICA 2003 (pp. 245–256). Nara, Japan.
Kawanabe, M., & Murata, N. (2000). Independent component analysis in the presence of gaussian noise (Tech. Rep. 2000-03). Tokyo: University of Tokyo.
Lappalainen, H., & Honkela, A. (2000). Advances in independent component analysis. Berlin: Springer-Verlag.
Lee, T.-W., Koehler, B., & Orglmeister, R. (1997). Blind separation of nonlinear mixing models. In IEEE International Workshop on Neural Networks for Signal Processing (pp. 406–415).
Lewicki, M. S. (2002). Efficient coding of natural sounds. Nature Neuroscience, 5(4), 356–363.
Lewicki, M. S., & Sejnowski, T. J. (2000). Learning overcomplete representations. Neural Computation, 12, 337–365.
Lindsay, P. H., & Norman, D. A. (1977). Human information processing: An introduction to psychology. San Diego, CA: Academic Press.
Merzenich, M. M., Knight, P. L., & Roth, G. L. (1975). Representation of cochlea within primary auditory cortex in the cat. J. Neurophysiol., 38, 231–249.
Olshausen, B. A., & Field, D. J. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381, 607–609.
Pantev, C., Hoke, M., Lutkenhoner, B., & Lehnertz, K. (1989). Tonotopic organization of the auditory cortex: Pitch versus frequency representation. Science, 246, 486–488.
Patterson, R. D. (1973). Physical variables determining residue pitch. J. Acoust. Soc. Am., 53, 1565–1572.
Rao, R. P., & Ballard, D. H. (1999). Predictive coding in the visual cortex: A functional interpretation of some extra-classical receptive-field effects. Nature Neuroscience, 2(1), 79–87.
Reale, R. A., & Imig, T. J. (1980). Tonotopic organization in auditory cortex of the cat. J. Comp. Neurol., 192, 265–291.
Riquimaroux, H., & Hashikawa, T. (1994). Units in the primary auditory cortex of the Japanese monkey can demonstrate a conversion and place pitch in the central auditory system. Journal de Physique IV, C5, 419–425.
Ritsma, R. J. (1962). Existence region of the tonal residue. I. J. Acoust. Soc. Am., 42(1), 191–198.
Schouten, J. F. (1940). The perception of pitch. Philips Technical Review, 5, 286–294.
Schouten, J. F., Ritsma, R. J., & Cardozo, B. L. (1962). Pitch of the residue. J. Acoust. Soc. Am., 34, 1418–1424.
Seebeck, A. (1843). Über die Sirene. Ann. Phys. Chem., 60, 449–481.
Taleb, A., & Jutten, C. (1999). Source separation in post-nonlinear mixtures. IEEE Transactions on Signal Processing, 47(10), 2807–2820.
Terhardt, E. (1974). Pitch, consonance, and harmony. J. Acoust. Soc. Am., 55, 1061–1069.
Wightman, F. L. (1973). The pattern-transformation model of pitch. J. Acoust. Soc. Am., 54, 407–416.
Yoshioka, M. (1980). Mathematical relations between three pitch theories. In Spring Meeting of the Acoustical Society of Japan (pp. 683–684).
Zibulevsky, M., & Pearlmutter, B. A. (2001). Blind source separation by sparse decomposition in a signal dictionary. Neural Computation, 13(4), 863–882.
Received April 8, 2003; accepted June 4, 2004.
LETTER
Communicated by Edward Harrington
Online Ranking by Projecting

Koby Crammer
[email protected]
Yoram Singer
[email protected]
School of Computer Science and Engineering, Hebrew University, Jerusalem 91904, Israel
We discuss the problem of ranking instances. In our framework, each instance is associated with a rank or a rating, which is an integer between 1 and k. Our goal is to find a rank-prediction rule that assigns each instance a rank that is as close as possible to the instance's true rank. We describe a group of closely related online algorithms, analyze their performance in the mistake-bound model, and prove their correctness. We describe two sets of experiments, with synthetic data and with the EachMovie data set for collaborative filtering. In the experiments we performed, our algorithms outperform online algorithms for regression and classification applied to ranking.

1 Introduction

The ranking problem we discuss in this article shares common properties with both classification and regression problems. As in classification problems, the goal is to assign one of k possible labels to a new instance. As in regression problems, the set of k labels is structured, since there is a total order relation between the labels. We refer to the labels as ranks and, without loss of generality, assume that the ranks constitute the set {1, 2, . . . , k}. Settings in which it is natural to rank or rate instances rather than classify them are common in tasks such as information retrieval and collaborative filtering. We use the latter as our running example. In collaborative filtering, the goal is to predict a user's rating of new items, such as books or movies, given the user's past ratings of similar items. The goal is to determine whether a movie fan will like a new movie and to what degree, which is expressed as a rank. An example of possible ratings is "run-to-see, very-good, good, only-if-you-must, and do-not-bother." While the different ratings carry meaningful semantics, from a learning-theoretic point of view we model the ratings as a totally ordered set (whose size is five in the example above).
Neural Computation 17, 145–175 (2005)
© 2004 Massachusetts Institute of Technology

The interest in ordering or ranking objects is by no means new and is still the source of ongoing research in many fields, such as mathematical economics, social science, and computer science. For an overview of ranking problems from a learning-theoretic point of view, see Cohen, Schapire, and Singer (1999). One of the main results underscored in that work is a complexity gap between classification learning and ranking learning. To sidestep the inherent intractability of ranking learning, several approaches have been suggested. One possible approach is to cast a ranking problem as a regression problem. Another is to reduce a total order into a set of preferences over pairs (Freund, Iyer, Schapire, & Singer, 2003; Herbrich, Graepel, & Obermayer, 2000). The first approach imposes a metric on the set of ranking rules that might not be realistic, while the second is time-consuming, since it requires increasing the sample size from n to O(n²). In this letter, we consider an alternative approach that directly maintains a totally ordered set via projections. Our starting point is similar to that of Herbrich et al. (2000) in the sense that we project each instance onto the real numbers. However, our work then deviates and operates directly on rankings by associating each rank with a distinct subinterval of the real numbers and adapting the support of each subinterval during learning.

The article is organized as follows. In the next section, we describe a simple and efficient online algorithm that concurrently manipulates the direction onto which we project the instances and the division into subintervals. In section 3, we prove the correctness of the algorithm and analyze its performance in the mistake-bound model under various assumptions. Section 4 contains the description and analysis of a norm-optimized version of the basic ranking algorithm. We then shift our attention to a multiplicative algorithm that is described and analyzed in section 5. We provide empirical validation of the merits of the various ranking algorithms in section 6, which describes experiments comparing the ranking algorithms to online classification and regression algorithms.
Finally, we conclude with a brief discussion and mention a few open problems. Before moving on to the core of the article, we point to a few closely related articles. First, a preliminary version of this letter appeared at Neural Information Processing Systems, 2001, under the title "Pranking with Ranking" (Crammer & Singer, 2001a). This article extends the conference version in numerous directions by providing a complete analysis as well as three new algorithms that were not discussed in the conference version. Second, in recent work, Shashua and Levin (2002) described an SVM-based algorithm for instance ranking. Their algorithms share numerous properties with the algorithms presented in this article but are designed for batch settings, while our focus is on online algorithms. Last, we note that Harrington (2003) demonstrated empirically an improved generalization performance of our algorithm using an "averaging" technique, while preserving the order of the thresholds.

2 The Basic PRank Algorithm

This letter focuses on online algorithms for ranking instances. We are given a sequence $(\bar x^1, y^1), \ldots, (\bar x^t, y^t), \ldots$ of instance-rank pairs. For concreteness,
we assume that each instance $\bar x^t$ is in $\mathbb R^n$ and that its corresponding rank $y^t$ is an element of a finite set $\mathcal Y$ with a total order relation. We assume without loss of generality that $\mathcal Y = \{1, 2, \ldots, k\}$ with $>$ as the order relation. The total order over the set $\mathcal Y$ induces a partial order over the instances in the following natural sense. We say that $\bar x^t$ is preferred over $\bar x^s$ if $y^t > y^s$. We also say that $\bar x^t$ and $\bar x^s$ are not comparable if neither $y^t > y^s$ nor $y^t < y^s$; we denote this case simply as $y^t = y^s$. Note that the induced partial order is of a unique form in that the instances form $k$ equivalence classes that are totally ordered.¹ We stress that although the elements of $\mathcal Y$ are denoted by integers, we do not use the fact that the integers also belong to a metric space; we assume only that the set $\mathcal Y$ is discrete and its elements are totally ordered.

A ranking rule $H$ is a mapping from the instance space to the set of rank values, $H:\mathbb R^n\to\mathcal Y$. The family of ranking rules we discuss in this article employs a vector $\bar w\in\mathbb R^n$ and a set of $k$ thresholds $b_1 \le \cdots \le b_{k-1} \le b_k = \infty$. For convenience, we denote by $\bar b = (b_1, \ldots, b_{k-1})$ the vector of thresholds, excluding $b_k$, which is fixed to $\infty$. Given a new instance $\bar x$, the ranking rule first computes the inner product between $\bar w$ and $\bar x$. The predicted rank is then defined to be the index of the first (smallest) threshold $b_r$ for which $\bar w\cdot\bar x < b_r$. Such a ranking rule divides the space into parallel, equally ranked regions: all the instances that satisfy $b_{r-1} < \bar w\cdot\bar x < b_r$ are assigned the same rank $r$. Formally, given a ranking rule defined by $\bar w$ and $\bar b$, the predicted rank of an instance $\bar x$ is $H(\bar x) = \min_{r\in\{1,\ldots,k\}}\{r : \bar w\cdot\bar x - b_r < 0\}$. Note that the above minimum is always well defined since we set $b_k = \infty$.

The analysis that we use in this letter is based on the mistake-bound model for online learning. The algorithms we describe work in rounds. On round $t$, the learning algorithm gets an instance $\bar x^t$. Given $\bar x^t$, the algorithm outputs a rank, $\hat y^t = \min_r\{r : \bar w\cdot\bar x^t - b_r < 0\}$. It then receives the correct rank $y^t$ and updates its ranking rule by modifying $\bar w$ and $\bar b$. We say that our algorithm made a ranking mistake if $\hat y^t \ne y^t$ and wish to make the predicted rank as close (in a sense described later in this article) as possible to the true rank. Formally, the goal of the learning algorithm is to minimize the ranking loss, which is defined to be the number of thresholds between the true rank and the predicted rank. Using the representation of ranks as integers in $\{1,\ldots,k\}$, the ranking loss after $T$ rounds is equal to the cumulative difference between the predicted and true rank values, $\sum_{t=1}^{T}|\hat y^t - y^t|$. The algorithm we describe updates its ranking rule only on rounds on which the predicted rank value was incorrect. Such algorithms are called conservative.

We now describe the update rule of the algorithm, which is motivated by the perceptron algorithm for classification; hence, we call it the PRank algorithm, shorthand for perceptron ranking. For simplicity, we omit the index of the round when referring to an input instance-rank pair $(\bar x, y)$ and the ranking rule $\bar w$ and $\bar b$. Since $b_1 \le b_2 \le \cdots \le b_{k-1} \le b_k$, the predicted

¹ For a discussion of this type of partial order, see Kemeny and Snell (1962).
rank is correct if $\bar w\cdot\bar x > b_r$ for $r = 1,\ldots,y-1$ and $\bar w\cdot\bar x < b_r$ for $r = y,\ldots,k-1$. We represent the above inequalities by expanding the rank $y$ into $k-1$ virtual variables $y_1,\ldots,y_{k-1}$. We set $y_r = +1$ for the case $\bar w\cdot\bar x > b_r$ and $y_r = -1$ for $\bar w\cdot\bar x < b_r$. Put another way, a rank value $y$ induces the vector $(y_1,\ldots,y_{k-1}) = (+1,\ldots,+1,-1,\ldots,-1)$, where the maximal index $r$ for which $y_r = +1$ is $y-1$. Thus, the prediction of a ranking rule is correct if $y_r(\bar w\cdot\bar x - b_r) > 0$ for all $r$. If the algorithm makes a mistake by ranking $\bar x$ as $\hat y$ instead of $y$, then there is at least one threshold, indexed $r$, for which the value of $\bar w\cdot\bar x$ is on the wrong side of $b_r$, that is, $y_r(\bar w\cdot\bar x - b_r) \le 0$. To correct the mistake, we need to "move" the values of $\bar w\cdot\bar x$ and $b_r$ toward each other. We do so by modifying only the values of the $b_r$'s for which $y_r(\bar w\cdot\bar x - b_r) \le 0$, replacing them with $b_r - y_r$. We also replace the value of $\bar w$ with $\bar w + (\sum_r y_r)\bar x$, where the sum is taken over the indices $r$ for which there was a prediction error, that is, $y_r(\bar w\cdot\bar x - b_r) \le 0$.

An illustration of the update rule is given in Figure 1. In the example, we used the set $\mathcal Y = \{1,\ldots,5\}$. (Note that $b_5 = \infty$ is omitted from all the plots in Figure 1.) The correct rank of the instance is $y = 4$, and thus the value of $\bar w\cdot\bar x$ should fall in the fourth interval, between $b_3$ and $b_4$. However, in the illustration, the value of $\bar w\cdot\bar x$ fell below $b_1$, and the predicted rank is $\hat y = 1$. The threshold values $b_1$, $b_2$, and $b_3$ are a source of the error, since each of them is higher than $\bar w\cdot\bar x$. To compensate for the mistake, the algorithm decreases $b_1$, $b_2$, and $b_3$ by a unit value, replacing them with $b_1-1$, $b_2-1$, and $b_3-1$. It also modifies $\bar w$ to be $\bar w + 3\bar x$, since
$$
\sum_{r\,:\,y_r(\bar w\cdot\bar x - b_r)\le 0} y_r = 3.
$$
Thus, the inner product $\bar w\cdot\bar x$ increases by $3\|\bar x\|^2$. This update is illustrated in the middle plot of Figure 1, and the prediction rule after the update is illustrated in the bottom plot. Note that after the update, the predicted rank of $\bar x$ is $\hat y = 3$, which is closer to the true rank $y = 4$. The pseudocode of the algorithm is given in Figure 2.

To conclude this section, we note that PRank can be straightforwardly combined with Mercer kernels (Vapnik, 1998) and voting techniques (Freund & Schapire, 1999) often used for improving the performance of margin classifiers in batch and online settings (Cristianini & Shawe-Taylor, 2000). To do so, we assume access to an inner-product space $\mathcal X$ equipped with a kernel operator $K:\mathcal X\times\mathcal X\to\mathbb R$. We then replace any inner-product operation the algorithm performs with an implicit inner-product operation defined via the kernel operator; we thus keep inner-product operators rather than explicit vectors. For instance, the update of PRank becomes
$$
K(\bar w^{t+1}, \cdot) \;\leftarrow\; K(\bar w^{t}, \cdot) + \Big(\sum_r \tau_r^t\Big)\,K(\bar x^t, \cdot),
$$
Figure 1: An illustration of the update rule. The ranking rule predicts a rank of $\hat y = 1$ instead of $y = 4$ (top). The update decreases the thresholds $b_1$, $b_2$, $b_3$ by one unit and replaces $\bar w$ with $\bar w + 3\bar x$ (center), so that the predicted rank of $\bar x$ after the update is $\hat y = 3$ (bottom).
and thus the inner product $\bar w^t\cdot\bar x^t$ becomes
$$
\bar w^{t}\cdot\bar x^{t} \;=\; \sum_{s=1}^{t-1}\sum_r \tau_r^{s}\,K(\bar x^s, \bar x^t).
$$
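In an implementation, $\bar w^t$ is thus never formed explicitly: one stores the instances from mistake rounds together with their coefficient vectors $\tau^s$ and evaluates the score as the double sum above. A minimal sketch follows (our own; the function names and the RBF kernel choice are illustrative assumptions, not from the paper):

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    # One possible Mercer kernel; the paper only assumes some K: X x X -> R.
    return float(np.exp(-gamma * np.sum((a - b) ** 2)))

def kernel_score(support, x_new, K=rbf_kernel):
    """Implicit inner product: sum over stored mistake rounds s of
    (sum_r tau_r^s) * K(x^s, x_new), replacing the explicit w . x_new."""
    return sum(tau.sum() * K(xs, x_new) for xs, tau in support)
```

With the linear kernel $K(a, b) = a\cdot b$, this reduces exactly to the explicit inner product with $\bar w = \sum_s (\sum_r \tau_r^s)\,\bar x^s$.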
3 Analysis

Before we prove the mistake bound of the algorithm, we first need to show that it maintains a consistent hypothesis. That is, we need to show that PRank preserves the correct order of the thresholds; otherwise, it might be impossible to induce a rank prediction rule. We prove that the consistency of the thresholds is maintained by showing inductively that for any ranking rule $(\bar w^1, \bar b^1), \ldots, (\bar w^{T+1}, \bar b^{T+1})$ derived by the algorithm along its run, the set of inequalities $b_1^t \le \cdots \le b_{k-1}^t$ holds for all $t$. Clearly, since the initialization of the thresholds is such that $b_1^1 \le b_2^1 \le \cdots \le b_{k-1}^1$, it suffices
Initialize: Set $\bar w^1 = \bar 0$, $b_1^1 = \cdots = b_{k-1}^1 = 0$, $b_k^1 = \infty$.
Loop: For $t = 1, 2, \ldots, T$:
• Receive a new instance $\bar x^t\in\mathbb R^n$.
• Predict: $\hat y^t = \min_{r\in\{1,\ldots,k\}}\{r : \bar w^t\cdot\bar x^t - b_r^t < 0\}$.
• Receive a new rank value $y^t$.
• If $\hat y^t \ne y^t$, update $\bar w^t$ and $\bar b^t$ (otherwise set $\bar w^{t+1} = \bar w^t$ and $b_r^{t+1} = b_r^t$ for all $r$):
  1. For $r = 1,\ldots,k-1$: If $y^t \le r$ Then $y_r^t = -1$ Else $y_r^t = 1$.
  2. For $r = 1,\ldots,k-1$: If $(\bar w^t\cdot\bar x^t - b_r^t)\,y_r^t \le 0$ Then $\tau_r^t = y_r^t$ Else $\tau_r^t = 0$.
  3. Update: $\bar w^{t+1} \leftarrow \bar w^t + (\sum_r \tau_r^t)\,\bar x^t$; For $r = 1,\ldots,k-1$: $b_r^{t+1} \leftarrow b_r^t - \tau_r^t$.
Output: $H(\bar x) = \min_{r\in\{1,\ldots,k\}}\{r : \bar w^{T+1}\cdot\bar x - b_r^{T+1} < 0\}$.

Figure 2: The PRank algorithm.
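The pseudocode of Figure 2 translates almost line for line into NumPy. The sketch below is our own transcription (the class and attribute names are ours); the thresholds are kept as a length-$(k-1)$ array, with $b_k = \infty$ implicit.

```python
import numpy as np

class PRank:
    """Perceptron ranking (PRank): a sketch following the pseudocode of Figure 2."""

    def __init__(self, n_features, k):
        self.w = np.zeros(n_features)          # direction w, initialized to 0
        self.b = np.zeros(k - 1)               # thresholds b_1..b_{k-1}; b_k = +inf implicit
        self.k = k

    def predict(self, x):
        # Rank = index of the first (smallest) threshold b_r with w.x - b_r < 0.
        scores = self.w @ x - self.b
        below = np.nonzero(scores < 0)[0]
        return int(below[0]) + 1 if below.size else self.k   # b_k = inf always satisfies

    def update(self, x, y):
        """One online round: predict, and on a mistake apply the PRank update."""
        y_hat = self.predict(x)
        if y_hat == y:
            return y_hat
        r = np.arange(1, self.k)               # r = 1..k-1
        y_r = np.where(y <= r, -1, 1)          # expand rank y into virtual labels
        err = (self.w @ x - self.b) * y_r <= 0 # thresholds on the wrong side
        tau = np.where(err, y_r, 0)
        self.w += tau.sum() * x                # w <- w + (sum_r tau_r) x
        self.b -= tau                          # b_r <- b_r - tau_r
        return y_hat
```

Replaying the Figure 1 scenario with this sketch (true rank 4, projection initially below $b_1$), a single update lowers $b_1, b_2, b_3$ by one unit, adds $3\bar x$ to $\bar w$, and moves the predicted rank from 1 to 3, matching the illustration.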
to show that the claim holds inductively. For simplicity of the proof below, let us write the update rule of PRank in an alternative form. Let $[[\pi]]$ be 1 if the predicate $\pi$ holds and 0 otherwise. We can then rewrite the value of $\tau_r^t$ (from Figure 2) as $\tau_r^t = y_r^t\,[[(\bar w^t\cdot\bar x^t - b_r^t)\,y_r^t \le 0]]$. Note also that the values of $b_r^t$ are integers for all $r$ and $t$, since for all $r$ we initialize $b_r^1 = 0$ and $b_r^{t+1} - b_r^t \in \{-1, 0, +1\}$.

Lemma 1 (order preservation). Let $\bar w^t$ and $\bar b^t$ be the current ranking rule, where $b_1^t \le \cdots \le b_{k-1}^t$, and let $(\bar x^t, y^t)$ be an instance-rank pair fed to PRank on round $t$. Denote by $\bar w^{t+1}$ and $\bar b^{t+1}$ the resulting ranking rule after the update of PRank. Then $b_1^{t+1} \le \cdots \le b_{k-1}^{t+1}$.

Proof. In order to show that PRank preserves a nondecreasing order of the thresholds, we use the definition of $y_r^t$: $y_r^t = +1$ for $r < y^t$ and $y_r^t = -1$ for $r \ge y^t$. To prove that $b_{r+1}^{t+1} \ge b_r^{t+1}$, we rewrite the threshold update as
$$
b_r^{t+1} = b_r^t - y_r^t\,[[(\bar w^t\cdot\bar x^t - b_r^t)\,y_r^t \le 0]]. \tag{3.1}
$$
In order to prove that $b_{r+1}^{t+1} \ge b_r^{t+1}$ for all feasible $r$, it is sufficient to show that
$$
b_{r+1}^t - b_r^t \;\ge\; y_{r+1}^t\,[[(\bar w^t\cdot\bar x^t - b_{r+1}^t)\,y_{r+1}^t \le 0]] - y_r^t\,[[(\bar w^t\cdot\bar x^t - b_r^t)\,y_r^t \le 0]]. \tag{3.2}
$$
Since by our inductive assumption $b_{r+1}^t \ge b_r^t$ and $b_r^t, b_{r+1}^t \in \mathbb Z$, the value of $b_{r+1}^t - b_r^t$ on the left-hand side of equation 3.2 is a nonnegative integer. Recall also that $y_r^t = 1$ if $y^t > r$ and $y_r^t = -1$ otherwise, and therefore $y_{r+1}^t \le y_r^t$. We now need to analyze two cases. We first consider the case where $y_{r+1}^t \ne y_r^t$, which implies that $y_{r+1}^t = -1$ and $y_r^t = +1$. In this case, the right-hand side of equation 3.2 is at most zero, and the claim trivially holds. The second case is when $y_{r+1}^t = y_r^t$. Here the value of the right-hand side of equation 3.2 cannot exceed 1. If $b_{r+1}^t > b_r^t$, then since both values are integers, the bound of 1 on the right-hand side of equation 3.2 implies that $b_r^{t+1}$ cannot exceed $b_{r+1}^{t+1}$. We are thus left with the case where $b_r^t = b_{r+1}^t$ and $y_{r+1}^t = y_r^t$. In this case, the terms $y_{r+1}^t\,[[(\bar w^t\cdot\bar x^t - b_{r+1}^t)\,y_{r+1}^t \le 0]]$ and $y_r^t\,[[(\bar w^t\cdot\bar x^t - b_r^t)\,y_r^t \le 0]]$ attain the same value. Therefore, the right-hand side of equation 3.2 is zero and $b_{r+1}^{t+1} = b_r^{t+1}$. This completes the proof.

We now turn to the mistake-bound analysis of the algorithm. In order to simplify the analysis, we introduce the following notation. Given a hyperplane $\bar w$ and a set of $k-1$ thresholds $\bar b$, we denote by $\bar v\in\mathbb R^{n+k-1}$ the vector that is the concatenation of $\bar w$ and $\bar b$, that is, $\bar v = (\bar w, \bar b)$. For brevity, we refer to the vector $\bar v$ as a ranking rule. Given two vectors $\bar v' = (\bar w', \bar b')$ and $\bar v = (\bar w, \bar b)$, we have $\bar v'\cdot\bar v = \bar w'\cdot\bar w + \bar b'\cdot\bar b$ and $\|\bar v\|^2 = \|\bar w\|^2 + \|\bar b\|^2$. Note that such a vector $\bar v$ induces a partial order over the vectors in $\mathbb R^n$.

Theorem 1 (mistake bound). Let $(\bar x^1, y^1), \ldots, (\bar x^T, y^T)$ be an input sequence to PRank, where $\bar x^t\in\mathbb R^n$ and $y^t\in\{1,\ldots,k\}$. Denote $R^2 = \max_t\|\bar x^t\|^2$. If there exists a ranking rule $\bar v^* = (\bar w^*, \bar b^*)$ of unit norm with $b_1^*\le\cdots\le b_{k-1}^*$ that classifies the entire sequence correctly with margin $\gamma = \min_{r,t}\{(\bar w^*\cdot\bar x^t - b_r^*)\,y_r^t\} > 0$, then the ranking loss of the algorithm, $\sum_{t=1}^{T}|\hat y^t - y^t|$, is at most
$$
(k-1)\,\frac{R^2+1}{\gamma^2}.
$$
Proof. Let us examine an example $(\bar x^t, y^t)$ that the algorithm received on round $t$. By definition, the algorithm ranked the example using the ranking rule $\bar v^t$, which is composed of $\bar w^t$ and the set of thresholds $\bar b^t$; similarly, $\bar v^{t+1} = (\bar w^{t+1}, \bar b^{t+1})$ is the ranking rule after round $t$. Therefore, $\bar w^{t+1} = \bar w^t + (\sum_r \tau_r^t)\,\bar x^t$ and $b_r^{t+1} = b_r^t - \tau_r^t$ for $r = 1, 2, \ldots, k-1$. Let us denote by $n^t = |\hat y^t - y^t|$ the difference between the true rank and the predicted rank. Since $\tau_r^t$ is zero for all the indices $r$ such that $(\bar w^t\cdot\bar x^t - b_r^t)\,y_r^t > 0$, it is straightforward to verify that $n^t = \sum_r |\tau_r^t|$. Note that if there was no ranking mistake on round $t$, then $\tau_r^t = 0$ for $r = 1,\ldots,k-1$, and thus also $n^t = 0$. To prove the theorem, we bound $\sum_t n^t$ from above by bounding $\|\bar v^{T+1}\|^2$ from above and below.

First, we derive a lower bound on $\|\bar v^{T+1}\|^2$ by bounding $\bar v^*\cdot\bar v^{t+1}$. Substituting the values of $\bar w^{t+1}$ and $\bar b^{t+1}$, we get
$$
\bar v^*\cdot\bar v^{t+1} = \bar v^*\cdot\bar v^{t} + \sum_{r=1}^{k-1}\tau_r^t\,(\bar w^*\cdot\bar x^t - b_r^*). \tag{3.3}
$$
We further bound the second term on the right-hand side by considering two cases, corresponding to whether $\tau_r^t$ is nonzero or zero. Using the definition of $\tau_r^t$ from the pseudocode in Figure 2, we first examine the case where $(\bar w^t\cdot\bar x^t - b_r^t)\,y_r^t \le 0$ and therefore $\tau_r^t = y_r^t$. From the assumption that $\bar v^*$ ranks the examples correctly with a margin of at least $\gamma$, we get that $\tau_r^t(\bar w^*\cdot\bar x^t - b_r^*) \ge \gamma$. The second case is when $(\bar w^t\cdot\bar x^t - b_r^t)\,y_r^t > 0$; in this case, $\tau_r^t = 0$ and thus $\tau_r^t(\bar w^*\cdot\bar x^t - b_r^*) = 0$. Combining the two cases and summing over $r$, we get
$$
\sum_{r=1}^{k-1}\tau_r^t\,(\bar w^*\cdot\bar x^t - b_r^*) \;\ge\; \sum_{r=1}^{k-1}|\tau_r^t|\,\gamma \;=\; n^t\gamma. \tag{3.4}
$$
Combining equations 3.3 and 3.4, we get $\bar v^*\cdot\bar v^{t+1} \ge \bar v^*\cdot\bar v^{t} + n^t\gamma$. Unfolding the sum, we get that after $T$ rounds, the projection of the vector $\bar v^{T+1}$ onto $\bar v^*$ satisfies
$$
\bar v^*\cdot\bar v^{T+1} \;\ge\; \sum_t n^t\gamma \;=\; \gamma\sum_t n^t. \tag{3.5}
$$
Recall that the Cauchy-Schwarz inequality implies that $\|\bar v^{T+1}\|^2\,\|\bar v^*\|^2 \ge (\bar v^{T+1}\cdot\bar v^*)^2$. Using equation 3.5 together with the Cauchy-Schwarz inequality and the assumption that $\bar v^*$ is of unit norm, we get the following lower bound:
$$
\|\bar v^{T+1}\|^2 \;\ge\; \Big(\sum_t n^t\Big)^2\gamma^2. \tag{3.6}
$$
We next bound the norm of $\bar v^{T+1}$ from above. As before, assume that the example $(\bar x^t, y^t)$ was ranked using the ranking rule $\bar v^t$, and denote by $\bar v^{t+1}$ the ranking rule at the end of round $t$. Expanding the values of $\bar w^{t+1}$ and $\bar b^{t+1}$, whose squared norms sum to $\|\bar v^{t+1}\|^2$, we get
$$
\|\bar v^{t+1}\|^2 = \|\bar w^t\|^2 + \|\bar b^t\|^2
+ 2\sum_r \tau_r^t\big(\bar w^t\cdot\bar x^t - b_r^t\big)
+ \Big(\sum_r \tau_r^t\Big)^2\|\bar x^t\|^2 + \sum_r (\tau_r^t)^2 .
$$
Since $\tau_r^t\in\{-1,0,+1\}$, we have $(\sum_r\tau_r^t)^2\le (n^t)^2$ and $\sum_r(\tau_r^t)^2 = n^t$, and we therefore get
$$
\|\bar v^{t+1}\|^2 \;\le\; \|\bar v^{t}\|^2 + 2\sum_r \tau_r^t\big(\bar w^t\cdot\bar x^t - b_r^t\big) + (n^t)^2\|\bar x^t\|^2 + n^t. \tag{3.7}
$$
We further develop the second term using the update rule of the algorithm:
$$
\sum_r \tau_r^t\big(\bar w^t\cdot\bar x^t - b_r^t\big)
= \sum_r [[(\bar w^t\cdot\bar x^t - b_r^t)\,y_r^t \le 0]]\,\big(\bar w^t\cdot\bar x^t - b_r^t\big)\,y_r^t \;\le\; 0. \tag{3.8}
$$
Plugging equation 3.8 into equation 3.7 and using the bound $\|\bar x^t\|^2\le R^2$, we get $\|\bar v^{t+1}\|^2 \le \|\bar v^{t}\|^2 + (n^t)^2 R^2 + n^t$. Thus, the ranking rule obtained after $T$ rounds of the algorithm satisfies
$$
\|\bar v^{T+1}\|^2 \;\le\; R^2\sum_t (n^t)^2 + \sum_t n^t. \tag{3.9}
$$
Combining the lower bound $\|\bar v^{T+1}\|^2 \ge (\sum_t n^t)^2\gamma^2$ with the upper bound of equation 3.9, we have
$$
\Big(\sum_t n^t\Big)^2\gamma^2 \;\le\; \|\bar v^{T+1}\|^2 \;\le\; R^2\sum_t (n^t)^2 + \sum_t n^t.
$$
Dividing the above by $\gamma^2\sum_t n^t$, we finally get
$$
\sum_t n^t \;\le\; \frac{R^2\big[\sum_t (n^t)^2\big]\big/\big[\sum_t n^t\big] + 1}{\gamma^2}. \tag{3.10}
$$
Since by definition $n^t$ is at most $k-1$, it follows that
$$
\sum_t (n^t)^2 \;\le\; \sum_t n^t\,(k-1) \;=\; (k-1)\sum_t n^t.
$$
Plugging this inequality into equation 3.10, we get the desired bound,
$$
\sum_{t=1}^{T}|\hat y^t - y^t| \;=\; \sum_{t=1}^{T} n^t \;\le\; \frac{(k-1)R^2 + 1}{\gamma^2} \;\le\; (k-1)\,\frac{R^2+1}{\gamma^2}.
$$
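Theorem 1 can be exercised numerically: on separable synthetic data, the cumulative ranking loss of one pass of PRank should stay below $(k-1)(R^2+1)/\gamma^2$. The sketch below is our own (the data construction and all names are ours, not the paper's); it also checks lemma 1, that the thresholds remain sorted after the run.

```python
import numpy as np

def prank_run(X, y, k):
    """One pass of PRank (Figure 2); returns cumulative ranking loss and thresholds."""
    w, b, loss = np.zeros(X.shape[1]), np.zeros(k - 1), 0
    for x, yt in zip(X, y):
        s = w @ x - b
        below = np.nonzero(s < 0)[0]
        y_hat = int(below[0]) + 1 if below.size else k
        loss += abs(y_hat - yt)
        if y_hat != yt:
            y_r = np.where(yt <= np.arange(1, k), -1, 1)
            tau = np.where(s * y_r <= 0, y_r, 0)
            w += tau.sum() * x
            b -= tau
    return loss, b

# Separable data: true ranks are read off a projection onto a hidden direction.
rng = np.random.default_rng(1)
k, n, T = 4, 5, 2000
w_star = rng.standard_normal(n)
w_star /= np.linalg.norm(w_star)
X = rng.standard_normal((T, n))
proj = X @ w_star
b_star = np.quantile(proj, [0.25, 0.5, 0.75])   # hidden thresholds b*_1..b*_3
y = 1 + np.searchsorted(b_star, proj)           # ranks in {1, ..., k}

# Margin gamma of the unit-norm rule v* = (w*, b*) / ||(w*, b*)||.
norm_v = np.sqrt(1.0 + b_star @ b_star)
y_r = np.where(y[:, None] <= np.arange(1, k)[None, :], -1, 1)
gamma = (((proj[:, None] - b_star[None, :]) * y_r) / norm_v).min()

loss, b = prank_run(X, y, k)
R2 = (X ** 2).sum(axis=1).max()
bound = (k - 1) * (R2 + 1) / gamma ** 2
```

Because $\gamma$ here is the empirical minimum margin over 2000 points, the bound is typically very loose; the check is a sanity test of the statement rather than a tight comparison.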
We now turn our attention to the inseparable case, in which no positive margin value $\gamma$ exists. To analyze the inseparable case, we use an analysis technique suggested by Freund and Schapire (1999). (This proof technique was first informally given by Vapnik, 1998.) To prove a mistake bound for the inseparable case, each example $(\bar x^t, y^t)$ is augmented with a slack variable, denoted $d^t$. Informally, for each time step $t$, the variable $d^t$ designates how much the margin assumption is violated by the example. Formally, we obtain the following mistake bound for the inseparable case:

Theorem 2 (mistake bound for the inseparable case). Let $(\bar x^1, y^1), \ldots, (\bar x^T, y^T)$ be an input sequence for PRank, where $\bar x^t\in\mathbb R^n$ and $y^t\in\{1,\ldots,k\}$. Denote $R^2 = \max_t\|\bar x^t\|^2$. Let $\bar v^* = (\bar w^*, \bar b^*)$ be a ranking rule of unit norm with $b_1^*\le\cdots\le b_{k-1}^*$. Let $\gamma > 0$ and define
$$
d^t = \max\Big\{0,\; \gamma - \min_r\big\{(\bar w^*\cdot\bar x^t - b_r^*)\,y_r^t\big\}\Big\}. \tag{3.11}
$$
Denote $D^2 = \sum_t (d^t)^2$. Then the ranking loss of the algorithm is bounded above by
$$
\sum_{t=1}^{T}|\hat y^t - y^t| \;\le\; (k-1)\,\frac{\big(D + \sqrt{R^2+1}\big)^2}{\gamma^2}.
$$
The proof of the theorem is based on the proof technique for theorem 1 and is given in the appendix.

To conclude this section, we describe and briefly analyze a simplified version of PRank. This version shares the same algorithmic skeleton as PRank and differs only in the way it modifies its ranking rule. Rather than modifying all of the thresholds in the error set, this version chooses a single threshold to modify and updates $\bar w$ accordingly. Since a single threshold is modified on each round, we use the abbreviation Si-PRank to refer to this version. More formally, we replace step 3 in Figure 2 with the following steps:

3. Define the update index: If $\hat y^t > y^t$ Then $r = \hat y^t - 1$ Else $r = \hat y^t$.
4. Update: $\bar w^{t+1} \leftarrow \bar w^t + \tau_r^t\,\bar x^t$; $\;b_r^{t+1} \leftarrow b_r^t - \tau_r^t$; $\;b_s^{t+1} \leftarrow b_s^t$ for $s\ne r$.
Online Ranking by Projecting
It is easy to verify that this choice of a single threshold to update preserves the overall order of the thresholds. The proof is an immediate consequence of lemma 1. In addition, following exactly the same proof technique as for theorem 1, we get that the mistake bound of PRank also holds for Si-PRank.² Surprisingly, as we see later in section 6, in practice Si-PRank performs slightly better than PRank.

4 A Norm-Optimized Version of PRank

The PRank algorithm presented in the previous section performs the same form of update whenever a ranking error occurs. Furthermore, even when the projection of $\bar{x}^t$ onto $\bar{w}^t$ yields the correct rank value, the value of $\bar{w}^t \cdot \bar{x}^t$ might lie very close to one of the thresholds $b_r^t$. The difference between the projection of $\bar{x}^t$ and the closest threshold plays a similar role to the notion of margin in classification problems and was used in theorem 1 to derive a mistake bound for PRank. In this section, we present a version of PRank that solves on each round a mini-optimization problem that balances two opposing requirements. On one hand, we require that the new ranking rule $\bar{v}^{t+1}$ be as similar as possible to the previous ranking rule $\bar{v}^t$, which encompasses all our knowledge of past examples. On the other hand, we force the new ranking rule to rank the most recent example $\bar{x}^t$ correctly and with a large enough margin. Formally, assuming that there was a rank prediction error on round $t$, we require that the new rule $(\bar{w}^{t+1}, \bar{b}^{t+1})$ satisfy $\min_r\{(\bar{w}^{t+1} \cdot \bar{x}^t - b_r^{t+1})\,y_r^t\} \ge \beta$, where $\beta$ is a positive constant. These two requirements yield the following optimization problem:
$$\min_{\bar{w},\bar{b}} \;\; \frac{1}{2}\,\|(\bar{w},\bar{b}) - (\bar{w}^t,\bar{b}^t)\|^2 \quad \text{subject to: } (\bar{w}\cdot\bar{x}^t - b_r)\,y_r^t \ge \beta \;\;\text{for } r = 1,\ldots,k-1. \tag{4.1}$$
We call this version the Norm-Optimized PRank algorithm, or No-PRank for short. The pseudocode of the algorithm appears in Figure 3. Before analyzing the algorithm, let us first further develop the optimization problem given in equation 4.1 by expanding its Lagrangian function,
$$\mathcal{L} = \frac{1}{2}\,\|(\bar{w},\bar{b}) - (\bar{w}^t,\bar{b}^t)\|^2 - \sum_{r=1}^{k-1} \tau_r\bigl[(\bar{w}\cdot\bar{x}^t - b_r)\,y_r^t - \beta\bigr]. \tag{4.2}$$
2 In fact, an alternative bound can be given for Si-PRank, which states that the number of rounds on which an error occurs (indicated by a strictly positive value of the loss) is upper bounded by $(R^2+1)/\gamma^2$.
Input: Minimal margin parameter $\beta$.
Initialize: Set $\bar{w}^1 = \bar{0}$; $b_1^1 = \cdots = b_{k-1}^1 = 0$; $b_k^1 = \infty$.
Loop: For $t = 1, 2, \ldots, T$:
• Receive a new instance $\bar{x}^t \in \mathbb{R}^n$.
• Predict: $\hat{y}^t = \min_{r\in\{1,\ldots,k\}}\{r : \bar{w}^t\cdot\bar{x}^t - b_r^t < 0\}$.
• Receive a new rank value $y^t$.
• If $y^t \ne \hat{y}^t$, update $\bar{w}^t$ and $\bar{b}^t$ (otherwise set $\bar{w}^{t+1} = \bar{w}^t$, $\bar{b}^{t+1} = \bar{b}^t$):
  1. For $r = 1, \ldots, k-1$: If $y^t \le r$ Then $y_r^t = -1$ Else $y_r^t = 1$.
  2. Update: set $(\bar{w}^{t+1}, \bar{b}^{t+1}) \in \mathbb{R}^{n+k-1}$ to be the minimizer of
     $$\min_{\bar{w},\bar{b}} \;\; \frac{1}{2}\,\|(\bar{w},\bar{b}) - (\bar{w}^t,\bar{b}^t)\|^2 \quad \text{subject to: } (\bar{w}\cdot\bar{x}^t - b_r)\,y_r^t \ge \beta \;\;\text{for } r = 1,\ldots,k-1.$$
Output: $H(\bar{x}) = \min_{r\in\{1,\ldots,k\}}\{r : \bar{w}^{T+1}\cdot\bar{x} - b_r^{T+1} < 0\}$.

Figure 3: The Norm-Optimized PRank algorithm.
Here, $\tau_r$ are nonnegative Lagrange multipliers. Taking the derivative of equation 4.2 with respect to $\bar{w}$ and setting it to zero, we get
$$\frac{\partial \mathcal{L}}{\partial \bar{w}} = \bar{w} - \bar{w}^t - \bar{x}^t \sum_r \tau_r y_r^t = 0 \quad\Rightarrow\quad \bar{w} = \bar{w}^t + \Bigl(\sum_r \tau_r y_r^t\Bigr)\bar{x}^t. \tag{4.3}$$
Repeating the process for $b_r$, we get
$$\frac{\partial \mathcal{L}}{\partial b_r} = b_r - b_r^t + \tau_r y_r^t = 0 \quad\Rightarrow\quad b_r = b_r^t - \tau_r y_r^t. \tag{4.4}$$
Plugging the value of $\bar{w}$ from equation 4.3 and of $b_r$ from equation 4.4 into equation 4.2, we get the following dual problem:
$$\min_{\bar{\tau}}\; Q^t(\bar{\tau}) = \frac{1}{2}\,\|\bar{x}^t\|^2\Bigl(\sum_r \tau_r y_r^t\Bigr)^2 + \frac{1}{2}\sum_r \tau_r^2 + \sum_r \tau_r\bigl[y_r^t(\bar{w}^t\cdot\bar{x}^t - b_r^t) - \beta\bigr]$$
$$\text{s.t. } 0 \le \tau_r, \;\; r = 1,\ldots,k-1. \tag{4.5}$$
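The dual in equation 4.5 is a small quadratic program with only nonnegativity constraints, so it can be solved, for example, by projected gradient descent. The following is a hedged sketch (the solver, the step size choice, and all names are ours, not the paper's); it also reconstructs the primal step via equations 4.3 and 4.4:

```python
def solve_no_prank_dual(w, b, x, y_sign, beta, iters=4000):
    """Minimize Q(tau) of eq. 4.5 subject to tau_r >= 0 by projected gradient."""
    k1 = len(y_sign)                              # number of thresholds, k - 1
    norm2 = sum(xi * xi for xi in x)              # ||x^t||^2
    s = sum(wi * xi for wi, xi in zip(w, x))      # w^t . x^t
    lr = 1.0 / (norm2 * k1 + 1.0)                 # 1 / (Lipschitz constant of grad Q)
    tau = [0.0] * k1
    for _ in range(iters):
        dot = sum(t * yr for t, yr in zip(tau, y_sign))
        for r in range(k1):
            g = norm2 * y_sign[r] * dot + tau[r] + y_sign[r] * (s - b[r]) - beta
            tau[r] = max(0.0, tau[r] - lr * g)    # gradient step + projection
    return tau

# Toy round: x in R^2, k = 3, true rank y = 1 => y_1 = y_2 = -1, beta = 0.5.
w, b, x, y_sign, beta = [0.0, 0.0], [-1.0, 1.0], [1.0, 0.0], [-1, -1], 0.5
tau = solve_no_prank_dual(w, b, x, y_sign, beta)
# Primal step (eqs. 4.3 and 4.4):
coef = sum(t * yr for t, yr in zip(tau, y_sign))
w_new = [wi + coef * xi for wi, xi in zip(w, x)]
b_new = [br - t * yr for br, t, yr in zip(b, tau, y_sign)]
```

In this toy round only the first constraint is violated, so the optimum puts all the dual weight on the first threshold and the reconstructed rule meets the margin $\beta$ there with equality.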
Before proceeding to discuss the formal properties of the No-PRank algorithm, it is worth noting that the new vector obtained at the end of round $t$, $\bar{w}^{t+1}$, is a linear combination of $\bar{w}^t$ and the current instance $\bar{x}^t$. Therefore, it is rather straightforward to use No-PRank in conjunction with kernel methods by maintaining $\bar{w}^t$ in its dual form as a weighted combination of the instances $\bar{x}^1, \ldots, \bar{x}^t$. Note also that the optimization problem of No-PRank reduces to a simple optimization problem with a quadratic objective function and nonnegativity constraints and can be solved using standard convex optimization tools (Fletcher, 1987). Let us now discuss the formal properties of No-PRank. The following simple lemma states that No-PRank is a conservative online algorithm, as it modifies its ranking rule if and only if the minimal margin requirements are not attained for the current instance.

Lemma 2 (conservativeness). Let $(\bar{x}^t, y^t)$ denote the input to No-PRank on round $t$, and let $y_r^t$ be $-1$ if $y^t \le r$ and $+1$ otherwise. Then if $y_r^t(\bar{w}^t\cdot\bar{x}^t - b_r^t) \ge \beta$ for all $r = 1,\ldots,k-1$, the optimum of No-PRank's optimization problem is attained at $\tau_r = 0$ for all $r$.

Proof. Consider the dual form of No-PRank's optimization problem given by equation 4.5. The objective function $Q^t(\bar{\tau})$ is composed of three summands. The first two are clearly nonnegative and attain a value of zero iff all the multipliers $\tau_r$ are zero. If $y_r^t(\bar{w}^t\cdot\bar{x}^t - b_r^t) \ge \beta$, then the third summand is linear in $\bar{\tau}$ with nonnegative coefficients. Thus, its minimum is also attained when $\tau_r$ is zero for all $r$.

Next we show that No-PRank preserves the order of the thresholds along its run and thus can always serve as a valid ranking rule.

Lemma 3 (order preservation). Let $\bar{w}^t$ and $\bar{b}^t$ be the current ranking rule, where $b_1^t \le \cdots \le b_{k-1}^t$, and let $(\bar{x}^t, y^t)$ be an instance-rank pair fed to No-PRank on round $t$. Denote by $\bar{w}^{t+1}$ and $\bar{b}^{t+1}$ the resulting ranking rule after the update of No-PRank. Then $b_1^{t+1} \le \cdots \le b_{k-1}^{t+1}$.

Proof. In order to show that No-PRank maintains a correct (monotonically increasing) order of the thresholds, we use the variables employed by the algorithm along its run. Namely, we set $y_r^t = +1$ for $r < y^t$ and $y_r^t = -1$ for $r \ge y^t$. To prove that $b_{r+1}^{t+1} \ge b_r^{t+1}$ for all $r$, we expand $\bar{b}^{t+1}$ and show that
$$b_{r+1}^t - b_r^t \ge y_{r+1}^t\tau_{r+1}^t - y_r^t\tau_r^t. \tag{4.6}$$
We need to analyze two different settings. The first setting we analyze is when $y_{r+1}^t \ne y_r^t$, which implies that $y_{r+1}^t = -1$ and $y_r^t = +1$. In this case, the right-hand side of equation 4.6 is at most zero, while the left-hand side of the equation is at least zero. Thus, the claim holds. The other case is when $y_{r+1}^t = y_r^t = y$. In this case, equation 4.6 becomes
$$b_{r+1}^t - b_r^t \ge y\,(\tau_{r+1}^t - \tau_r^t). \tag{4.7}$$
Assume by contradiction that the optimal value of equation 4.5 does not satisfy equation 4.7 for some $r$. We now construct another feasible assignment $\bar{\tau}'$ that yields a lower value of the objective function $Q$. Let us define
$$\tau_s' = \begin{cases} \tau_s - y\epsilon & s = r+1 \\ \tau_s + y\epsilon & s = r \\ \tau_s & \text{otherwise,} \end{cases}$$
for some value of $\epsilon > 0$, which is determined below. Informally, the value of $\epsilon$ is set to be small enough so that the constraint $\tau_s' \ge 0$ still holds. (This is possible since $\tau_{r+1} > 0$.) Since $\bar{\tau}$ and $\bar{\tau}'$ differ only in their $r$ and $r+1$ components, we get
$$Q(\bar{\tau}') - Q(\bar{\tau}) = \frac{1}{2}\bigl(\tau_r'^{\,2} + \tau_{r+1}'^{\,2}\bigr) + \tau_r'\bigl[y_r(\bar{w}^t\cdot\bar{x}^t - b_r^t) - \beta\bigr] + \tau_{r+1}'\bigl[y_{r+1}(\bar{w}^t\cdot\bar{x}^t - b_{r+1}^t) - \beta\bigr] - \frac{1}{2}\bigl(\tau_r^2 + \tau_{r+1}^2\bigr) - \tau_r\bigl[y_r(\bar{w}^t\cdot\bar{x}^t - b_r^t) - \beta\bigr] - \tau_{r+1}\bigl[y_{r+1}(\bar{w}^t\cdot\bar{x}^t - b_{r+1}^t) - \beta\bigr].$$

Expanding $\bar{\tau}'$, we now get

$$Q(\bar{\tau}') - Q(\bar{\tau}) = \frac{1}{2}\bigl((\tau_r + y\epsilon)^2 + (\tau_{r+1} - y\epsilon)^2\bigr) + (\tau_r + y\epsilon)\bigl[y_r(\bar{w}^t\cdot\bar{x}^t - b_r^t) - \beta\bigr] + (\tau_{r+1} - y\epsilon)\bigl[y_{r+1}(\bar{w}^t\cdot\bar{x}^t - b_{r+1}^t) - \beta\bigr] - \frac{1}{2}\bigl(\tau_r^2 + \tau_{r+1}^2\bigr) - \tau_r\bigl[y_r(\bar{w}^t\cdot\bar{x}^t - b_r^t) - \beta\bigr] - \tau_{r+1}\bigl[y_{r+1}(\bar{w}^t\cdot\bar{x}^t - b_{r+1}^t) - \beta\bigr]$$
$$= (y\epsilon)^2 + y\epsilon\,(\tau_r - \tau_{r+1}) + y\epsilon\bigl[y_r(\bar{w}^t\cdot\bar{x}^t - b_r^t) - \beta\bigr] - y\epsilon\bigl[y_{r+1}(\bar{w}^t\cdot\bar{x}^t - b_{r+1}^t) - \beta\bigr].$$

Denoting both $y_r$ and $y_{r+1}$ simply as $y$ and using the fact that $y \in \{-1,+1\}$, we get

$$Q(\bar{\tau}') - Q(\bar{\tau}) = \epsilon^2 + y\epsilon\,(\tau_r - \tau_{r+1}) + \epsilon\,(b_{r+1}^t - b_r^t) = \epsilon\Bigl[\epsilon - \bigl(y(\tau_{r+1} - \tau_r) - (b_{r+1}^t - b_r^t)\bigr)\Bigr].$$
Since we assumed by contradiction that $y(\tau_{r+1} - \tau_r) - (b_{r+1}^t - b_r^t) = A > 0$, we can choose $0 < \epsilon \le A/2$ and get
$$Q(\bar{\tau}') - Q(\bar{\tau}) \le \epsilon\,(A/2 - A) = -\epsilon A/2 < 0,$$
which contradicts the assumption that $\bar{\tau}$ is the optimal solution of equation 4.5.

We now turn to the analysis of the performance of No-PRank. We first bound the sum of the weights employed by the dual program as they accumulate along the run of No-PRank, $\sum_t \sum_r \tau_r^t$. Based on the bound on the weights, we then prove a mistake bound on the number of rounds on which an error occurred, that is, $y^t \ne \hat{y}^t$. As we see shortly, the bound depends on both the value of the attainable margin $\gamma$ and the margin parameter $\beta$ employed by the algorithm.

Theorem 3 (bound on weights). Let $(\bar{x}^1, y^1), \ldots, (\bar{x}^T, y^T)$ be an input sequence to No-PRank, where $\bar{x}^t \in \mathbb{R}^n$ and $y^t \in \{1,\ldots,k\}$. Let $\bar{v}^* = (\bar{w}^*, \bar{b}^*)$ be a ranking rule with $b_1^* \le \cdots \le b_{k-1}^*$ and $\|\bar{w}^*\|^2 + \sum_r (b_r^*)^2 = 1$. Assume that $\bar{v}^*$ classifies the entire sequence correctly with a margin value $\gamma = \min_{r,t}\{(\bar{w}^*\cdot\bar{x}^t - b_r^*)\,y_r^t\} > 0$. Then the total sum of the weights generated by No-PRank is bounded by
$$\sum_{t=1}^{T}\sum_r \tau_r^t \le \frac{2\beta}{\gamma^2},$$
where $\beta$ is a predefined parameter of the algorithm (see Figure 3).

Proof. Let us concentrate on an example $(\bar{x}^t, y^t)$ that the algorithm received on round $t$. By construction, the algorithm ranked the example using the ranking rule $\bar{v}^t$, which is composed of $\bar{w}^t$ and the thresholds $\bar{b}^t$. Similarly, we denote by $\bar{v}^{t+1}$ the updated rule $(\bar{w}^{t+1}, \bar{b}^{t+1})$ after round $t$, that is,
$$\bar{w}^{t+1} = \bar{w}^t + \Bigl(\sum_r y_r^t\tau_r^t\Bigr)\bar{x}^t \quad\text{and}\quad b_r^{t+1} = b_r^t - y_r^t\tau_r^t \;\;\text{for } r = 1, 2, \ldots, k-1.$$
To bound $\sum_{t,r}\tau_r^t$ from above, we derive bounds on $\|\bar{v}^{T+1}\|^2$ from both above and below. First, we derive a lower bound on $\|\bar{v}^{T+1}\|^2$ by bounding $\bar{v}^*\cdot\bar{v}^{t+1}$. Substituting the values of $\bar{w}^{t+1}$ and $\bar{b}^{t+1}$, we get
$$\bar{v}^*\cdot\bar{v}^{t+1} = \bar{v}^*\cdot\bar{v}^t + \sum_{r=1}^{k-1}\tau_r^t y_r^t\,(\bar{w}^*\cdot\bar{x}^t - b_r^*).$$
Using the assumption that $\bar{v}^*$ ranks the data correctly with a margin of at least $\gamma$, we get that $y_r^t(\bar{w}^*\cdot\bar{x}^t - b_r^*) \ge \gamma$, and therefore
$$\bar{v}^*\cdot\bar{v}^{t+1} \ge \bar{v}^*\cdot\bar{v}^t + \sum_{r=1}^{k-1}\tau_r^t\,\gamma = \bar{v}^*\cdot\bar{v}^t + \gamma\sum_{r=1}^{k-1}\tau_r^t.$$
Unfolding the sum, we get that after $T$ rounds, the algorithm satisfies
$$\bar{v}^*\cdot\bar{v}^{T+1} \ge \gamma\sum_{t,r}\tau_r^t.$$
Plugging this result into the Cauchy-Schwarz inequality, $\|\bar{v}^{T+1}\|^2\,\|\bar{v}^*\|^2 \ge (\bar{v}^{T+1}\cdot\bar{v}^*)^2$, and using the assumption that $\bar{v}^*$ is of unit norm, we get the lower bound
$$\|\bar{v}^{T+1}\|^2 \ge \Bigl(\sum_{t,r}\tau_r^t\Bigr)^2\gamma^2. \tag{4.8}$$
We next bound the norm of $\bar{v}^{T+1}$ from above. As before, assume that an example $(\bar{x}^t, y^t)$ was ranked using the ranking rule $\bar{v}^t$, and denote by $\bar{v}^{t+1}$ the ranking rule after the round. We now expand the values of $\bar{w}^{t+1}$ and $\bar{b}^{t+1}$ in the norm of $\bar{v}^{t+1}$ and get
$$\|\bar{v}^{t+1}\|^2 = \|\bar{w}^t\|^2 + \|\bar{b}^t\|^2 + 2\sum_r \tau_r^t y_r^t\,(\bar{w}^t\cdot\bar{x}^t - b_r^t) + \|\bar{x}^t\|^2\Bigl(\sum_r \tau_r^t y_r^t\Bigr)^2 + \sum_r (y_r^t\tau_r^t)^2.$$
We add and subtract the term $2\beta\sum_r \tau_r^t$ on the right-hand side of the above equation and get
$$\|\bar{v}^{t+1}\|^2 = \|\bar{w}^t\|^2 + \|\bar{b}^t\|^2 + 2\sum_r \tau_r^t\bigl[y_r^t(\bar{w}^t\cdot\bar{x}^t - b_r^t) - \beta\bigr] + \|\bar{x}^t\|^2\Bigl(\sum_r \tau_r^t y_r^t\Bigr)^2 + \sum_r (y_r^t\tau_r^t)^2 + 2\beta\sum_r \tau_r^t$$
$$= \|\bar{w}^t\|^2 + \|\bar{b}^t\|^2 + 2Q(\bar{\tau}^t) + 2\beta\sum_r \tau_r^t, \tag{4.9}$$
where we used the definition of $Q$ from equation 4.5 to obtain the last equality.
Note that $\bar{\tau} = \bar{0}$ is a feasible solution for the optimization problem posed in equation 4.5. The value attained by $Q$ for this particular choice of $\bar{\tau}$ is zero. Since $\bar{\tau}^t$ is the minimizer of $Q(\bar{\tau})$, we must have that
$$Q(\bar{\tau}^t) \le 0. \tag{4.10}$$
Substituting equation 4.10 in equation 4.9, we obtain the following bound on the norm of $\bar{v}^{t+1}$ in terms of the norm of $\bar{v}^t$:
$$\|\bar{v}^{t+1}\|^2 \le \|\bar{v}^t\|^2 + 2\beta\sum_r \tau_r^t. \tag{4.11}$$
Thus, the ranking rule we obtain after $T$ rounds of the algorithm satisfies the upper bound
$$\|\bar{v}^{T+1}\|^2 \le 2\beta\sum_{t,r}\tau_r^t. \tag{4.12}$$
Combining the lower bound of equation 4.8 with the upper bound of equation 4.12, we have that
$$\Bigl(\sum_{t,r}\tau_r^t\Bigr)^2\gamma^2 \le \|\bar{v}^{T+1}\|^2 \le 2\beta\sum_{t,r}\tau_r^t.$$
Dividing both sides by $\gamma^2\sum_{t,r}\tau_r^t$, we finally get
$$\sum_{t,r}\tau_r^t \le \frac{2\beta}{\gamma^2}. \tag{4.13}$$
We now discuss some properties of the algorithm and its corresponding loss bound. As mentioned above, the algorithm updates its ranking rule only on rounds on which the margin is less than $\beta$, which reflects a minimal margin requirement. Informally, $\beta$ can be viewed as a minimal difference requirement between the projection of the example $\bar{x}^t$ onto $\bar{w}^t$ and any of the thresholds $b_r^t$ (see Figure 4). Thus, the larger $\beta$ is, the better is the separation between the rank levels. Formally, the algorithm attempts to enclose all the examples in subintervals of the real numbers defined by the thresholds. As stated above, based on theorem 3, we can derive a mistake bound for No-PRank. Surprisingly, the dependency on the margin parameter $\beta$ cancels out, and the end result is a mistake bound that depends only on the geometrical properties of the problem: the normalized margin.

Figure 4: An illustration of the margin required by the No-PRank algorithm.

Corollary 1 (mistake bound). Assume the conditions of theorem 3 hold, and denote by $R = \max_t \|\bar{x}^t\|$. Then the number of rounds on which No-PRank makes a mistake is upper bounded by
$$2\,\frac{R^2+1}{\gamma^2}.$$
Proof. To prove the corollary, we use theorem 3 and show that whenever an error occurs on round $t$, the sum of the dual weights $\sum_r \tau_r^t$ is at least $\beta/(R^2+1)$. Since $(\bar{w}^{t+1}, \bar{b}^{t+1})$ must satisfy the constraints of equation 4.1, we have for $r = 1,\ldots,k-1$,
$$(\bar{w}^{t+1}\cdot\bar{x}^t - b_r^{t+1})\,y_r^t \ge \beta.$$
Substituting $\bar{w}^{t+1} = \bar{w}^t + \bar{x}^t\sum_s \tau_s^t y_s^t$ from equation 4.3 and $b_r^{t+1} = b_r^t - \tau_r^t y_r^t$ from equation 4.4, we get that the dual variables $\tau_1^t,\ldots,\tau_{k-1}^t$ satisfy
$$\beta \le \Bigl[\Bigl(\bar{w}^t + \bar{x}^t\sum_s \tau_s^t y_s^t\Bigr)\cdot\bar{x}^t - \bigl(b_r^t - \tau_r^t y_r^t\bigr)\Bigr]\,y_r^t.$$
Rearranging terms in the above equation, we get
$$\beta \le (\bar{w}^t\cdot\bar{x}^t - b_r^t)\,y_r^t + y_r^t\,\|\bar{x}^t\|^2\sum_s \tau_s^t y_s^t + \tau_r^t\,(y_r^t)^2. \tag{4.14}$$
Since a rank prediction error occurred on round $t$, there exists $r$ such that $(\bar{w}^t\cdot\bar{x}^t - b_r^t)\,y_r^t \le 0$. In addition, since $y_r^t \in \{-1,+1\}$, we have that $y_r^t\sum_s \tau_s^t y_s^t \le \sum_s \tau_s^t$. Using these two inequalities in equation 4.14 in conjunction with
the fact that $\tau_r^t \ge 0$, we get
$$\beta \le 0 + \|\bar{x}^t\|^2\sum_s \tau_s^t + \tau_r^t \le (R^2+1)\sum_r \tau_r^t.$$
We have thus shown that if there was a rank prediction error on round $t$, then
$$\frac{\beta}{R^2+1} \le \sum_r \tau_r^t. \tag{4.15}$$
To prove the corollary, we combine theorem 3 with equation 4.15. Denote by $M$ the number of rounds with errors. On each such round, $\beta/(R^2+1) \le \sum_r \tau_r^t$, and for the rest of the rounds, we simply bound the sum $\sum_r \tau_r^t$ from below by 0. We thus get that
$$M\,\frac{\beta}{R^2+1} \le \sum_{t,r}\tau_r^t.$$
Applying theorem 3, we get
$$M\,\frac{\beta}{R^2+1} \le \frac{2\beta}{\gamma^2} \;\Rightarrow\; M \le 2\,\frac{R^2+1}{\gamma^2}.$$
Note that the mistake bounds of theorem 1 and corollary 1 are identical up to a multiplicative factor of $(k-1)/2$. Furthermore, if we employ the trivial fact that $|y^t - \hat{y}^t| \le k-1$, we can immediately obtain a bound on $\sum_t |y^t - \hat{y}^t|$ from corollary 1 that is within a factor of 2 of the bound given in theorem 1. However, the bound for No-PRank is more refined, as in addition to the mistake bound, we can also bound the total sum of the weights $\sum_{t,r}\tau_r^t$. Unfortunately, since $\tau_r^t$ may be arbitrarily close to zero, the more refined analysis does not yield a better mistake bound. One possible direction for improving the mistake bound itself may be obtained by modifying the minimal margin requirement from $\beta$ to $\beta|y^t - \hat{y}^t|$. Another viable approach is to replace the fixed margin constraint $(\bar{w}^{t+1}\cdot\bar{x}^t - b_r^{t+1})\,y_r^t \ge \beta$ with a margin requirement that depends on the difference between the correct label $y^t$ and the specific threshold $r$. Specifically, we define a new set of constraints $(\bar{w}^{t+1}\cdot\bar{x}^t - b_r^{t+1})\,y_r^t \ge \beta a_r$, where $a_r = y^t - r$ for $r < y^t$ and $a_r = r + 1 - y^t$ for $r \ge y^t$. Therefore, the further the threshold $r$ is from the interval containing the projection of the instance on the hyperplane, the larger the margin we require. We leave these possible extensions to future research.
5 A Multiplicative Version of PRank

In this section, we give a multiplicative version of PRank that we term Mu-PRank. This version is analogous to the basic PRank update, but it modifies $\bar{w}$ and $\bar{b}$ in a multiplicative manner. That is, on each round with a rank prediction error, the current weight vector $\bar{w}^t$ and thresholds $\bar{b}^t$ are multiplied by factors that depend on $\bar{x}^t$. The motivation for this version is the multiplicative updates employed in the work of Warmuth and colleagues on online prediction (see, e.g., Kivinen & Warmuth, 1997). Mu-PRank maintains a normalized ranking rule, that is, $\|(\bar{w}^t, \bar{b}^t)\|_1 = 1$ for all $t$. As in PRank, on round $t$, Mu-PRank computes the vector $\bar{\tau}^t$, which determines the update of $\bar{w}^t$ and $\bar{b}^t$. It then takes the exponent of the updated ranking rule and normalizes it so that the $\ell_1$ norm of $(\bar{w}^{t+1}, \bar{b}^{t+1})$ will be 1. The pseudocode of the algorithm is given in Figure 5.

Input: Learning rate $\eta$.
Initialize: Set $w_i^1 = \frac{1}{n+k-1}$ for $i = 1,\ldots,n$; $b_1^1 = \cdots = b_{k-1}^1 = \frac{1}{n+k-1}$; $b_k^1 = \infty$.
Loop: For $t = 1, 2, \ldots, T$:
• Receive a new instance $\bar{x}^t \in \mathbb{R}^n$, $\|\bar{x}^t\|_\infty \le 1$.
• Predict: $\hat{y}^t = \min_{r\in\{1,\ldots,k\}}\{r : \bar{w}^t\cdot\bar{x}^t - b_r^t < 0\}$.
• Receive a new rank value $y^t$.
• If $y^t \ne \hat{y}^t$, update $\bar{w}^t$ and $\bar{b}^t$ (otherwise set $\bar{w}^{t+1} = \bar{w}^t$, $\bar{b}^{t+1} = \bar{b}^t$):
  1. For $r = 1,\ldots,k-1$: If $y^t \le r$ Then $y_r^t = -1$ Else $y_r^t = 1$.
  2. For $r = 1,\ldots,k-1$: If $(\bar{w}^t\cdot\bar{x}^t - b_r^t)\,y_r^t \le 0$ Then $\tau_r^t = y_r^t$ Else $\tau_r^t = 0$.
  3. Define: $Z^t = \sum_{i=1}^{n} w_i^t\,e^{\eta x_i^t\sum_r \tau_r^t} + \sum_{r=1}^{k-1} b_r^t\,e^{-\eta\tau_r^t}$.
  4. Update: For $i = 1,\ldots,n$: $w_i^{t+1} \leftarrow w_i^t\,e^{\eta x_i^t\sum_r \tau_r^t}/Z^t$. For $r = 1,\ldots,k-1$: $b_r^{t+1} \leftarrow b_r^t\,e^{-\eta\tau_r^t}/Z^t$.
Output: $H(\bar{x}) = \min_{r\in\{1,\ldots,k\}}\{r : \bar{w}^{T+1}\cdot\bar{x} - b_r^{T+1} < 0\}$.

Figure 5: The multiplicative algorithm.

Examining the logarithm of $\bar{b}^t$ on each round, it is rather simple to verify that, like PRank, Mu-PRank preserves the order of the thresholds. The additional step Mu-PRank employs in its update stage is the normalization of its ranking rule. However, since this normalization is applied to all the thresholds, it clearly keeps the order of the thresholds intact. More formally, $\log(b_r^t)$ can be written as $\eta u_r^t + \log(C^t)$, where $u_r^t$ is an integer and $C^t$ is independent of $r$. Applying exactly the same proof technique used in lemma 1 to $u_r^t$ gives the following lemma:

Lemma 4 (order preservation). Let $\bar{w}^t$ and $\bar{b}^t$ denote the current ranking rule, and assume that $b_1^t \le \cdots \le b_{k-1}^t$. Then, after $(\bar{x}^t, y^t)$ is fed to Mu-PRank, the new rule $(\bar{w}^{t+1}, \bar{b}^{t+1})$ preserves the order of the thresholds, $b_1^{t+1} \le \cdots \le b_{k-1}^{t+1}$.

Next, we analyze the mistake bound of Mu-PRank. Note that Mu-PRank employs a learning rate parameter, $\eta$. The mistake bound of Mu-PRank given in the following theorem depends on this value. If an a priori lower bound on the margin $\gamma$ is known, we can fix the value of $\eta$ so as to minimize the loss bound of Mu-PRank. The resulting bound is described in corollary 2, which follows theorem 4.

Theorem 4 (mistake bound). Let $(\bar{x}^1, y^1), \ldots, (\bar{x}^T, y^T)$ be an input sequence to Mu-PRank, where $\bar{x}^t \in \mathbb{R}^n$, $\|\bar{x}^t\|_\infty \le 1$, and $y^t \in \{1,\ldots,k\}$. Assume that there exists a ranking rule $\bar{v}^* = (\bar{w}^*, \bar{b}^*)$ with $b_1^* \le \cdots \le b_{k-1}^*$ and $\|(\bar{w}^*, \bar{b}^*)\|_1 = 1$, which classifies the entire sequence correctly with margin $\gamma = \min_{r,t}\{(\bar{w}^*\cdot\bar{x}^t - b_r^*)\,y_r^t\} > 0$. Then the ranking loss of Mu-PRank, $\sum_{t=1}^{T}|\hat{y}^t - y^t|$, is at most
$$\frac{\log(k+n-1)}{\log\Bigl(\frac{2}{e^{\eta(k-1)} + e^{-\eta(k-1)}}\Bigr) + \eta\gamma}.$$

Proof. As before, let $\bar{v}^t = (\bar{w}^t, \bar{b}^t)$ and $\bar{v}^{t+1} = (\bar{w}^{t+1}, \bar{b}^{t+1})$ denote the ranking rules at the beginning and end of round $t$, respectively. To prove the theorem, we analyze the decrease in the Kullback-Leibler (KL) divergence (Cover & Thomas, 1991) between $\bar{v}^t$ and $\bar{v}^*$. The KL divergence of two discrete probability distributions $p$, $q$ is $D_{KL}(p\,\|\,q) = \sum_i p_i\log(p_i/q_i)$. We examine the change of the KL divergence in two consecutive rounds, $\Delta^t = D_{KL}(\bar{v}^*\,\|\,\bar{v}^{t+1}) - D_{KL}(\bar{v}^*\,\|\,\bar{v}^t)$. We bound $\sum_t \Delta^t$ from above and below. We first bound this sum from above.
As a reminder, we use $n^t$ to denote $|\hat{y}^t - y^t|$. Using this notation, we get
$$\Delta^t = \sum_{i=1}^{n} w_i^*\log\frac{w_i^t}{w_i^{t+1}} + \sum_{r=1}^{k-1} b_r^*\log\frac{b_r^t}{b_r^{t+1}} = \sum_{i=1}^{n} w_i^*\log\frac{Z^t}{e^{\eta x_i^t\sum_r \tau_r^t}} + \sum_{r=1}^{k-1} b_r^*\log\frac{Z^t}{e^{-\eta\tau_r^t}}$$
$$= \log Z^t - \eta\sum_{r=1}^{k-1}\tau_r^t\,(\bar{w}^*\cdot\bar{x}^t - b_r^*) \;\le\; \log Z^t - \eta\gamma\sum_{r=1}^{k-1}|\tau_r^t| \;=\; \log Z^t - \eta\gamma\,n^t, \tag{5.1}$$
where we used the definition of $n^t$ and $\tau_r^t$ in conjunction with the fact that $\bar{v}^*$ achieves a margin of $\gamma$ to obtain the inequality above. We now bound $\log(Z^t)$. We need the following inequality,
$$e^{\eta x} \le \frac{a+x}{2a}\,e^{\eta a} + \frac{a-x}{2a}\,e^{-\eta a}, \tag{5.2}$$
which holds for $a, \eta > 0$ and $x \in [-a, a]$. (The proof of this inequality is an immediate application of the convexity of the exponential function.) Now recall that
$$Z^t = \sum_{i=1}^{n} w_i^t\,e^{\eta x_i^t\sum_r \tau_r^t} + \sum_{r=1}^{k-1} b_r^t\,e^{-\eta\tau_r^t}. \tag{5.3}$$
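Inequality 5.2 is just the statement that on $[-a,a]$ the convex function $t \mapsto e^{\eta t}$ lies below its chord; a quick numerical sanity check (a sketch of ours, not part of the paper):

```python
import math
import random

def chord(x, a, eta):
    # right-hand side of equation 5.2: the chord of exp(eta*t) over [-a, a]
    return (a + x) / (2 * a) * math.exp(eta * a) + (a - x) / (2 * a) * math.exp(-eta * a)

random.seed(0)
for _ in range(10000):
    a = random.uniform(0.1, 5.0)
    eta = random.uniform(0.01, 2.0)
    x = random.uniform(-a, a)
    assert math.exp(eta * x) <= chord(x, a, eta) + 1e-9
```

At the endpoints $x = \pm a$ the chord and the exponential coincide, which is where the bound is tight.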
We bound the two sums above separately. For the first sum, we bound each term $e^{\eta x_i^t\sum_r \tau_r^t}$. Using equation 5.2 in conjunction with the fact that $|x_i^t| \le 1$ and $|\sum_r \tau_r^t| \le k-1$, we get
$$e^{\eta x_i^t\sum_r \tau_r^t} \le \frac{k-1+x_i^t\sum_r\tau_r^t}{2(k-1)}\,e^{\eta(k-1)} + \frac{k-1-x_i^t\sum_r\tau_r^t}{2(k-1)}\,e^{-\eta(k-1)}. \tag{5.4}$$
Similarly, $|\tau_r^t| \le 1 \le k-1$, and thus
$$e^{-\eta\tau_r^t} \le \frac{k-1-\tau_r^t}{2(k-1)}\,e^{\eta(k-1)} + \frac{k-1+\tau_r^t}{2(k-1)}\,e^{-\eta(k-1)}. \tag{5.5}$$
Using the bounds from equations 5.4 and 5.5 in equation 5.3, we get
$$Z^t \le \sum_i w_i^t\left[\frac{k-1+x_i^t\sum_r\tau_r^t}{2(k-1)}\,e^{\eta(k-1)} + \frac{k-1-x_i^t\sum_r\tau_r^t}{2(k-1)}\,e^{-\eta(k-1)}\right] + \sum_r b_r^t\left[\frac{k-1-\tau_r^t}{2(k-1)}\,e^{\eta(k-1)} + \frac{k-1+\tau_r^t}{2(k-1)}\,e^{-\eta(k-1)}\right]$$
$$= \frac{1}{2}\bigl(e^{\eta(k-1)} + e^{-\eta(k-1)}\bigr)\left(\sum_i w_i^t + \sum_r b_r^t\right) + \frac{e^{\eta(k-1)} - e^{-\eta(k-1)}}{2(k-1)}\sum_r \tau_r^t\,(\bar{w}^t\cdot\bar{x}^t - b_r^t). \tag{5.6}$$
The definition of $\tau_r^t$ implies that $\tau_r^t(\bar{w}^t\cdot\bar{x}^t - b_r^t) \le 0$ (either there was a ranking error or $\tau_r^t = 0$), and $e^{\eta(k-1)} - e^{-\eta(k-1)} \ge 0$. In addition, since Mu-PRank normalizes its ranking rule at the end of each round, we know that $\|(\bar{w}^t, \bar{b}^t)\|_1 = \sum_i w_i^t + \sum_r b_r^t = 1$. Using these facts in equation 5.6, we get the following bound on $Z^t$:
$$Z^t \le \frac{1}{2}\bigl(e^{\eta(k-1)} + e^{-\eta(k-1)}\bigr). \tag{5.7}$$
Using equation 5.7 in equation 5.1, we obtain an upper bound on $\Delta^t$:
$$\Delta^t \le \log\Bigl(\frac{1}{2}\bigl(e^{\eta(k-1)} + e^{-\eta(k-1)}\bigr)\Bigr) - \eta\gamma\,n^t.$$
Since a ranking error occurred, we know that $n^t \ge 1$. In addition, the argument of the logarithm is at least 1; we can therefore rearrange terms and write
$$\Delta^t \le n^t\left[\log\Bigl(\frac{1}{2}\bigl(e^{\eta(k-1)} + e^{-\eta(k-1)}\bigr)\Bigr) - \eta\gamma\right]. \tag{5.8}$$
Summing over $t$, we get
$$\sum_t \Delta^t \le \left[\log\Bigl(\frac{1}{2}\bigl(e^{\eta(k-1)} + e^{-\eta(k-1)}\bigr)\Bigr) - \eta\gamma\right]\sum_t n^t. \tag{5.9}$$
To bound $\sum_t \Delta^t$ from below, we unravel the sum and get
$$\sum_t \Delta^t = \sum_t\bigl(D_{KL}(\bar{v}^*\,\|\,\bar{v}^{t+1}) - D_{KL}(\bar{v}^*\,\|\,\bar{v}^t)\bigr) = D_{KL}(\bar{v}^*\,\|\,\bar{v}^{T+1}) - D_{KL}(\bar{v}^*\,\|\,\bar{v}^1) \ge -D_{KL}(\bar{v}^*\,\|\,\bar{v}^1),$$
where the inequality is due to the fact that the KL divergence is always nonnegative. Using the value of $\bar{v}^1$, we get that $D_{KL}(\bar{v}^*\,\|\,\bar{v}^1) \le \log(k-1+n)$. Combining the lower bound on $\sum_t \Delta^t$ with equation 5.9, we get
$$\log(k-1+n) \ge \left[\log\Bigl(\frac{2}{e^{\eta(k-1)} + e^{-\eta(k-1)}}\Bigr) + \eta\gamma\right]\sum_t n^t.$$
Rearranging terms, we finally get the desired bound:
$$\sum_t n^t \le \frac{\log(k-1+n)}{\log\Bigl(\frac{2}{e^{\eta(k-1)} + e^{-\eta(k-1)}}\Bigr) + \eta\gamma}.$$
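One round of the multiplicative update of Figure 5 can be sketched in plain Python as follows (a hedged sketch: the `predict` helper and all names are ours; `b` holds the $k-1$ finite thresholds, with $b_k = \infty$ implicit):

```python
import math

def predict(w, b, x):
    s = sum(wi * xi for wi, xi in zip(w, x))
    return next((r + 1 for r, br in enumerate(b) if s - br < 0), len(b) + 1)

def mu_prank_round(w, b, x, y, eta):
    """One Mu-PRank round: multiplicative update followed by l1 normalization."""
    y_hat = predict(w, b, x)
    if y_hat == y:
        return y_hat
    k1 = len(b)                                    # k - 1 thresholds
    s = sum(wi * xi for wi, xi in zip(w, x))
    y_r = [-1 if y <= r else 1 for r in range(1, k1 + 1)]
    tau = [yr if (s - br) * yr <= 0 else 0 for br, yr in zip(b, y_r)]
    t_sum = sum(tau)
    new_w = [wi * math.exp(eta * xi * t_sum) for wi, xi in zip(w, x)]
    new_b = [br * math.exp(-eta * tr) for br, tr in zip(b, tau)]
    Z = sum(new_w) + sum(new_b)                    # Z^t of step 3
    w[:] = [v / Z for v in new_w]
    b[:] = [v / Z for v in new_b]
    return y_hat

# n = 2 features, k = 3 ranks: uniform start with l1 norm 1.
w, b = [0.25, 0.25], [0.25, 0.25]
mu_prank_round(w, b, [1.0, 0.0], y=1, eta=0.5)
```

After this single round the rule is renormalized to unit $\ell_1$ norm and now ranks the example correctly.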
As discussed above, the bound of theorem 4 depends on the learning rate $\eta$. If $\gamma$ or a lower bound on $\gamma$ is known, we can set $\eta$ to be
$$\eta = \frac{1}{2(k-1)}\,\log\frac{k-1+\gamma}{k-1-\gamma}.$$
For this choice of $\eta$, we get the following corollary:

Corollary 2. If we run Mu-PRank with
$$\eta = \frac{1}{2(k-1)}\,\log\frac{k-1+\gamma}{k-1-\gamma},$$
then under the assumptions of theorem 4, the cumulative ranking loss obtained by Mu-PRank is bounded above by
$$\sum_t n^t \le 2\,(k-1)^2\,\frac{\log(n+k-1)}{\gamma^2}.$$
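The constant in corollary 2 follows because this choice of $\eta$ makes the denominator in theorem 4 at least $\gamma^2/(2(k-1)^2)$; a numerical check of that inequality (a sketch of ours, not from the paper):

```python
import math

def theorem4_denominator(eta, gamma, k):
    # denominator of the bound in theorem 4
    A = eta * (k - 1)
    return math.log(2.0 / (math.exp(A) + math.exp(-A))) + eta * gamma

for k in (3, 5, 10):
    for gamma in (0.1, 0.5, 0.9 * (k - 1)):       # any 0 < gamma < k - 1
        eta = math.log((k - 1 + gamma) / (k - 1 - gamma)) / (2 * (k - 1))
        assert theorem4_denominator(eta, gamma, k) >= gamma**2 / (2 * (k - 1)**2) - 1e-12
```

Dividing $\log(n+k-1)$ by this lower bound on the denominator gives exactly the bound stated in the corollary.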
This corollary implies that the cumulative ranking loss of Mu-PRank is inversely proportional to the square of the margin $\gamma$. Note that a direct comparison of this bound to the mistake bound of PRank (see theorem 1) is not possible, as the notion of margin here is different. (For PRank, we used the $\ell_2$ norm of the instances and the ranking rule to define the margin, while for Mu-PRank, we tacitly use the $\ell_1$ norm of the ranking rule in conjunction with the $\ell_\infty$ norm of the instances.) This relation between the bounds also holds between additive and multiplicative online algorithms for classification (Rosenblatt, 1958; Littlestone, 1987, 1988, 1989). Putting these differences in the notion of margin aside, the two bounds exhibit the same quadratic dependency on the margin.

6 Experiments

In this section, we describe experiments we performed that compared the variants of PRank discussed in the previous sections with two other online learning algorithms applied to ranking: a multiclass generalization of the Perceptron algorithm (Crammer & Singer, 2001b), denoted MCP, and the Widrow-Hoff algorithm (Widrow & Hoff, 1960) for online regression learning, which we denote by WH. For WH, we fixed its learning rate to a
constant value. The hypotheses that the variants of PRank and the two other algorithms maintain share similarities but differ in their complexity: PRank maintains a vector $\bar{w}$ of dimension $n$ and a vector $\bar{b}$ of $k-1$ modifiable thresholds, totaling $n+k-1$ parameters; MCP maintains $k$ prototypes, which are vectors of dimension $n$, yielding $kn$ parameters; WH maintains a single vector $\bar{w}$ of size $n$. Therefore, MCP builds the most complex hypothesis, while WH builds the simplest. We describe two sets of experiments with two different data sets. The data set used in the first experiment is synthetic and was generated in a similar way to the data set used by Herbrich et al. (2000). We first generated points $\bar{x} = (x_1, x_2)$ uniformly at random from the unit square $[0,1]^2$. Each point was assigned a rank $y$ from the set $\{1,\ldots,5\}$ according to the following ranking rule, $y = \max_r\{r : 10((x_1-0.5)(x_2-0.5)) + \xi > b_r\}$, where $\bar{b} = (-\infty, -1, -0.1, 0.25, 1)$ and $\xi$ is normally distributed noise with zero mean and a standard deviation of 0.125. We generated sequences of instance-rank pairs, each of length 8000. We fed the sequences to PRank, Si-PRank, MCP, and WH and then obtained predictions for each instance. We converted the real-valued predictions of WH into ranks by rounding each prediction to its closest rank value. As in Herbrich et al. (2000), we used a nonhomogeneous polynomial of degree 2, $K(\bar{x}_1, \bar{x}_2) = ((\bar{x}_1\cdot\bar{x}_2) + 1)^2$, as the inner product operation between each input instance and the hyperplanes that the four additive algorithms maintain. At each time step, we computed for each algorithm the cumulative ranking loss normalized by the instantaneous sequence length. Formally, the time-averaged loss after $T$ rounds is $(1/T)\sum_{t=1}^{T}|\hat{y}^t - y^t|$. We computed the loss for each algorithm we tested for $T = 1,\ldots,8000$.
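This synthetic setup can be reproduced in a simplified, hedged form (ours, not the paper's exact protocol): we run basic PRank (Figure 2) with an explicit product feature $(x_1-0.5)(x_2-0.5)$ in place of the degree-2 polynomial kernel, for a single run instead of 100:

```python
import random

def prank_round(w, b, x, y):
    """Basic PRank round: update w and all thresholds in the error set."""
    s = sum(wi * xi for wi, xi in zip(w, x))
    y_hat = next(r + 1 for r, br in enumerate(b) if s - br < 0)  # b[-1] == +inf
    if y_hat != y:
        y_r = [-1 if y <= r else 1 for r in range(1, len(b))]
        tau = [yr if (s - br) * yr <= 0 else 0 for br, yr in zip(b, y_r)]
        t_sum = sum(tau)
        for i in range(len(w)):
            w[i] += t_sum * x[i]
        for r, tr in enumerate(tau):
            b[r] -= tr
    return y_hat

random.seed(0)
thresholds = [-1.0, -0.1, 0.25, 1.0]              # b_2..b_5 of the target rule
w, b = [0.0] * 3, [0.0, 0.0, 0.0, 0.0, float("inf")]
loss = 0
for t in range(8000):
    x1, x2 = random.random(), random.random()
    score = 10 * (x1 - 0.5) * (x2 - 0.5) + random.gauss(0, 0.125)
    y = 1 + sum(1 for br in thresholds if score > br)
    phi = [x1, x2, (x1 - 0.5) * (x2 - 0.5)]       # explicit product feature
    loss += abs(prank_round(w, b, phi, y) - y)
```

By lemma 1, the thresholds remain sorted throughout the run regardless of the data, which makes the learned rule a valid ranking rule at every round.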
To increase the statistical significance of the results, we repeated the process 100 times, picking a new random instance-rank sequence of length 8000 each time, and then averaged the instantaneous losses across the 100 runs. The results are depicted on the left-hand side of Figure 6. The size of the symbols in the plot is larger than the 95% confidence intervals for each result depicted in the figure. In this experiment, the performance of MCP is consistently worse than the performance of WH and PRank. WH initially suffers the smallest instantaneous loss, but after about 500 rounds, both PRank and Si-PRank start to outperform MCP and WH. Eventually, the ranking loss that PRank and Si-PRank suffer is significantly lower than that of both WH and MCP.

Figure 6: Comparison of the time-averaged ranking loss of PRank, Si-PRank, WH, and MCP on synthetic data.

In the second set of experiments, we used the EachMovie data set (McJones, 1997). This data set is used for collaborative filtering tasks and contains ratings of movies provided by 61,265 people. Each person in the data set viewed a subset of movies from a collection of 1623 titles. Each viewer rated each movie that she saw using one of six possible ratings: 0, 0.2, 0.4, 0.6, 0.8, 1. We chose subsets of people who viewed a significant number of movies, extracting for evaluation people who had rated at least 100 movies. There were 7542 such viewers. We chose at random one person among these viewers and set the person's ratings to be the target rank. We used the ratings of the other viewers as features. Thus, the goal is to learn to predict the "taste" of a random user, with the user's past ratings serving as feedback and the ratings of fellow viewers as features. The prediction rule associates a weight with each fellow viewer and therefore can be seen as learning correlations between the tastes of different viewers. Next, we subtracted 0.5 from each rating; the possible rating values are therefore -0.5, -0.3, -0.1, 0.1, 0.3, 0.5. This linear transformation enabled us to assign a value of zero to movies that had not been rated. We fed each pair of features and rank level in an online fashion. Since we chose viewers who rated at least 100 movies, we were able to perform at least 100 rounds of online predictions and updates. We repeated this experiment 500 times, each time choosing a random viewer as the target rank. The results are given in the left-hand plots of Figure 7. The error bars in the plot indicate 95% confidence levels. We repeated the experiment using viewers who had seen at least 200 movies. (There were 1802 such viewers.) The results of this experiment are given in the right-hand plots of Figure 7. The plots at the top of the figure compare the additive algorithms: PRank, Si-PRank, MCP, and WH. Along the entire run of the algorithms, PRank and Si-PRank are significantly
Figure 7: Comparison of the time-averaged ranking loss of the variants of PRank, WH, and MCP on the EachMovie data set using viewers who rated at least 200 movies (top) and at least 100 movies (bottom).
better than WH and consistently better than the multiclass perceptron algorithm, although the latter employs a hypothesis that is substantially bigger. Comparing the two additive variants of PRank, we see that the ranking loss of Si-PRank is slightly lower than that of PRank. The plots at the bottom of the figure compare the best of the additive algorithms, Si-PRank, with Mu-PRank run with different learning rates ($\eta = 0.5, 1, 4, 16$). One of the goals of this experiment is to check the dependency of the performance of Mu-PRank on its learning rate. It is clear from the two bottom plots of Figure 7 that the performance of Mu-PRank is sensitive to the choice of the learning rate. Note, though, that the best learning rate is problem dependent: the best learning rate is $\eta = 4$ for the first partition of EachMovie, while $\eta = 0.5$ results in the smallest ranking loss in the case of the second partition. This type of behavior is also exhibited by multiplicative algorithms in other problems such as binary classification (Kivinen & Warmuth, 1997). Nonetheless, Mu-PRank outperforms most of the other algorithms we compared for a broad range
of learning rates. Focusing first on the left plot, we observe that after about 60 rounds, Si-PRank achieves the best performance, while at the start of the training process, its ranking loss is inferior to that of most of the copies of Mu-PRank. Summing up, both the additive and multiplicative versions of PRank outperform regression and classification algorithms when evaluated with respect to the ranking loss. While the superior performance of PRank is not surprising, it provides an empirical validation of the formal analysis presented in the previous sections.

7 Conclusion

In this article, we described a family of algorithms for instance ranking. The roots of the algorithms go back to the perceptron algorithm (Rosenblatt, 1958). One of the major results in the article is the description of a new approach for solving ranking problems. While most of the previous approaches reduce the problem of ranking to a classification of pairs, we presented an alternative approach that builds a connection between rank levels and subintervals of the reals. An open problem that arises from both the theoretical analysis and the empirical results is deciding which version of PRank to use. In particular, PRank and Si-PRank share the same mistake bound, while in our experiments, Si-PRank performed better than PRank. All of the versions of PRank presented in this article are online algorithms, and for their analysis we used the mistake-bound model. An interesting research direction is the design and analysis of algorithms for batch settings in which all of the training examples are given at once. As mentioned in section 1, Shashua and Levin (2002) have described a batch algorithm for ranking. An interesting question is how the different variants relate to the algorithm of Shashua and Levin.
In addition to continuing our research on online algorithms for ranking problems, an interesting research direction is the design and generalization analysis of batch algorithms for various ranking losses.

Appendix: Technical Proofs

A.1 Proof of Theorem 2. To prove the theorem, we introduce a slight modification of theorem 1 by relaxing the assumption that v̄* is of unit norm and adding the norm of v̄* to the mistake bound of the basic PRank algorithm. We thus write the mistake bound of theorem 1 as

$$\sum_{t=1}^{T} |\hat{y}^t - y^t| \le (k-1)\,\frac{(R^2 + 1)\,\|\bar{v}^*\|^2}{\gamma^2}.$$
Since the case D = 0 reduces to the setting of theorem 1, we can assume that D > 0. We prove the theorem by transforming the inseparable problem into a separable one. We do so by expanding each original instance x̄^t ∈ R^n into
Online Ranking by Projecting
173
a vector z̄^t ∈ R^{n+T} as follows. The first n coordinates of z̄^t are set to x̄^t. The (n + t)th coordinate of z̄^t is set to Δ, a positive real number whose value is set below. The rest of the coordinates of z̄^t are set to zero. We similarly extend the ranking rule (w̄*, b̄*) to (ū*, c̄*) ∈ R^{n+T} × R^{k−1} as follows. We rewrite the value of d^t from equation 3.11 as

$$d^t = \max\left\{\gamma - \min_{r \ge y^t}\{(\bar{w}^* \cdot \bar{x}^t - b^*_r)\,y^t_r\},\; 0,\; \gamma - \min_{r < y^t}\{(\bar{w}^* \cdot \bar{x}^t - b^*_r)\,y^t_r\}\right\}. \tag{A.1}$$
With each ℓ ∈ Nn, n ≥ 2, we associate an ℓth class or category denoted by Cℓ. A common approach to solving such problems is to learn a vector-valued function f = (fℓ : ℓ ∈ Nn): X → R^n and classify x in the class Cq such that q = argmax_{ℓ∈Nn} fℓ(x). Given a training set {(xj, kj) : j ∈ Nm} ⊆ X × Nn with m > n where, for every j ∈ Nm, xj belongs to class C_{kj}, the output yj ∈ R^n is a binary vector with all components equal to −1 except for the kj th component, which equals 1. Thus, the function fℓ should separate as well as possible examples in the class Cℓ (the positive examples) from examples in the remaining classes (the negative examples).¹ In particular, if we choose Q(yj, f(xj)) := Σ_{ℓ∈Nn} max(0, 1 − f_{kj}(xj) + fℓ(xj)), the minimization problem 4.7 leads to the multiclass formulation studied in Vapnik (1998) and Weston and Watkins (1998), while the choice Q(yj, f(xj)) := max{max(0, 1 − f_{kj}(xj) + fℓ(xj)) : ℓ ∈ Nn, ℓ ≠ kj} leads to the multiclass formulation of Crammer and Singer (2001) (see also Bennett & Mangasarian, 1993, for related work in the context of linear programming). (For further information on the quadratic programming formulation and a discussion of algorithms for the solution of these optimization problems, see Bennett & Mangasarian, 1993; Crammer & Singer, 2001; Vapnik, 1998; and Weston & Watkins, 1998.) Finally, we wish to emphasize that from a Bayesian perspective, kernels can be seen as covariances of a gaussian process in the RKHS. This relation is discussed in Wahba (1990) and has been further developed in machine learning (see, e.g., Williams & Barber, 1998; Cornford, Csato, Evans, & Opper, 2004).

5 Kernels

In this section, we consider the problem of characterizing a wide variety of kernels of a form that has been found to be useful in learning theory (see Vapnik, 1998). We begin with a discussion of dot product kernels.
To recall this idea, we let our space X be R^n, and on X we put the usual Euclidean inner product x · y = Σ_{j∈Nn} x_j y_j, x = (x_j : j ∈ Nn), y = (y_j : j ∈ Nn). A dot product kernel is any function of x · y, and a typical example is to choose K_p(x, y) = (x · y)^p, where p is a positive integer. Our first result provides a substantial generalization. To this end, we let Nm be the set of vectors in R^m whose components are nonnegative integers, and if α = (α_j : j ∈ Nm) ∈ Nm, z ∈ R^m, we define z^α := Π_{j∈Nm} z_j^{α_j}, α! := Π_{j∈Nm} α_j!, and |α| := Σ_{j∈Nm} α_j. Finally, we let C be the set of complex numbers.

¹ The case n = 2 (binary classification) is not included here because it merely reduces to learning a scalar-valued function.
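A quick numerical illustration of the claim above (our own sketch, not from the paper): the Gram matrix of the dot product kernel K_p(x, y) = (x · y)^p is positive semidefinite for every positive integer p.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 3))          # 8 points in R^3

for p in (1, 2, 3):
    G = (X @ X.T) ** p                   # Gram matrix of K_p(x, y) = (x . y)^p
    # All eigenvalues should be nonnegative (up to floating-point noise).
    assert np.linalg.eigvalsh(G).min() >= -1e-8
```

The entrywise power of a positive semidefinite matrix is again positive semidefinite, which is exactly the Schur lemma used repeatedly in this section.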
On Learning Vector-Valued Functions
193
Definition 4. We say that a function h: C^m → L(Y) is entire whenever there is a sequence of bounded operators {A_α : α ∈ Nm} ⊆ L(Y) such that for every z ∈ C^m and any c ∈ Y, the function

$$\sum_{\alpha \in N_m} (c, A_\alpha c)\, z^\alpha$$

is an entire function on C^m and, for every c ∈ Y and z ∈ C^m, it equals (c, h(z)c).² In this case, we write

$$h(z) = \sum_{\alpha \in N_m} A_\alpha z^\alpha, \quad z \in C^m. \tag{5.1}$$
Let B1, . . . , Bm be any n × n complex matrices and h an entire function with values in L(Y). We let h((B1), . . . , (Bm)) be the n × n matrix whose (j, ℓ) element is the operator h((B1)_{jℓ}, . . . , (Bm)_{jℓ}) ∈ L(Y).

Proposition 2. If {Kj : j ∈ Nm} is a set of kernels on X × X with values in R, and h: R^m → L(Y) is an entire function of the type 5.1 where {A_α : α ∈ Nm} ⊆ L+(Y), then K = h(K1, . . . , Km) is a kernel on X × X with values in L(Y).

Proof. For any {cj : j ∈ Nm} ⊆ Y and {xj : j ∈ Nm} ⊆ X, we must show that

$$\sum_{j,\ell \in N_m} (c_j, h(K_1(x_j, x_\ell), \ldots, K_m(x_j, x_\ell))\, c_\ell) \ge 0.$$

To this end, we write the sum in the form

$$\sum_{\alpha \in N_m} \sum_{j,\ell \in N_m} K_1^{\alpha_1}(x_j, x_\ell) \cdots K_m^{\alpha_m}(x_j, x_\ell)\,(c_j, A_\alpha c_\ell).$$
Since A_α ∈ L+(Y), it follows that the matrix ((c_j, A_α c_ℓ)), j, ℓ ∈ Nm, is positive semidefinite. Therefore, by the lemma of Schur (Aronszajn, 1950), we see that the matrix whose (j, ℓ)th element appears in the sum above is positive semidefinite for each α ∈ Nm. From this fact, the result follows.

² A function f: C^m → C is called entire or analytic if it has a convergent power series everywhere in C^m.
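The Schur lemma invoked here is easy to check numerically (a sanity check of ours, not part of the paper): the entrywise (Hadamard) product of two positive semidefinite matrices is again positive semidefinite.

```python
import numpy as np

rng = np.random.default_rng(1)

def random_psd(n):
    a = rng.standard_normal((n, n))
    return a @ a.T                        # A A^T is always positive semidefinite

for _ in range(200):
    P, Q = random_psd(5), random_psd(5)
    schur = P * Q                         # entrywise (Hadamard) product
    assert np.linalg.eigvalsh(schur).min() >= -1e-9
```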
194
C. Micchelli and M. Pontil
In our next observation, we show that the function h with the property described in proposition 2 must have the form 5.1 with {A_α : α ∈ Nm} ⊆ L+(Y), provided it and the kernels K1, . . . , Km satisfy some additional conditions. This issue is resolved in FitzGerald, Micchelli, and Pinkus (1995) in the scalar case. To apply those results, we find the following definition useful:

Definition 5. We say that the set of kernels Kj : X × X → C, j ∈ Nm, is full if for every n and any n × n positive semidefinite matrices Bj, j ∈ Nm, there exists a sequence of sets of distinct points {{x_ℓ^s : ℓ ∈ Nn} : s ∈ N} ⊆ X such that for every j ∈ Nm, Bj = lim_{s→∞} (Kj(x_ℓ^s, x_k^s))_{ℓ,k∈Nn}.

For example, when m = 1 and X is an infinite-dimensional Hilbert space with inner product (·, ·), the kernel (·, ·) is full.

Proposition 3. If {Kj : j ∈ Nm} is a set of full kernels on X × X, h: C^m → L(Y), there exists some r > 0 such that the constant γ = sup{‖h(z)‖ : ‖z‖ ≤ r} is finite, and if h(K1, . . . , Km) is a kernel with values in L(Y), then h is entire of the form

$$h(z) = \sum_{\alpha \in N_m} A_\alpha z^\alpha, \quad z \in C^m,$$
for some {A_α : α ∈ Nm} ⊆ L+(Y). For the proof, see Micchelli and Pontil (2003).

Let us now comment on translation-invariant kernels on R^n with values in L(Y). This case is covered in great generality in Berberian (1966) and Fillmore (1970), where the notion of operator-valued Borel measures is described and used to generalize the theorem of Bochner (see Fillmore, 1970). For the purpose of applications, we point out the following sufficient condition on a function h: R^n → L(Y) to give rise to a translation-invariant kernel. Specifically, for each x ∈ R^n, we let W(x) ∈ L+(Y), and define for x ∈ R^n the function

$$h(x) = \int_{\mathbb{R}^n} e^{i x \cdot t}\, W(t)\, dt.$$

Consequently, for every m ∈ N, {xj : j ∈ Nm} ⊆ R^n, and {cj : j ∈ Nm} ⊆ Y, we have that

$$\sum_{j,\ell \in N_m} (c_j, h(x_j - x_\ell)\, c_\ell) = \int_{\mathbb{R}^n} \sum_{j,\ell \in N_m} (c_j e^{i x_j \cdot t}, W(t)(c_\ell e^{i x_\ell \cdot t}))\, dt,$$
which implies that K(x, t) := h(x − t) is a kernel on R^n × R^n. In Berberian (1966) and Fillmore (1970), all translation-invariant kernels are characterized in the form

$$h(x) = \int_{\mathbb{R}^n} e^{i x \cdot t}\, d\mu(t),$$

where dµ is an operator-valued Borel measure with values in L+(Y). In particular, this says, for example, that a function

$$h(x - t) = \sum_{j \in N_m} e^{i (x - t) \cdot t_j}\, B_j,$$

where {Bj : j ∈ Nm} ⊆ L+(R^n), gives a translation-invariant kernel too. We also refer to Berberian (1966) and Fillmore (1970) to assert the claim that a function g: R+ → R+ such that K(x, t) := g(‖x − t‖²) is a radial kernel on X × X into L(Y) for any Hilbert space X has the form

$$g(s) = \int_0^{\infty} e^{-\sigma s}\, d\mu(\sigma),$$

where again dµ is an operator-valued measure with values in L+(Y). When the measure is discrete and Y = R^n, we conclude that

$$K(x, t) = \sum_{j \in N_m} e^{-\sigma_j \|x - t\|^2}\, B_j$$
is a kernel for any {Bj : j ∈ Nm} ⊆ L+(R^n) and {σj : j ∈ Nm} ⊂ R+. This is the form of the kernel we use in our numerical simulations.

6 Practical Considerations

In this final section, we describe several issues of a practical nature. These include a description of some numerical simulations we have performed, a list of several problems for which we feel learning vector-valued functions is valuable, and a brief review of previous literature on the subject.

6.1 Numerical Simulations. In order to investigate the practical advantages offered by learning within the proposed function spaces, we carried out a series of three preliminary experiments in which we tried to learn a target function f0 by minimizing the regularization functional in equation 4.1 under different conditions. In all experiments, we set f0 to be of the form

$$f_0(x) = K(x, x_1)c_1 + K(x, x_2)c_2, \quad c_1, c_2 \in \mathbb{R}^n,\; x \in [-1, 1]. \tag{6.1}$$
In the first experiment, the kernel K was given by a mixture of two gaussian functions,

$$K(x, t) = S e^{-6\|x-t\|^2} + T e^{-30\|x-t\|^2}, \quad x, t \in [-1, 1], \tag{6.2}$$
where S, T ∈ L+(R^n). Here we compared the solution obtained by minimizing equation 4.1 either in the RKHS of the kernel K above (we call this method 1) or in the RKHS of the diagonal kernel,

$$I\left(\frac{\mathrm{tr}(S)}{n}\, e^{-6\|x-t\|^2} + \frac{\mathrm{tr}(T)}{n}\, e^{-30\|x-t\|^2}\right), \quad x, t \in [-1, 1], \tag{6.3}$$
where tr denotes the trace of a square matrix (we call this method 2). We performed different series of simulations, each identified by the number of outputs n ∈ {2, 3, 5, 10, 20} and an output noise parameter a ∈ {0.1, 0.5, 1}. Specifically, for each pair (n, a), we computed the matrices S = A∗A and T = B∗B, where the elements of the matrices A and B were uniformly sampled in the interval [0, 1]. The centers x1, x2 ∈ [−1, 1] and the vector parameters c1, c2 ∈ [−1, 1]^n defining the target function f0 were also uniformly distributed. The training set {(xj, yj) : j ∈ Nm} ⊂ [−1, 1] × R^n was generated by sampling the function f0 with noise: xj was uniformly distributed in [−1, 1], and yj = f0(xj) + εj, with εj also uniformly sampled in the interval [−a, a]. In all simulations, we used a set of m = 20 points for training and a separate validation set of 100 points to select the optimal value of the regularization parameter. Then we computed, on a separate test set of 200 samples, the mean squared error between the target function f0 and the function learned from the training set. This simulation was repeated 100 times; that is, for each pair (n, a), 100 test error measures were produced for method 1 and method 2. As a final comparison of method 1 and method 2, we computed the average difference between the test error of method 2 and the test error of method 1 and its standard deviation; thus, a positive value of this average means that method 1 performs better than method 2. These results are shown in Table 1, where each cell contains the average difference of the test squared errors and, below it, its standard deviation. Not surprisingly, the results indicate that the use of the nondiagonal kernel on average always improves performance, especially when the number of outputs increases.
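Under our reading of equation 6.2, the matrix-valued kernel used to generate the data can be sketched as follows (dimensions, seeding, and the positive-semidefiniteness check are our own choices, not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 3                                     # number of outputs
A = rng.uniform(0, 1, (n, n))
B = rng.uniform(0, 1, (n, n))
S, T = A.T @ A, B.T @ B                   # S = A*A, T = B*B lie in L_+(R^n)

def K(x, t):
    """Matrix-valued kernel of equation 6.2 on scalar inputs x, t in [-1, 1]."""
    d2 = (x - t) ** 2
    return S * np.exp(-6.0 * d2) + T * np.exp(-30.0 * d2)

# The block Gram matrix on any finite set of inputs must be positive semidefinite.
xs = np.linspace(-1.0, 1.0, 5)
G = np.block([[K(xi, xj) for xj in xs] for xi in xs])
assert np.linalg.eigvalsh(G).min() >= -1e-8
```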
In the second experiment, the target function in equation 6.1 was modeled by the diagonal kernel

$$K(x, t) = I\left(a\, e^{-6\|x-t\|^2} + b\, e^{-30\|x-t\|^2}\right), \quad x, t \in [-1, 1],$$

where a, b ∈ [0, 1], a + b = 1. In this case, we compared the optimal solution produced by minimizing equation 4.1 in either the RKHS of the above kernel (we call this method 3) or the RKHS of the kernel

$$\frac{a n}{\mathrm{tr}(S)}\, S e^{-6\|x-t\|^2} + \frac{b n}{\mathrm{tr}(T)}\, T e^{-30\|x-t\|^2}, \quad x, t \in [-1, 1]$$

(we call this method 4). The results of this experiment are shown in Table 2, where it is evident that the nondiagonal kernel very often does worse
Table 1: Results of Experiment 1.

a\n     2                   3                   5                 10                20
0.1     2.01e−4 (1.34e−4)   9.28e−4 (5.07e−4)   .0101 (.0203)     .0178 (.0101)     .0331 (.0238)
0.5     1.83e−3 (1.17e−3)   .0276 (.0123)       .0197 (.126)      .0279 (.0131)     .5508 (.3551)
1       .0167 (.0104)       .0238 (.0097)       .0620 (.0434)     .3397 (.3608)     .5908 (.3801)

Notes: The average difference between the test error of method 2 and the test error of method 1, with its standard deviation (in parentheses), for different values of the number of outputs n and the noise level a. The average error of method 1 (not reported here) ranged between 2.80e−3 and 1.423.
Table 2: Results of Experiment 2.

a\n     2                   3                   5                    10                   20
0.1     8.01e−4 (6.76e−4)   7.72e−4 (8.05e−4)   −6.85e−4 (5.00e−4)   6.72e−4 (5.76e−4)    7.42e−4 (3.99e−3)
0.5     1.57e−3 (1.89e−3)   .0133 (.0152)       7.30e−3 (4.51e−3)    −7.30e−3 (4.51e−3)   8.64e−3 (4.63e−3)
1       .0113 (.0086)       .0149 (.0245)       9.17e−3 (5.06e−3)    .0242 (.0103)        .0233 (.0210)

Notes: The average difference between the test error of method 4 and the test error of method 3, with its standard deviation (in parentheses), for different values of the number of outputs n and the noise level a. The average error of method 3 (not reported here) ranged between 3.01e−3 and 6.95e−2.
than the diagonal kernel. Comparing these results with those in Table 1, we conclude that the nondiagonal kernel does not offer any advantage when n = 2, 3, but this type of kernel helps dramatically when the number of outputs is larger than or equal to 5 (it does little damage in experiment 2 and brings great benefit in experiment 1). There is no current explanation for this phenomenon. Thus, it appears to us that such a kernel and its generalizations, as discussed above, may be important for applications where the number of outputs is large and complex relations among them are likely to occur. The last experiment addressed hyperparameter tuning. Specifically, we again chose a target function as in equation 6.1 and K as in equation 6.2, but now S = S0, where (S0)ℓq = 1 if ℓ = q, −.4 if ℓ = q + 1 (with periodic conditions, i.e., n + 1 ≡ 1), and zero otherwise, and T = T0 is a sparse off-diagonal positive definite binary matrix. Thus, S0 models correlations between adjacent components of f (it acts as a difference operator), whereas T0 accounts for possible "far-away" component correlations. We learn this
target function by means of three kernel models. Model 1 consists of kernels as in equation 6.2 with S = S0 + λI and T = ρT0, λ, ρ ≥ 0. Model 2 is as model 1 but with ρ = 0. Finally, model 3 consists of diagonal kernels as in equation 6.3, with S = S0 and T = νT0, ν > 0. The data generation setting was as in the above experiments, except that the matrices S0 and T0 were kept fixed in each simulation (this explains the small variance in the plots below) and the output noise parameter was fixed at 0.5. The hyperparameters of each model, as well as the regularization parameter in equation 4.1, were found by minimizing the squared error computed on a separate validation set containing 100 independently generated points. Figure 1 depicts the mean squared test error with its standard deviation as a function of the number of training points (5, 10, 20, 30) for n = 10, 20, 40 output components. The results clearly indicate the advantage offered by model 1. Although the above experiments are preliminary and consider a simple data setting, they highlight the value offered by matrix-valued kernels.

6.2 Application Scenarios. We outline some practical problems arising in the context of learning vector-valued functions where the above theory can be of value. Although we do not treat the issues that would arise in a detailed study of these problems, we hope that our discussion will motivate substantial studies of them. A first class of problems deals with finite-dimensional vector spaces. For example, an interesting problem arises in reinforcement learning or control when we need to learn a map between a set of sensors placed on a robot or autonomous vehicle and a set of actions taken by the robot in response to the surrounding environment (see, e.g., Franke et al., 1998). Another instance of this class of problems deals with the space of n × m matrices over a field whose choice depends on the specific application. Such spaces can be equipped with the standard Frobenius inner product.
One of many problems that come to mind is to compute a map that transforms an ℓ × k image x into an n × m image y. Here both images are normalized to the unit interval, so that X = [0, 1]^{ℓ×k} and Y = [0, 1]^{n×m}. There are many specific instances of this problem. In image reconstruction, the input x is an incomplete (occluded) image obtained by setting to 0 the pixel values of a fixed subset in an underlying image y, which forms our target output. A generalization of this problem is image denoising, where x is an n × m image corrupted by a fixed but unknown noise process and the output y is the underlying "clean" image. A different application is image morphing, which consists of computing the pointwise correspondence between a pair of images depicting two similar objects, for example, the faces of two people. The first image, x0, is meant to be fixed (a reference image), whereas the second one, x, is sampled from a set of possible images in X = [0, 1]^{n×m}. The output y ∈ R^{2(n×m)} is a vector field in the input image plane that associates each pixel of the input image with the vector of coordinates of the corresponding pixel in the reference image (see, e.g., Beymer & Poggio, 1996, for more information on these issues).
Figure 1: Experiment 3 (n = 10, 20, 40). Mean square test error as a function of the training set size for model 1 (solid line), model 2 (dashed line), and model 3 (dotted line). See the text for descriptions of these models.
A second class of problems deals with spaces of strings, such as text, speech, or biological sequences. In this case, Y is a space of finite sequences whose elements are in a (usually finite) set A. A famous problem is to compute a text-to-speech map that transforms a word from a certain fixed language into its sound, as encoded by a string of phonemes and stresses. This problem was studied by Sejnowski and Rosenberg (1987) in their NETtalk system. Their approach consists of learning several Boolean functions that subsequently are concatenated to obtain the desired text-to-speech map. Another approach is to learn directly a vector-valued function. In this case, Y could be described by means of an appropriate RKHS (see below). Another important problem is protein structure prediction, which consists of predicting the 3D structure of a protein immersed in an aqueous solution (see, e.g., Baldi, Pollastri, Frasconi, & Vullo, 2002). Here x = (xj ∈ A : j ∈ Nℓ) is the primary sequence of a protein, that is, a sequence over an alphabet of 20 possible symbols representing the amino acids present in nature. The length ℓ of the sequence varies typically between a few tens and a few thousand elements. The output y = (yj ∈ R^3 : j ∈ Nℓ) is a vector with the same length as x, with yj the position of the jth amino acid of the input sequence. Toward the solution of this difficult problem, an intermediate step consists of predicting a neighborhood relation among the amino acids of a protein. In particular, recent work has focused on predicting a square binary matrix C, called the contact map matrix, where C(j, k) = 1 if the amino acids j and k are at a distance smaller than or of the order of 10 Å, and zero otherwise (see Baldi et al., 2002). Note that in this last class of problems, the output space is not a Hilbert space, since the sum of two vectors of different lengths is not defined. An approach to overcome this difficulty is to embed Y into an RKHS.
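The contact map construction just described is straightforward to sketch (a minimal version of ours; the 10 Å threshold is from the text, while the array layout and function name are our assumptions):

```python
import numpy as np

def contact_map(coords, threshold=10.0):
    """Binary contact map from 3D residue coordinates (one row per amino acid):
    C[j, k] = 1 iff residues j and k are within `threshold` angstroms."""
    diffs = coords[:, None, :] - coords[None, :, :]
    dist = np.linalg.norm(diffs, axis=-1)       # pairwise Euclidean distances
    return (dist <= threshold).astype(int)

# Three residues on a line: only the first two are in contact.
coords = np.array([[0.0, 0.0, 0.0], [5.0, 0.0, 0.0], [20.0, 0.0, 0.0]])
```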
This requires choosing a scalar-valued kernel G: Y × Y → R, which provides an RKHS HG. We associate every element y ∈ Y with the function G(y, ·) ∈ HG and denote this map by G: Y → HG. If we wish to learn a function f: X → Y, we instead learn the composition mapping f̃ := G ◦ f. We then can obtain f(x) for every x ∈ X as

$$f(x) = \arg\min_{y \in \mathcal{Y}} \|G(y, \cdot) - \tilde{f}(x)\|_{H_G}.$$

If G is chosen to be bijective, the minimum is unique. This idea is also described in Weston, Chapelle, Elisseeff, Schölkopf, and Vapnik (2003), where some applications of it are presented.

As a third and final class of problems that illustrate the potential of vector-valued learning, we point to the possibility of learning a curve, or even a manifold. In the first case, the usual paradigm for the computer generation of a curve in R^n is to start with scalar-valued basis functions {Mj : j ∈ Nm} and a set of control points {cj : j ∈ Nm} ⊆ R^n, and consider
the vector-valued function

$$f = \sum_{j \in N_m} c_j M_j$$
(see, e.g., Micchelli, 1995). We are given data {yj : j ∈ Nm} ⊆ R^n, and we want to learn the control points. However, the input data, which belong to a unit interval, say X = [0, 1], are unknown. We then fix {xj : j ∈ Nm} ⊂ [0, 1], and we learn a parameterization τ: [0, 1] → [0, 1] such that f(τ(xj)) = yj. So the problem is to learn both τ and the control points. Similarly, if {Mj : j ∈ Nm} are functions on R^k, k < n, we face the problem of learning a manifold M obtained by embedding R^k in R^n. It seems that to learn the manifold M, a kernel of the form

$$K(x, t) = \sum_{j \in N_m} M_j(x)\, M_j(t)\, A_j$$
where {Aj : j ∈ Nm} ⊂ L+(R^n) would be appropriate, because the functions generated by such a kernel lie in the manifold generated by the basis functions; namely, K(x, t)c ∈ M for every x, t ∈ R^k, c ∈ R^n.

6.3 Previous Work on Vector-Valued Learning. The problem of learning vector-valued functions has been addressed in the statistical literature under the name of multiple response estimation or multiple output regression (see Hastie, Tibshirani, & Friedman, 2002). Here we briefly discuss two methods that we find interesting. A well-studied technique is the Wold partial least-squares (PLS) approach. This method is similar to principal component analysis but with the important difference that the principal directions are computed simultaneously in the input and output spaces, both being finite-dimensional Euclidean spaces (see, e.g., Hastie et al., 2002). Once the principal directions have been computed, the data are projected along the first n directions, and a least-squares fit is computed in the projected space, where the optimal value of the parameter n can be estimated by means of cross-validation. We note that, like principal component analysis, partial least squares is a linear model; that is, it can model only vector-valued functions that depend linearly on the input variables. Recent work by Rosipal and Trejo (2001) and Bennett and Embrechts (2003) has reconsidered PLS in the context of scalar RKHS. Another statistically motivated method is the curds and whey procedure of Breiman and Friedman (1997). This method consists of two main steps. First, the coordinates of a vector-valued function are separately estimated by means of a least-squares fit or by ridge regression. Then they are combined in a way that exploits possible correlations among the responses (output
coordinates). The authors show experiments where their method is capable of reducing prediction error when the outputs are correlated and of not increasing the error when the outputs are uncorrelated. The curds and whey method is also primarily restricted to modeling linear relations of possibly nonlinear functions. However, it should be possible to "kernelize" this method following the same lines as in Bennett and Embrechts (2003).

Acknowledgments

We are grateful to Dennis Creamer, head of the Computational Science Department at the National University of Singapore, for providing both of us with the opportunity to complete this work in a scientifically stimulating and friendly environment. Kristin Bennett of the Department of Mathematical Sciences at RPI and Wai Shing Tang of the Mathematics Department at NUS provided us with several helpful references and remarks. Bernard Buxton of the Department of Computer Science at UCL read a preliminary version of the manuscript and made useful suggestions. Finally, we are grateful to Phil Long of the Genome Institute of NUS for discussions that led to theorem 6, as well as to the referees for their useful comments. This work was partially supported by NSF grant ITR-0312113.

References

Akhiezer, N.I., & Glazman, I.M. (1993). Theory of linear operators in Hilbert spaces (Vol. 1). New York: Dover.
Amodei, L. (1997). Reproducing kernels of vector-valued function spaces. In A. Le Méhauté, C. Rabut, & L. L. Schumaker (Eds.), Curves and surfaces: Proceedings of Chamonix 1996. Nashville, TN: Vanderbilt University Press.
Aronszajn, N. (1950). Theory of reproducing kernels. Trans. Amer. Math. Soc., 68, 337–404.
Baldi, P., Pollastri, G., Frasconi, P., & Vullo, A. (2002). New machine learning methods for the prediction of protein topologies. In P. Frasconi & R. Shamir (Eds.), Artificial intelligence and heuristic methods for bioinformatics. Amsterdam: IOS Press.
Bennett, K.P., & Embrechts, M.J. (2003).
An optimization perspective on partial least squares. In J. Suykens, G. Horvath, S. Basu, C. Micchelli, & J. Vandewalle (Eds.), Advances in learning theory: Methods, models, and applications (pp. 227–250). Amsterdam: IOS Press.
Bennett, K.P., & Mangasarian, O.L. (1993). Multicategory discrimination via linear programming. Optimization Methods and Software, 3, 722–734.
Berberian, S.K. (1966). Notes on spectral theory. New York: Van Nostrand.
Beymer, D., & Poggio, T. (1996). Image representations for visual learning. Science, 272(5270), 1905–1909.
Breiman, L., & Friedman, J. (1997). Predicting multivariate responses in multiple linear regression (with discussion). J. Roy. Statist. Soc. B, 59, 3–37.
Burbea, J., & Masani, P. (1984). Banach and Hilbert spaces of vector-valued functions. New York: Pitman.
Cherkassky, V., & Mulier, F. (1998). Learning from data: Concepts, theory, and methods. New York: Wiley.
Cornford, D., Csato, L., Evans, D., & Opper, M. (2004). Bayesian analysis of the scatterometer wind retrieval inverse problem: Some new approaches. Journal of the Royal Statistical Society B, 66, 1–17.
Crammer, K., & Singer, Y. (2001). On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2, 265–292.
Cressie, N.A.C. (1993). Statistics for spatial data. New York: Wiley.
Cristianini, N., & Shawe-Taylor, J. (2000). An introduction to support vector machines. Cambridge: Cambridge University Press.
Evgeniou, T., & Pontil, M. (2004). Regularized multi-task learning. In Proc. of the 10th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining. Seattle, WA.
Evgeniou, T., Pontil, M., & Poggio, T. (2000). Regularization networks and support vector machines. Advances in Computational Mathematics, 13, 1–50.
Fillmore, P.A. (1970). Notes on operator theory. New York: Van Nostrand.
FitzGerald, C.H., Micchelli, C.A., & Pinkus, A.M. (1995). Functions that preserve families of positive definite functions. Linear Algebra and Its Applications, 221, 83–102.
Franke, U., Gavrila, D., Goerzig, S., Lindner, F., Paetzold, F., & Woehler, C. (1998). Autonomous driving goes downtown. IEEE Intelligent Systems, 13, 40–48.
Hastie, T., Tibshirani, R., & Friedman, J. (2002). The elements of statistical learning: Data mining, inference, and prediction. New York: Springer-Verlag.
Mangasarian, O.L. (1994). Nonlinear programming. Philadelphia: SIAM.
Melkman, A.A., & Micchelli, C.A. (1979). Optimal estimation of linear operators in Hilbert spaces from inaccurate data. SIAM Journal of Numerical Analysis, 16(1), 87–105.
Micchelli, C.A. (1995). Mathematical aspects of geometric modeling. Philadelphia: SIAM.
Micchelli, C.A., & Pontil, M.
(2003). On learning vector-valued functions (Research Note No. RN/03/08). London: Department of Computer Science, University College London.
Micchelli, C.A., & Pontil, M. (2004). A function representation for learning in Banach spaces. In J. Shawe-Taylor & Y. Singer (Eds.), Proceedings of the Seventeenth Annual Conference on Learning Theory. New York: Springer-Verlag.
Rosipal, R., & Trejo, L.J. (2001). Kernel partial least squares regression in reproducing kernel Hilbert spaces. Journal of Machine Learning Research, 2, 97–123.
Schölkopf, B., & Smola, A.J. (2002). Learning with kernels. Cambridge, MA: MIT Press.
Sejnowski, T.J., & Rosenberg, C.R. (1987). Parallel networks which learn to pronounce English text. Complex Systems, 1, 145–163.
Tikhonov, A.N., & Arsenin, V.Y. (1977). Solutions of ill-posed problems. Washington, DC: Winston.
Vapnik, V.N. (1998). Statistical learning theory. New York: Wiley.
Wahba, G. (1990). Spline models for observational data. Philadelphia: SIAM.
Weston, J., Chapelle, O., Elisseeff, A., Schölkopf, B., & Vapnik, V.N. (2003). Kernel dependency estimation. In Advances in neural information processing systems, 15. Cambridge, MA: MIT Press.
Weston, J., & Watkins, C. (1998). Multi-class support vector machines (Tech. Rep. No. CSD-TR-98-04). London: Department of Computer Science, Royal Holloway, University of London.
Williams, C.K.I., & Barber, D. (1998). Bayesian classification with gaussian processes. IEEE Trans. on Patt. Anal. Mach. Intel., 20, 1342–1351.

Received November 5, 2003; accepted May 27, 2004.
LETTER
Communicated by Stephen Jose Hanson
RSPOP: Rough Set–Based Pseudo Outer-Product Fuzzy Rule Identification Algorithm Kai Keng Ang
[email protected] Chai Quek
[email protected] Centre for Computational Intelligence, Nanyang Technological University, School of Computer Engineering, Singapore 639798
System modeling with neuro-fuzzy systems involves two contradictory requirements: interpretability versus accuracy. The pseudo outer-product (POP) rule identification algorithm used in the family of pseudo outer-product-based fuzzy neural networks (POPFNN) suffers from an exponential increase in the number of identified fuzzy rules and in computational complexity arising from high-dimensional data. This decreases the interpretability of the POPFNN in linguistic fuzzy modeling. This article proposes a novel rough set–based pseudo outer-product (RSPOP) algorithm that integrates the sound concept of knowledge reduction from rough set theory with the POP algorithm. The proposed algorithm not only performs feature selection through the reduction of attributes but also extends the reduction to rules without redundant attributes. As many possible reducts exist in a given rule set, an objective measure is developed for POPFNN to correctly identify the reducts that improve the inferred consequence. Experimental results are presented using published data sets and a real-world application involving highway traffic flow prediction to evaluate the effectiveness of using the proposed algorithm to identify fuzzy rules in the POPFNN using the compositional rule of inference and singleton fuzzifier (POPFNN-CRI(S)) architecture. Results showed that the proposed rough set–based pseudo outer-product algorithm reduces computational complexity, improves the interpretability of neuro-fuzzy systems by identifying significantly fewer fuzzy rules, and improves the accuracy of the POPFNN.

Neural Computation 17, 205–243 (2005)
© 2004 Massachusetts Institute of Technology

1 Introduction

Neural networks and fuzzy systems are very popular techniques in soft computing (Zadeh, 1994). Neuro-fuzzy hybridization synergizes these two techniques by combining the human-like reasoning style of fuzzy systems with the learning and connectionist structure of neural networks. Neuro-fuzzy hybridization is widely termed fuzzy neural networks (FNN) or neuro-fuzzy systems (NFS) in the literature (Mitra & Hayashi, 2000; Buckley & Hayashi, 1994, 1995; Nauck, Klawonn, & Kruse, 1997; Lin & Lee, 1996). Neuro-fuzzy systems (the more popular term is used henceforth) incorporate the human-like reasoning style of fuzzy systems through the use of fuzzy sets and a linguistic model consisting of a set of if-then fuzzy rules. Thus, the main strength of neuro-fuzzy systems is that they are universal approximators (Castro, 1995; Ying, Ding, Li, & Shao, 1999; Tikk, Kóczy, & Gedeon, 2003) with the ability to solicit interpretable if-then rules (Guillaume, 2001). The identification of interpretable fuzzy rules thus forms the most important aspect of the design of neuro-fuzzy systems. A number of articles on the learning of fuzzy rules from numerical data have been published (Lin & Lee, 1991; Hayashi, Nomura, Yamasaki, & Wakami, 1992; Ishibuchi, Tanaka, & Okada, 1994; Yager, 1994; Shann & Fu, 1995; Lozowski, Cholewo, & Zurada, 1996; Quek & Zhou, 2001; Tung & Quek, 2002b), and a survey is given in Mitra and Hayashi (2000). Fuzzy modeling with neuro-fuzzy systems involves two contradictory requirements: interpretability versus accuracy. In practice, one of the two properties prevails. The fuzzy modeling research field is divided into two areas: linguistic fuzzy modeling, which is focused on interpretability, mainly the Mamdani model (Mamdani & Assilian, 1975) given in equation 1.1, and precise fuzzy modeling, which is focused on accuracy, mainly the Takagi-Sugeno-Kang (TSK) model (Takagi & Sugeno, 1985; Sugeno & Kang, 1988) given in equations 1.2 and 1.3 (Guillaume, 2001; Casillas, Cordón, Herrera, & Magdalena, 2003). A recent comprehensive coverage of interpretability issues in fuzzy modeling is given in Casillas et al. (2003).

Rk: IF x1 is Ak,1 AND · · · AND xn1 is Ak,n1 THEN y is Bk   (1.1)

Rk: IF x1 is Ak,1 AND · · · AND xn1 is Ak,n1 THEN yk = ak · x + bk   (1.2)

$$y = \sum_{k=1}^{n_3} w_k(x)\, y_k, \tag{1.3}$$
where x, y are the input vector x = [x_1, ..., x_{n_1}] and the output value, respectively; A_{k,i}, B_k are linguistic labels with associated fuzzy sets defining their meaning; n_1 is the number of inputs; and n_3 is the number of rules. The consequents in equation 1.2 are linear functions of the inputs, whereas the consequents in equation 1.1 are simply linguistic labels. Therefore, the TSK model has decreased interpretability but increased representative power compared to the Mamdani model (Casillas et al., 2003). Recently, a number of articles addressing the interpretability issues of the TSK model have been reported (Johansen & Babuška, 2003; Jin, 2000; Yen, Wang, & Gillespie, 1998). In contrast, an increased number of fuzzy rules is needed in the Mamdani model to yield the same representative power as the TSK model. However, if the number of fuzzy rules is large in either model, then any semantic
interpretability is lost (Tikk & Baranyi, 2003; Nauck, 2003). Moreover, if the number of rules is bounded as a practical limitation, then the universal approximation property of the models no longer holds (Moser, 1999). Thus, the interpretability issue that arises from the large number of fuzzy rules motivates the complexity reduction of rule-based models; a recent survey is given in Kaynak, Jezernik, and Szeghegyi (2002). This issue is similar to the problems encountered by numerical data-driven techniques in data mining (Han & Kamber, 2001) due to ultra-large and redundant data. These techniques rely on heuristics to guide their search or to reduce their search space horizontally or vertically (Lin & Cercone, 1997). Horizontal reduction is realized by the merging of identical data tuples or the quantization of continuous numerical values. Vertical reduction is realized by the application of feature selection methods. In neuro-fuzzy systems, the former corresponds to fuzzy sets and the latter to fuzzy rule pruning and reduction. In the literature, fuzzy rule identification algorithms apply fuzzy rule pruning based on certainty factors (Chakraborty & Pal, 2004; Shann & Fu, 1995) or identify fuzzy rules based on certain heuristic thresholds (Quek & Zhou, 2001; Tung & Quek, 2002b). Recently, rough set methods (Pawlak, 1991) have been shown to reduce pattern dimensionality significantly and have proven to be viable data mining techniques (Swiniarski & Skowron, 2003). Shen and Chouchoulas (2002) presented a rough-fuzzy approach to fuzzy rule identification, using the rough set attribute reduction (RSAR) algorithm (Shen & Chouchoulas, 1999) to reduce the attributes of fuzzy rules identified by the rule induction algorithm (RIA) (Lozowski et al., 1996). This approach not only reduced the sensitivity of the fuzzy rules to the data dimension but also outperformed the fuzzy rules before reduction.
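As an aside, the computation in equations 1.2 and 1.3 can be made concrete with a minimal sketch; the membership functions and rule parameters below are invented for illustration, and a normalized weighted sum is assumed for equation 1.3:

```python
def tsk_output(x, rules):
    """TSK inference (equations 1.2 and 1.3): each rule k contributes a
    linear consequent y_k = a_k . x + b_k, and the rule outputs are blended
    by normalized firing strengths w_k(x)."""
    num = den = 0.0
    for membership, a, b in rules:
        w_k = membership(x)                                   # firing strength w_k(x)
        y_k = sum(a_i * x_i for a_i, x_i in zip(a, x)) + b    # equation 1.2
        num += w_k * y_k                                      # equation 1.3 (numerator)
        den += w_k
    return num / den

# Two hypothetical rules over a single input, with triangular memberships.
rules = [
    (lambda x: max(0.0, 1 - abs(x[0] - 1) / 2), (2.0,), 0.0),
    (lambda x: max(0.0, 1 - abs(x[0] - 3) / 2), (0.5,), 3.0),
]
print(tsk_output((1.0,), rules))  # only the first rule fires: 2*1 + 0 = 2.0
```

A Mamdani rule base (equation 1.1) would instead map firing strengths onto output fuzzy sets and defuzzify, trading this representative power for linguistic interpretability.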
This motivates the investigation of the rough-fuzzy approach to fuzzy rule identification in this work. This letter proposes a novel rough set-based pseudo outer-product (RSPOP) fuzzy rule identification algorithm for the pseudo outer-product-based fuzzy neural networks (POPFNN) (Ang, Quek, & Pasquier, 2003; Zhou & Quek, 1996; Quek & Zhou, 1999). The pseudo outer-product (POP) fuzzy rule identification algorithm (Quek & Zhou, 2001) used in POPFNN is a simple one-pass rule identification algorithm that is fast, reliable, efficient, and easy to understand. However, it suffers from an exponential increase in the number of fuzzy rules identified, as well as an increased computational complexity, with high-dimensional data. This decreases the interpretability of POPFNN in linguistic fuzzy modeling as well as its accuracy in high-dimensional modeling. Therefore, instead of pruning fuzzy rules based on certainty factors, which introduces inconsistencies, as described in Pal and Pal (1999), or using a heuristic threshold to govern the derivation of fuzzy rules, the proposed RSPOP algorithm integrates the sound concept of knowledge reduction from rough set theory (Pawlak, 1991) with the POP algorithm (Quek & Zhou, 2001). Compared with rough set attribute reduction (RSAR) (Shen & Chouchoulas, 1999), the proposed
RSPOP algorithm not only performs reduction of redundant attributes but also extends the reduction to the fuzzy rules without redundant attributes. A brief background of POPFNN and rough sets (Pawlak, 1991) is provided in sections 2 and 3. Section 4 gives a detailed description of the proposed RSPOP algorithm. Experimental results and analysis are provided in section 5. Finally, conclusions are presented in section 6.

2 Pseudo Outer-Product-Based Fuzzy Neural Network

POPFNN is a family of neuro-fuzzy systems based on the linguistic fuzzy model (Quek & Zhou, 2001). Three members of POPFNN exist in the literature: POPFNN-CRI(S) (Ang et al., 2003), which is based on the commonly accepted fuzzy compositional rule of inference; POPFNN-TVR (Zhou & Quek, 1996), which is based on truth value restriction; and POPFNN-AARS(S) (Quek & Zhou, 1999), which is based on the approximate analogical reasoning scheme. The POPFNN architecture is a five-layer neural network, as shown in Figure 1. For simplicity, only the interconnections for the output y_m are shown in the figure. Each layer performs a specific fuzzy operation, and the nodes and operations of each layer are noted with a superscript of I to V for clarity. The inputs and outputs are represented as the nonfuzzy vectors X = [x_1, x_2, ..., x_i, ..., x_{n_1}] and Y = [y_1, y_2, ..., y_l, ..., y_{n_5}], respectively. The fuzzification of the inputs and the defuzzification of the outputs are, respectively, performed by the input linguistic and output linguistic layers, while the fuzzy inference is collectively performed by the rule, condition, and consequence layers. The numbers of neurons in the condition, rule, and consequence layers are defined in equations 2.1 to 2.3, respectively:

n_2 = \sum_{i=1}^{n_1} J_i    (2.1)

n_3 = \prod_{i=1}^{n_1} J_i    (2.2)

n_4 = \sum_{m=1}^{n_5} L_m,    (2.3)
where J_i is the number of linguistic labels for the ith input; L_m is the number of linguistic labels for the mth output; n_1 is the number of inputs; n_2 is the number of neurons in the condition layer; n_3 is the number of rules or rule-based neurons; n_4 is the number of linguistic labels for the outputs; and n_5 is the number of outputs.

Figure 1: Structure of pseudo outer-product-based fuzzy neural networks (POPFNN). (Layers, from bottom to top: Input Linguistic Layer I (input nodes), Condition Layer II (input-label nodes), Rule Layer III (rule nodes), Consequence Layer IV (output-label nodes), and Output Linguistic Layer V (output nodes).)

Neurons in the input linguistic layer are called input nodes. Each input node I_i^I represents an input linguistic variable (Zadeh, 1975a, 1975b, 1975c) of the corresponding input x_i. Neurons in the condition layer are called input-label nodes. Each input-label node IL_{i,j}^II represents the jth linguistic label of the ith input node from the input layer. The input-label nodes constitute the antecedents of the fuzzy rules. Each input-label node is represented by a membership function µ_{i,j}(x). Two commonly used fuzzy membership functions are the gaussian membership function, described by two parameters (v_{i,j}, w_{i,j}), and the trapezoidal membership function, described by a fuzzy interval formed by four parameters (α_{i,j}, β_{i,j}, γ_{i,j}, δ_{i,j}). Neurons in the rule-based layer are called rule nodes. Each rule node R_k^III represents an if-then fuzzy rule. Neurons in the consequence layer are called output-label nodes. The output-label node OL_{m,l}^IV represents the lth linguistic label of the output y_m. The output
label nodes constitute the conclusions of the if-then fuzzy rules. Each output-label node is represented by a membership function µ_{m,l}(x) similar to that of the input-label nodes, where gaussian or trapezoidal membership functions can be used. The neurons in the output linguistic layer are called output nodes. The output node O_m^V represents the output linguistic variable of the output y_m. The function of an output node is to perform defuzzification, where each conclusion expressed in terms of fuzzy sets in the consequence layer is converted to a single real number.

2.1 Learning Process of POPFNN. The learning process of POPFNN consists of three phases: fuzzy membership generation, fuzzy rule identification, and supervised fine-tuning. Various fuzzy membership generation algorithms can be used: learning vector quantization (LVQ; Kohonen, 1989), fuzzy Kohonen partitioning (FKP; Ang et al., 2003), or discrete incremental clustering (DIC; Tung & Quek, 2002a). Generally, the POP algorithm and its variant, LazyPOP, are used to identify the fuzzy rules (Quek & Zhou, 2001). Figure 2 shows the learning process of the POP algorithm in the identification of the output-label nodes for a rule node R_k^III.

Each rule node R_k^III is linked to only one input-label node from each input node and only one output-label node from each output node. The links of rule nodes to the antecedent in the condition layer and the consequent in the consequence layer are mathematically denoted as sets (C^I, D^V) in this article. The antecedent of rule nodes is represented by a set C^I = {c_1, c_2, ..., c_i, ..., c_{n_1}}, where c_i | c_i ∈ IL_i^II is a condition variable. The set of condition labels that c_i can assume is semantically represented by the set of input-label nodes IL_i^II = {IL_{i,1}^II, IL_{i,2}^II, ..., IL_{i,j}^II, ..., IL_{i,J_i}^II}, but a computational notation of IL_i^II = {1, 2, ..., j, ..., J_i} is used in this article.
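As a concrete illustration of the layer sizes in equations 2.1 to 2.3 (the label counts below are made up for the example):

```python
from math import prod

J = [3, 2, 3]  # hypothetical numbers of linguistic labels per input (J_i)
L = [5]        # hypothetical numbers of linguistic labels per output (L_m)

n2 = sum(J)    # condition-layer neurons, equation 2.1
n3 = prod(J)   # rule neurons: one per label combination, equation 2.2
n4 = sum(L)    # consequence-layer neurons, equation 2.3

print(n2, n3, n4)  # 8 18 5
```

The product in equation 2.2 is what makes the initial rule base grow exponentially with the number of inputs, which is the problem RSPOP later addresses.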
Similarly, the consequent of rule nodes is represented by a set D^V = {d_1, d_2, ..., d_m, ..., d_{n_5}}, where d_m | d_m ∈ OL_m^IV is a consequent variable. The set of consequent labels that d_m can assume is semantically represented by the set of output-label nodes OL_m^IV = {OL_{m,1}^IV, OL_{m,2}^IV, ..., OL_{m,l}^IV, ..., OL_{m,L_m}^IV}, but a computational notation of OL_m^IV = {1, 2, ..., l, ..., L_m} is used in this article. The specific links of the kth rule node R_k^III to the antecedent in the condition layer and the consequent in the consequence layer are denoted as (C_k^I, D_k^V). During POP learning, the consequents of rule node R_k^III are computationally represented as weights w_k^III. A description of the POP algorithm (Quek & Zhou, 2001) follows:

POP Algorithm

• Step 1: Rule initialization. Initialize the links C_k^I of rule nodes to input-label nodes for k = 1 ··· n_3 such that all possible combinations of rules are created using equation 2.7. Initialize w_{k,m,l}^III = 0 for k = 1 ··· n_3, m = 1 ··· n_5, l = 1 ··· L_m.
Figure 2: Pseudo outer-product in the identification of output-label nodes for the kth rule node.
• Step 2: Hebbian learning. Given the training data tuples {{X_1, Y_1}, {X_2, Y_2}, ..., {X_p, Y_p}, ..., {X_n, Y_n}}, determine the weights w_{k,m,l}^III using pseudo outer-product Hebbian learning (Quek & Zhou, 2001; Hebb, 1949) given in equation 2.4. For m = 1 ··· n_5, l = 1 ··· L_m:

w_{k,m,l}^III = \sum_{p=1}^{n} f_{k,p}^III µ_{m,l}(y_{m,p}),    (2.4)

where w_{k,m,l}^III is the weight of the link between R_k^III and OL_{m,l}^IV; f_{k,p}^III is the firing strength of the pth instant of training data; µ_{m,l} is the membership value of the lth output-label node for the mth output; and y_{m,p} is the mth element of the pth instant of Y. End for.
• Step 3: Rule formulation. Determine the link of each rule node R_k^III to output-label nodes, denoted as D_k^V = {d_1, d_2, ..., d_m, ..., d_{n_5}}, where d_m ∈ OL_m^IV and OL_m^IV = {1, 2, ..., l, ..., L_m}. For k = 1 ··· n_3, m = 1 ··· n_5:
a. Compute the consequence link with the highest weight using equation 2.5:

w_{k,m,l_{max}}^III = \max_l w_{k,m,l}^III.    (2.5)

b. Select or delete the link to the consequence of the mth output using equation 2.6:

d_m = \begin{cases} l_{max}, & \text{if } w_{k,m,l_{max}}^{III} > 0 \\ \emptyset, & \text{if } w_{k,m,l_{max}}^{III} = 0 \end{cases}    (2.6)
c. Delete rule node R_k^III if D_k^V = ∅, that is, ∀ d_m ∈ D_k^V, d_m = ∅. End for.

d. Update n_3 with the total number of remaining rules.

The initialization of rules in step 1 and the computation of f_{k,p}^III in step 2 involve an integer rule index k. The computation of k given C_k^I is given in equation 2.7. This equation is constructed such that each condition variable c_i ∈ C_k^I construes a digit in a positional number system, where c_1 is the least significant digit and c_{n_1} is the most significant digit. As the set of condition labels each c_i can assume is IL_i^II = {1, 2, ..., J_i}, the term \prod_{p=0}^{i-1} J_p computes the associated weight of each digit, and the addition and subtraction of unity in the equation ensure that all possible combinations of C_k^I fall within the contiguous integer range of k = 1 ··· n_3:

k = \sum_{i=1}^{n_1} (c_i - 1) \prod_{p=0}^{i-1} J_p + 1,  J_0 = 1,    (2.7)
where J_i is the number of linguistic labels for the ith input; c_i is the condition variable of the ith input of rule R_k s.t. c_i ∈ C_k^I, c_i ∈ IL_i^II; C_k^I is the set of condition variables of rule R_k, C_k^I = {c_1, c_2, ..., c_i, ..., c_{n_1}}; and IL_i^II is the set of condition labels c_i can assume, IL_i^II = {1, 2, ..., J_i}.

3 Preliminaries on Rough Set Theory

Pawlak (1991) introduced rough set theory to deal with imprecise or vague concepts. A recent development of rough set theory and artificial intelligence is available in Curry (2003). A rough set is an approximation of a vague concept by a pair of precise concepts called the lower and upper approximations. The lower approximation is a description of the domain objects that are known with certainty to belong to the subset of interest, whereas the upper approximation is a description of the objects that possibly belong
to the subset. In addition, rough set theory provides rigorous mathematical techniques for discovering regularities in data. The approach in this letter applies two fundamental concepts of rough set knowledge reduction in the derivation of fuzzy rules: a reduct and the core. A reduct of knowledge is its essential part, which suffices to define all basic concepts occurring in the considered knowledge, whereas the core is the most important part of the knowledge. Central to this rough set approach is the concept of attribute dispensability in terms of indiscernibility relations (Pawlak, 1991). An attribute R ∈ P is dispensable in P if it satisfies

IND(P) = IND(P − {R}),    (3.1)
where IND(P) is the indiscernibility relation over P, which is the intersection of all equivalence relations belonging to P. Rough set knowledge reduction uses attribute-value pairs to represent the knowledge of decision rules (Pawlak, 1991). When decision rules are represented as such, rough set theory supplies the logical methods employing attribute dispensability and decision rule consistency for knowledge reduction and analysis. Because decision rules, as opposed to formulas, do not have the attributes of truth or falsity, consistency and inconsistency are used instead to describe decision rules in computational terms. The following sections elaborate on the knowledge representation system (KRS), attribute dispensability, and consistency from rough set theory.

3.1 Knowledge Representation System. To express mathematically how rough set theory is applied in knowledge reduction, the Tarski-style semantics of the decision logic (DL) language is used in Pawlak (1991), where the notions of a model and satisfiability are employed. Knowledge is represented as a model S = (U, A), where U is the universe of discourse and A is the set of attributes. A formula is satisfied in S by an element x in U if and only if the extraction of the attribute value from x belongs to the value set of the attribute. In other words, an object x ∈ U satisfies a formula in S if and only if a(x) = v, where a ∈ A and v ∈ V, with V = ∪ V_a and V_a being the set of attribute values for a ∈ A. An atomic formula is represented as (a, v) in the DL language for any a ∈ A and v ∈ V. A formula is expressed as

(a_1, v_1) ∧ (a_2, v_2) ∧ ··· ∧ (a_i, v_i) ∧ ··· ∧ (a_n, v_n),    (3.2)
where v_i is the attribute value constant, v_i ∈ V_{a_i}; a_i is the attribute constant, a_i ∈ A; and ∧ is the conjunction connective. In a decision rule system, A = {C ∪ D}, where C is the set of conditional attributes and D is the set of decision attributes. If P = {a_1, a_2, ..., a_i, ..., a_n} and P ⊆ A, then the formula in equation 3.2 is called a P-basic formula (in short, a P-formula). The disjunction of all P-formulas satisfied in the system S is represented as Σ_S P. If P = A, then Σ_S A is called the characteristic formula of S. The example in Table 1 illustrates how the knowledge is represented.

Table 1: Example of the Knowledge Representation System Using an Attribute-Value Table, Adopted from Pawlak (1991).

U | a1 | a2 | a3
1 | 1  | 0  | 2
2 | 2  | 0  | 3
3 | 1  | 1  | 1
4 | 1  | 1  | 1
5 | 2  | 1  | 3
6 | 1  | 0  | 3

Expressing the atomic formula (a_i, v) in short as a_{i,v} and omitting the conjunction symbol ∧, the characteristic formula of the knowledge in Table 1 is

Σ_S A = a_{1,1} a_{2,0} a_{3,2} ∨ a_{1,2} a_{2,0} a_{3,3} ∨ a_{1,1} a_{2,1} a_{3,1} ∨ a_{1,2} a_{2,1} a_{3,3} ∨ a_{1,1} a_{2,0} a_{3,3}.    (3.3)
Now, given a decision rule φ → ψ, where φ and ψ are a C-basic and a D-basic formula, respectively, the decision rule φ → ψ is called a CD-basic decision rule, or in short a CD rule, denoted as (C, D). Considering the knowledge in Table 1 and assuming that C = {a_1, a_2} and D = {a_3}, the sets C and D give rise to the following CD decision rules:

a_{1,1} a_{2,0} → a_{3,2}
a_{1,2} a_{2,0} → a_{3,3}
a_{1,1} a_{2,1} → a_{3,1}
a_{1,2} a_{2,1} → a_{3,3}
a_{1,1} a_{2,0} → a_{3,3}    (3.4)
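The derivation of the CD rules in equation 3.4 from Table 1 can be sketched as follows (an illustrative snippet, not the authors' code):

```python
# Attribute-value table from Table 1: object -> (a1, a2, a3).
table = {1: (1, 0, 2), 2: (2, 0, 3), 3: (1, 1, 1),
         4: (1, 1, 1), 5: (2, 1, 3), 6: (1, 0, 3)}

# With C = {a1, a2} and D = {a3}, each object yields a CD decision rule
# mapping its condition values to its decision value.
rules = [((a1, a2), a3) for a1, a2, a3 in table.values()]

# Objects 3 and 4 carry identical descriptions, so five distinct
# CD rules remain, matching equation 3.4.
print(sorted(set(rules)))
```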
3.2 Rule Consistency. When a decision rule φ → ψ satisfies S, the decision rule is consistent in S if and only if for any decision rule φ′ → ψ′ in S, φ = φ′ implies ψ = ψ′ (Pawlak, 1991). Considering the knowledge in Table 1 with C = {a_1, a_2} and D = {a_3}, rule a_{1,1} a_{2,0} → a_{3,2} is inconsistent with rule a_{1,1} a_{2,0} → a_{3,3}. Thus, the decision rule base (C, D) is inconsistent. If either rule is removed, then (C, D) is consistent.

3.3 Attribute Reduction. Considering a consistent CD decision rule base (C, D), an attribute a_i ∈ C is dispensable if and only if ((C − {a_i}), D) is consistent (Pawlak, 1991). If all attributes a_i ∈ C are indispensable in (C, D), then (C, D) is independent. The concept of a reduct of knowledge in rough set theory is a subset of attributes R ⊆ C such that (R, D) is independent
and consistent. The concept of the core of knowledge in rough set theory is the set of all indispensable attributes of the knowledge (C, D), denoted as CORE(C, D).

3.4 Decision Rule Reduction. The concept of a reduct can also be applied to simplify dispensable conditions of decision rules, similar to the attribute reduction performed on the decision rule base (Pawlak, 1991). Before proceeding further, an additional notation, /, is required. Considering a CD rule φ → ψ and Q ⊆ C, the notation φ/Q denotes the Q-basic formula obtained from φ by retaining only the atomic formulas whose attributes belong to Q. An attribute a_i is dispensable in the decision rule φ → ψ if the rule obtained by removing a_i from C is also consistent. In other words, an attribute a_i ∈ C is dispensable in a rule φ → ψ that is consistent in S if and only if φ/(C − {a_i}) → ψ is also consistent in S.

4 Rough Set-Based Pseudo Outer-Product

Although the POP algorithm (Quek & Zhou, 2001) is a simple, easy-to-understand, one-pass learning algorithm that performs fast, reliable, and efficient consequence identification for fuzzy rules, it has to compute all possible rules during the learning process in equation 2.4 (Quek & Zhou, 1999). As a result, the fuzzy rules identified using POP usually consist of a large number of rules. The total number of rules identified, given in equation 2.2, also increases rapidly with an increase in the dimension of the input space or the number of input labels. As pointed out in Tikk and Baranyi (2003) and Nauck (2003), if the number of fuzzy rules identified is large, then any semantic interpretability of the fuzzy rules is lost. To resolve the problem of identifying a large number of rules, the lazy pseudo outer-product (LazyPOP) learning algorithm was proposed in Quek and Zhou (1999). Instead of considering all possible rules, this algorithm uses three user-defined threshold criteria—noise criterion (NC), ambiguity criterion (AC), and focus criterion (FC)—in the selection and pruning of fuzzy rules.
Similarly, the RULEMAP algorithm in the generic self-organizing fuzzy neural network (GenSoFNN) uses two user-defined thresholds, Thres_ISP and Thres_OSP, to govern the identification of fuzzy rules (Tung & Quek, 2002b). The introduction of heuristic thresholds in the identification and pruning of fuzzy rules serves to reduce the number of identified fuzzy rules. However, the accuracy of the neuro-fuzzy system and the interpretability of the fuzzy rules identified are usually heavily dependent on the selection of these threshold values. This motivates work to enhance POP using the rough set approach, instead of heuristic thresholds, to identify a reduced number of fuzzy rules. The proposed rough set-based pseudo outer-product (RSPOP) learning algorithm consists of three parts: RSPOP rule identification, RSPOP attribute reduction, and RSPOP rule reduction. A description of the RSPOP learning algorithm follows, with a detailed explanation of the rationale of the algorithm:
RSPOP Rule Identification

• Step 1: Initialization. Initialize C_k^I and w_{k,m,l}^III as in step 1 of POP.

• Step 2: Hebbian learning. Given the training data tuples {{X_1, Y_1}, {X_2, Y_2}, ..., {X_p, Y_p}, ..., {X_n, Y_n}}, for p = 1 ··· n:

a. Determine the input labels C_{k,p}^I = {c_1, c_2, ..., c_i, ..., c_{n_1}} for the pth instant of X with the maximum output for each input using

c_i = j_{max} | o_{i,j_{max},p}^II = \max_j(o_{i,j,p}^II) ∀ i = 1 ··· n_1,  c_i ∈ C_{k,p}^I,    (4.1)

where k is the index of the rule node to be computed from C_{k,p}^I using equation 2.7; o_{i,j,p}^II is the output of IL_{i,j}^II for the pth instant of X; and j_{max} is the label index j = 1 ··· J_i for a particular ith input from IL_{i,j}^II that yields the highest output for the pth instant of X.

b. Compute the firing strength of the most influential rule node R_k^III using

f_{k,p}^III = \min_i(o_{i,c_i}^II),    (4.2)

where o_{i,c_i}^II is the output of IL_{i,c_i}^II, c_i ∈ C_{k,p}^I, for the pth instant of X.

c. Compute the output labels D_{k,p}^V = {d_1, d_2, ..., d_m, ..., d_{n_5}} for the pth instant of Y with the maximum membership value using

d_m = l_{max} | µ_{m,l_{max}}(y_{m,p}) = \max_l µ_{m,l}(y_{m,p}) ∀ m = 1 ··· n_5,  d_m ∈ D_{k,p}^V,    (4.3)

where µ_{m,l} is the membership function of OL_{m,l}^IV; y_{m,p} is the mth element of the pth instant of Y; and l_{max} is the label index l = 1 ··· L_m for a particular mth output from OL_{m,l}^IV that yields the maximum membership value for the pth instant of Y.

d. Update the weights w_{k,m,d_m}^III of the most influential rule node R_k^III using a Hebbian learning rule (Hebb, 1949) given in

w_{k,m,d_m}^III = f_{k,p}^III × µ_{m,d_m}(y_{m,p}) ∀ m = 1 ··· n_5,  d_m ∈ D_{k,p}^V,    (4.4)

where w_{k,m,d_m}^III is the weight of R_k^III to the OL_{m,d_m}^IV with the highest membership value; f_{k,p}^III is the firing strength of the most influential rule node R_k^III for the pth instant of X from equations 4.1
to 4.3; and µ_{m,d_m} is the maximum membership value of the mth output for the pth instant of Y from equation 4.3. End for.

• Step 3: Rule formulation. Determine the link of each rule node R_k^III to output-label nodes and update n_3 with the total number of rules as in step 3 of POP.

RSPOP Attribute Reduction

• Step 1: Baseline. Given the training data tuples {{X_1, Y_1}, {X_2, Y_2}, ..., {X_p, Y_p}, ..., {X_n, Y_n}} and the rules in POPFNN before reduction as (C^I, D^V), compute the baseline objective measure of the CD rule base (C^I, D^V). For p = 1 ··· n: Compute the POPFNN consequence output for the pth instant of X using

CO_{m,l,p} = o_{m,l,p}^IV ∀ m = 1 ··· n_5, l = 1 ··· L_m.    (4.5)
End for.

• Step 2: Reduction. Perform input attribute reduction for each attribute c_i of the rule base. For i = 1 ··· n_1:

a. Compute the consistency of the CD rule base ((C^I − {c_i}), D^V). ((C^I − {c_i}), D^V) is inconsistent if the following condition is true, that is, when rules with the same conditions ignoring attribute c_i infer different consequents:

∃ R_k: ((C_k^I − {c_i}), D_k^V), ∃ R_{k′}: ((C_{k′}^I − {c_i}), D_{k′}^V) |
((C_k^I − {c_i}) = (C_{k′}^I − {c_i})) ∧ (D_k^V ≠ D_{k′}^V),  k, k′ = 1 ··· n_3.    (4.6)
b. If ((C^I − {c_i}), D^V) is consistent, compute the objective measure of the CD rule base ((C^I − {c_i}), D^V) using

CO_{m,l,p}^reduct = o_{m,l,p}^IV ∀ p = 1 ··· n, m = 1 ··· n_5, l = 1 ··· L_m.    (4.7)
c. Detect any deterioration of the objective measure by comparing the objective measure computed in step 2b with the baseline objective measure. The objective measure has deteriorated if the following condition is true, that is, when one of the correct consequent outputs (CO_{m,l,p}, l = d_m) using the reduct is decreased. The objective measure has also deteriorated if an incorrect consequent output (CO_{m,l,p}, l ≠ d_m) is greater than that of a correct consequent output:

∃ p, ∃ m s.t. ∀ l ((CO_{m,l,p}^reduct < CO_{m,l,p}), l = d_m ∈ D_{k,p}^V) ∨
((CO_{m,l,p}^reduct > CO_{m,l,p}) ∧ (CO_{m,l,p}^reduct > CO_{m,d_m,p}^reduct), l ≠ d_m ∈ D_{k,p}^V),
∀ p = 1 ··· n, m = 1 ··· n_5, l = 1 ··· L_m.    (4.8)
d. If there is no deterioration of the objective measure, then allow the reduction of attribute c_i using the two following equations, remove duplicated rules, and recompute n_3:

(C^I, D^V) = ((C^I − {c_i}), D^V)    (4.9)

CO_{m,l,p} = CO_{m,l,p}^reduct ∀ p = 1 ··· n, m = 1 ··· n_5, l = 1 ··· L_m.    (4.10)
End for.

RSPOP Rule Reduction

• Step 1: Baseline. Given the training data tuples {{X_1, Y_1}, {X_2, Y_2}, ..., {X_p, Y_p}, ..., {X_n, Y_n}} and the rules in POPFNN before reduction as (C^I, D^V), compute the baseline objective measure CO_{m,l,p} of the CD rule base (C^I, D^V) as in step 1 of RSPOP attribute reduction.

• Step 2: Reduction. Perform rule reduction for each attribute c_i of each R_k^III. For k = 1 ··· n_3, i = 1 ··· n_1:

a. Based on the picked rule R_k^III: (C_k^I, D_k^V) and the picked condition attribute c_i, denoting the CD rule as φ → ψ, compute the consistency of all CD rules φ/(C_k^I − {c_i}) → ψ using

∃ R_{k′}: ((C_{k′}^I − {c_i}), D_{k′}^V) |
((C_k^I − {c_i}) = (C_{k′}^I − {c_i})) ∧ (D_k^V ≠ D_{k′}^V),  ∀ k′ = 1 ··· n_3.    (4.11)
b. If the CD rules φ/(C_k^I − {c_i}) → ψ are consistent, remove attribute c_i from these CD rules and compute the objective measure of POPFNN using these reduced rules with equation 4.7.

c. Detect any deterioration of the objective measure of using the reduced rules, that is, whether equation 4.8 is true.

d. If there is no deterioration of the objective measure, allow the reduction of the rules, and remove duplicated rules. End for.

The first part, RSPOP rule identification, addresses the inherent problem of the POP algorithm in considering all possible rules. In equation 2.4 of the
POP algorithm, the weight w_{k,m,l}^III is dependent on the firing strength f_{k,p}^III for the pth instant of training data {X, Y}. In the computation of o_{m,l}^IV, the highest f_{k,p}^III from R_k^III results in the largest o_{m,l}^IV. This implies that the kth rule with the highest firing strength f_{k,p}^III is the most influential rule for the pth instant of training data {X, Y}. However, it is not necessary to consider all possible rules to determine the rule index k of the rule node R_k^III with the highest f_{k,p}^III. The firing strength f_{k,p}^III for the pth instant of training data {X, Y} is computed from the same equation given in

f_{k,p}^III = \min_i(o_{i,j}^II),    (4.12)

where o_{i,j}^II is the output of the input-label node IL_{i,j}^II that forms the antecedent condition for the ith input of the kth fuzzy rule for the pth instant of X. Expanding equation 4.12 with the objective of obtaining a maximum f_{k,p}^III gives

f_{k,p}^III = \min(\max(o_{1,c_1}^II), \max(o_{2,c_2}^II), ..., \max(o_{i,c_i}^II), ..., \max(o_{n_1,c_{n_1}}^II)),    (4.13)

where c_i is the label index j = 1 ··· J_i for a particular ith input from IL_{i,j}^II that yields the highest output for the pth instant of X. Therefore, the maximum firing strength f_{k,p}^III for the pth instant of training data tuple {X, Y} can be computed from the minimum over all inputs of the maximum output of the condition layer for each input. This allows the identification of the most influential rule node R_k^III with the conditions C_{k,p}^I = {c_1, c_2, ..., c_i, ..., c_{n_1}}, where c_i ∈ IL_i^II, for the pth instant of training data tuple {X, Y} in equation 4.1 of RSPOP rule identification step 2a.

The time complexity of the reduction step 2a of both RSPOP attribute reduction and RSPOP rule reduction is O(n_3^2), which is dependent on the number of fuzzy rules. If POP is used, then the reduction step inherits the problem from POP of considering a large number of rules, given in equation 2.2. RSPOP rule identification improves on POP such that, given the pth instant of training data tuple {X, Y}, RSPOP identifies only the one most influential rule. Because this rule may already have been identified from a previous instant of the training data, the total number of fuzzy rules identified using RSPOP rule identification is bounded from above as given in

n_3 ≤ n,    (4.14)

where n is the number of training tuples and n_3 is the number of rules identified. Thus, RSPOP rule identification improves on the time complexity of POP by reducing the number of rules considered by POP, which is exponentially
dependent on the dimension of the input space in equation 2.2, to the upper bound given in equation 4.14. Comparing RSPOP against LazyPOP, the former identifies only one influential rule for a given training data tuple, whereas the latter identifies a number of legitimate rules based on a heuristic noise criterion. This gives RSPOP a stronger advantage in identifying a significantly reduced number of fuzzy rules for high-dimensional modeling, which improves the interpretability of the fuzzy rules identified. To allow the reduction of attributes and rules in POPFNN, the firing strength of rules in POPFNN has to accommodate the removal of condition attributes. Therefore, the firing strength of a rule R_k^III in the rule layer of POPFNN, modified to handle reduced attributes and rules, is

f_k^III = \min_i(o_{i,c_i}^II), c_i ∈ C_k^I, c_i ≠ ∅, ∀ i = 1 ··· n_1.    (4.15)
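A minimal sketch of the modified firing strength in equation 4.15, where an antecedent removed by reduction is marked as None (an assumed encoding for c_i = ∅):

```python
def firing_strength(o, conditions):
    """Equation 4.15: the firing strength of a rule is the minimum of the
    condition-layer outputs o[i][c_i - 1], taken only over inputs whose
    antecedent label has not been removed by reduction (c_i != None)."""
    return min(o[i][c - 1] for i, c in enumerate(conditions) if c is not None)

# Hypothetical condition-layer outputs o_{i,j} for three inputs, two labels each.
o = [[0.2, 0.9], [0.4, 0.1], [0.5, 0.6]]
print(firing_strength(o, [2, 1, 2]))     # min(0.9, 0.4, 0.6) = 0.4
print(firing_strength(o, [2, None, 2]))  # second attribute reduced: min(0.9, 0.6) = 0.6
```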
In the reduction of identified fuzzy rules, the effectiveness of a reduced rule base must not deteriorate compared to the rule base prior to reduction. Therefore, an objective measure is vital in assessing the effectiveness of the neuro-fuzzy system before and after reduction. In rough set theory, there can be only one core of knowledge, but there can be many possible reducts (Pawlak, 1991). This objective measure thus serves the purpose of identifying the reducts that improve but do not deteriorate the performance of the neuro-fuzzy system. The fundamental objective of a fuzzy rule identification algorithm is to identify rules that infer correct consequents based on the conditions and consequents learned from training data. Thus, an objective measure can be construed based on the correct inference of consequents from training data. Given the training data $\{\{X_1, Y_1\}, \{X_2, Y_2\}, \ldots, \{X_p, Y_p\}, \ldots, \{X_n, Y_n\}\}$, the $l$th consequent of the $m$th output for the $p$th instant of training data $\{X_p, Y_p\}$ is $o^{IV}_{m,l}$, which is computed in the consequence layer. The correct consequent of the $m$th output is computed from the maximum membership of the output labels $OL^{IV}_{m,l}$ given $y_{m,p}$. Therefore, an improvement in the objective measure is an increase in $o^{IV}_{m,l_{max}}$, where $\mu_{m,l_{max}}(y_{m,p}) = \max_{l} \mu_{m,l}(y_{m,p})$. A deterioration in the objective measure is an increase in $o^{IV}_{m,l}$ for some $l \neq l_{max}$ such that $o^{IV}_{m,l} > o^{IV}_{m,l_{max}}$. This objective measure is used in RSPOP attribute reduction and RSPOP rule reduction to assess the performance of POPFNN before and after reduction. The application of attribute reduction and rule reduction in RSPOP is as described in section 3. Section 5 gives a more detailed explanation of the operation of RSPOP in the process of identifying dispensable attributes and interpretable rules. The next section presents experimental results using published data sets and a real-world application involving highway traffic flow prediction to evaluate the effectiveness of using RSPOP to identify the fuzzy rules in the pseudo outer-product–based fuzzy neural network using the compositional rule of inference and singleton fuzzifier (POPFNN-CRI(S)) architecture.
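The role of the objective measure in accepting or rejecting a candidate reduct can be sketched as follows (an illustrative reading of the criterion above; the function names and the accept/reject wrapper are our assumptions, not the authors' code):

```python
def objective_ok(o_before, o_after, l_max):
    """Accept a candidate reduct only if inference does not deteriorate.
    o_before / o_after: consequent-layer activations o^IV_{m,l} keyed by label l,
    before and after reduction; l_max: the correct label, i.e. the one with
    maximum output membership for the training target y_{m,p}.
    Deterioration: some wrong label's activation rises above the correct one's."""
    improved = o_after[l_max] >= o_before[l_max]
    deteriorated = any(o_after[l] > o_after[l_max] for l in o_after if l != l_max)
    return improved and not deteriorated

# The correct label (l_max = 2) gains activation and stays dominant: reduct accepted.
o_before = {1: 0.2, 2: 0.7, 3: 0.1}
o_after = {1: 0.2, 2: 0.8, 3: 0.1}
print(objective_ok(o_before, o_after, l_max=2))  # -> True
```

A reduct that lets a wrong consequent label overtake the correct one would return `False` and, as described in section 3, would not be used.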
RSPOP
221
5 Numerical Experiments

Two sets of experiments are performed to evaluate the performance of the proposed RSPOP algorithm in the identification of fuzzy rules in the POPFNN-CRI(S) architecture (Ang et al., 2003): (1) three data sets extracted from the published papers of Nakanishi et al. (Nakanishi, Turksen, & Sugeno, 1993; Sugeno & Yasukawa, 1993) and (2) highway traffic modeling and prediction.
5.1 Nakanishi Data Set. This section outlines the experiments performed to evaluate the effectiveness of fuzzy rules identified using the proposed RSPOP algorithm in the POPFNN-CRI(S) architecture (abbreviated as RSPOP-CRI) on three data sets: a nonlinear system, the human operation of a chemical plant, and the daily price of a stock in a stock market. These three data sets were used by Nakanishi et al. (1993) to evaluate six reasoning methods: the Mamdani version of Zadeh's CRI, Turksen's version of interval-valued CRI, Turksen's versions of point-valued and interval-valued AARS, and Sugeno's versions of position and gradient type of reasoning. They are also used in the classic paper by Sugeno and Yasukawa (1993), where qualitative modeling based on fuzzy logic is proposed. A qualitative model of a system is one built from numerical input-output data without prior knowledge of the system. The use of fuzzy-logic-based qualitative modeling instead of nonfuzzy modeling stems from the fact that system behavior can be qualitatively described using natural language where a precise mathematical model of these complex systems cannot be obtained (Sugeno & Yasukawa, 1993).

Each of these three data sets is split into three groups: A, B, and C (Nakanishi et al., 1993). In the following experiments, groups A and B of all data sets are used as the training data sets, and group C is used as the ex ante test data set. The experimental results are compared against POPFNN-CRI(S) using the same gaussian membership functions generated in RSPOP-CRI but with fuzzy rules identified using POP (abbreviated as POP-CRI), and against other neuro-fuzzy systems: adaptive-network-based fuzzy inference systems (Jang, 1993) with subtractive clustering (Chiu, 1994; denoted as ANFIS), evolving fuzzy neural networks (denoted as EFuNN; Kasabov, 2001), and the dynamic evolving neural-fuzzy inference system (denoted as DENFIS; Kasabov & Song, 2002).
ANFIS is configured with a range of influence of 0.5 for subtractive clustering, while EFuNN and DENFIS are both configured with default parameters.
5.1.1 Example 1: A Nonlinear System. In this experiment, POPFNN-CRI(S) is used to model the nonlinear system. Table 2 shows the centroid v and width w of the gaussian membership functions generated using a modified variant of the learning vector quantization (Kohonen, 1989) algorithm (abbreviated as MLVQ) from the training data set.1 A total of 4, 3, 4, and 4 clusters from the input data X1 to X4, respectively, and a total of 4 clusters from the output data Y1 are generated. The gaussian membership functions generated form the input label and output label nodes of POPFNN-CRI(S).

Table 2: Gaussian Membership Functions of Example 1 Generated for POPFNN-CRI(S).

Label j | X1 v (w)      | X2 v (w)      | X3 v (w)      | X4 v (w)      | Y1 v (w)
1       | 1.599 (0.216) | 1.293 (0.140) | 1.459 (0.209) | 1.185 (0.105) | 1.499 (0.498)
2       | 2.535 (0.561) | 2.206 (0.548) | 2.615 (0.693) | 2.261 (0.646) | 2.330 (0.498)
3       | 3.527 (0.567) | 4.412 (0.329) | 4.069 (0.480) | 3.599 (0.445) | 3.526 (0.718)
4       | 4.472 (0.209) | —             | 4.870 (0.054) | 4.340 (0.102) | 4.869 (0.806)

Table 3 shows the initial and reduced rules identified using RSPOP with the same training data set. The feature selection method in Nakanishi et al. (1993) discards the inputs X3 and X4, whereas RSPOP discards the input X3 and only partially discards X4 in some of the rules. A detailed explanation of the attribute and rule reduction process of RSPOP follows, with reference to Table 3. From the identified input and output membership functions, the sets CI = {c1, c2, c3, c4} and DV = {d1} are used to represent the condition and consequent attributes, respectively. The attribute values c1 ∈ {1, 2, 3, 4}, c2 ∈ {1, 2, 3}, c3 ∈ {1, 2, 3, 4}, c4 ∈ {1, 2, 3, 4}, and d1 ∈ {1, 2, 3, 4} are used to represent the 4, 3, 4, 4, and 4 gaussian membership functions for condition variables c1 to c4 and consequent variable d1, respectively. RSPOP identified an initial 22 rules before attribute and rule reduction. In the process of attribute reduction, one reduct ((CI − {c3}), DV) that results in consistent rules is identified. Using this reduct, the computation of the objective measure resulted in no deterioration; thus this reduct is used, and duplicated fuzzy rules are removed. An example of duplicated fuzzy rules under this reduct from the initial rules in Table 3 is the pair R9 and R10. After attribute reduction, the attribute c3 is redundant, and the number of rules is reduced from 22 to 20 after removing duplicated rules. In the process of rule reduction, several groups of rules are identified with dispensable conditions. An example of a pair of rules with dispensable condition c1 from the attribute-reduced fuzzy rules in Table 3 is R5 and R9.
However, the computation of the objective measure using this pair of reduced rules resulted in deterioration; thus, this
1 The authors' work on the generation of interpretable fuzzy membership functions will be communicated in a separate article. It suffices here to consider the algorithm as a modified variant of the learning vector quantization algorithm as used in POPFNN-TVR (Zhou & Quek, 1996).
Table 3: Fuzzy Rules of Example 1 Identified Using RSPOP. Each rule k maps condition attributes (c1, c2, c3, c4) to consequent d1; a 0 marks a condition removed by reduction.

Rules from RSPOP rule identification (22 rules):
R1: (2,3,1,1)→1   R2: (2,1,2,1)→3   R3: (1,2,3,1)→2   R4: (4,2,4,1)→1
R5: (3,2,1,2)→2   R6: (3,3,1,2)→1   R7: (1,1,2,2)→2   R8: (3,1,2,2)→2
R9: (2,2,2,2)→2   R10: (2,2,3,2)→2  R11: (4,3,3,2)→1  R12: (3,2,4,2)→2
R13: (1,3,1,3)→2  R14: (4,3,1,3)→1  R15: (1,2,2,3)→3  R16: (3,2,2,3)→1
R17: (3,3,2,3)→1  R18: (1,1,3,3)→3  R19: (2,1,4,3)→2  R20: (2,2,1,4)→2
R21: (1,2,2,4)→3  R22: (4,3,2,4)→1

Rules after RSPOP attribute reduction (20 rules; c3 removed):
R1: (2,3,0,1)→1   R2: (2,1,0,1)→3   R3: (1,2,0,1)→2   R4: (4,2,0,1)→1
R5: (3,2,0,2)→2   R6: (3,3,0,2)→1   R7: (1,1,0,2)→4   R8: (3,1,0,2)→2
R9: (2,2,0,2)→2   R10: (4,3,0,2)→1  R11: (1,3,0,3)→2  R12: (4,3,0,3)→1
R13: (1,2,0,3)→3  R14: (3,2,0,3)→1  R15: (3,3,0,3)→1  R16: (1,1,0,3)→3
R17: (2,1,0,3)→2  R18: (2,2,0,4)→2  R19: (1,2,0,4)→3  R20: (4,3,0,4)→1

Rules after RSPOP rule reduction (17 rules):
R1: (2,3,0,1)→1   R2: (2,1,0,1)→3   R3: (1,2,0,1)→2   R4: (4,2,0,1)→1
R5: (3,2,0,2)→2   R6: (3,3,0,0)→1   R7: (1,1,0,2)→4   R8: (3,1,0,2)→2
R9: (2,2,0,2)→2   R10: (4,3,0,0)→1  R11: (1,3,0,3)→2  R12: (1,2,0,3)→3
R13: (3,2,0,3)→1  R14: (1,1,0,3)→3  R15: (2,1,0,3)→2  R16: (2,2,0,4)→2
R17: (1,2,0,4)→3
pair of rules is not reduced. As the rule reduction process continues, groups of reducible rules that result in no deterioration of the objective measure are used with their dispensable conditions set to 0, and the number of rules is reduced from 20 to 17.

The numbers 0–4 that represent condition and consequent labels in Table 3 do not by themselves show that the fuzzy rules identified are interpretable. To illustrate the intuitiveness of the fuzzy rules identified, a mapping of semantic labels to each condition and consequent label is required. As formulated in Zadeh (1975a, 1975b, 1975c), a linguistic variable is characterized by a quintuple (L, T(L), U, G, M), where L is the name of the variable; T(L) is the linguistic term set of L; U is a universe of discourse; G is a syntactic rule that generates T(L); and M is a semantic rule that associates each T(L) with its meaning. Here, the names L of the input and output linguistic variables of the data set are {X1, X2, X3, X4} and {Y}, which are represented by condition and consequent variables {c1, c2, c3, c4} and {d1}, respectively. The linguistic terms are represented by the attribute values, which are {1, 2, 3, 4} for attribute variables c1, c3, c4, d1 and {1, 2, 3} for attribute variable c2. The meaning M of each linguistic label is represented by its gaussian membership function in Table 2. A mapping of semantic labels such as T(·) = {Low, Medium, High, Very High} reveals the intuitiveness of the rules identified in Table 3, as shown in Figure 3 and Table 4.
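Using the X1 membership functions from Table 2 and the term set T(·) above, the semantic mapping can be illustrated as follows (the common exp(−(x − v)² / 2w²) form of the gaussian membership function is assumed; the helper names are ours):

```python
import math

def gaussian(x, v, w):
    """Gaussian membership with centroid v and width w."""
    return math.exp(-((x - v) ** 2) / (2 * w ** 2))

# Fuzzy sets of input X1 from Table 2, mapped to the semantic term set T(.)
terms = {"Low": (1.599, 0.216), "Medium": (2.535, 0.561),
         "High": (3.527, 0.567), "Very High": (4.472, 0.209)}

def linguistic_label(x):
    """Return the linguistic term whose membership is maximal at x."""
    return max(terms, key=lambda t: gaussian(x, *terms[t]))

print(linguistic_label(2.6))  # -> Medium
```

An input value near a centroid is thus read off directly as its linguistic term, which is what makes rules such as those in Table 4 intuitive.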
[Figure 3 comprises four panels showing the gaussian membership functions µ1,j(x1), µ2,j(x2), µ4,j(x4), and µ1,l(y1), each labeled with its term set T(1) Low, T(2) Medium, T(3) High, and, where present, T(4) Very High.]
Figure 3: Semantic interpretation of the fuzzy sets in Table 2.
After the fuzzy rules are identified, the output on the test data set from example 1 is computed. Figure 4 shows the experimental results obtained from RSPOP-CRI compared to POP-CRI and other neuro-fuzzy systems. Consolidated experimental results on the three data sets in terms of the square of the Pearson product-moment correlation value (R2) and mean square error (MSE) are given in Table 9.

5.1.2 Example 2: Human Operation of a Chemical Plant. In this experiment, POPFNN-CRI(S) is used to model the human operation of a chemical plant. Table 5 shows the centroid v and width w of the gaussian membership functions generated using MLVQ from the training data set.
Table 4: Semantic Interpretation of the Fuzzy Rules in Table 3.

R1: IF X1 is medium AND X2 is high AND X4 is low THEN Y is low
R4: IF X1 is very high AND X2 is medium AND X4 is low THEN Y is low
R6: IF X1 is high AND X2 is high THEN Y is low
R10: IF X1 is very high AND X2 is high THEN Y is low
R13: IF X1 is high AND X2 is medium AND X4 is high THEN Y is low
R3: IF X1 is low AND X2 is medium AND X4 is low THEN Y is medium
R5: IF X1 is high AND X2 is medium AND X4 is medium THEN Y is medium
R8: IF X1 is high AND X2 is low AND X4 is medium THEN Y is medium
R9: IF X1 is medium AND X2 is medium AND X4 is medium THEN Y is medium
R11: IF X1 is low AND X2 is high AND X4 is high THEN Y is medium
R15: IF X1 is medium AND X2 is low AND X4 is high THEN Y is medium
R16: IF X1 is medium AND X2 is medium AND X4 is very high THEN Y is medium
R2: IF X1 is medium AND X2 is low AND X4 is low THEN Y is high
R12: IF X1 is low AND X2 is medium AND X4 is high THEN Y is high
R14: IF X1 is low AND X2 is low AND X4 is high THEN Y is high
R17: IF X1 is low AND X2 is medium AND X4 is very high THEN Y is high
R7: IF X1 is low AND X2 is low AND X4 is medium THEN Y is very high
[Figure 4 plots data value against data instance for the series Actual, RSPOP-CRI, POP-CRI, Sugeno (R-G), Mamdani, Turksen (IVCRI), and ANFIS.]
Figure 4: Experimental results of RSPOP-CRI on test data of example 1 compared against POP-CRI and other neuro-fuzzy systems.
K. Ang and C. Quek
Table 5: Gaussian Membership Functions of Example 2 Generated for POPFNN-CRI(S).

Label j | X1 v (w)      | X2 v (w)       | X3 v (w)        | X4 v (w)       | X5 v (w)       | Y1 v (w)
1       | 4.615 (0.542) | −0.244 (0.055) | 693.5 (1036.5)  | −0.389 (0.111) | −0.100 (0.060) | 909.47 (1002.2)
2       | 5.519 (0.258) | −0.152 (0.037) | 2421.0 (809.9)  | −0.204 (0.062) | 0.000 (0.060)  | 2579.7 (774.14)
3       | 5.950 (0.258) | −0.090 (0.037) | 3770.9 (809.9)  | −0.100 (0.060) | 0.100 (0.056)  | 3870 (774.14)
4       | 6.522 (0.344) | 0.001 (0.054)  | 6742.3 (1782.8) | 0.000 (0.060)  | 0.193 (0.056)  | 6757.8 (1732.7)
5       | —             | 0.154 (0.092)  | —               | 0.100 (0.056)  | —              | —
6       | —             | —              | —               | 0.194 (0.056)  | —              | —
A total of 4, 5, 4, 6, and 4 clusters from the input data X1 to X5, respectively, and a total of 4 clusters from the output data Y1 are generated. Table 6 shows the initial and reduced rules identified using RSPOP. RSPOP identified an initial 24 rules, which reduced to 14 rules after attribute reduction and remained at 14 rules after rule reduction. The feature selection method in Nakanishi et al. (1993) discards the inputs X2, X4, and X5, whereas RSPOP discards the inputs X1, X2, and X5. Figure 5 shows the experimental results obtained from RSPOP-CRI compared against POP-CRI and other neuro-fuzzy systems. Consolidated experimental results on the three data sets are given in Table 9. Similar to example 1, a term set T(·) has to be used to map the attribute values semantically. The semantic interpretation of the fuzzy rules identified is omitted henceforth, since translating the gaussian membership functions generated in Table 5 and the fuzzy rules identified in Table 6 is a rather intuitive process.

5.1.3 Example 3: Daily Data of a Stock in a Stock Market. In this experiment, POPFNN-CRI(S) is used to model the daily data of a stock in a stock market. Table 7 shows the centroid v and width w of the gaussian membership functions generated using MLVQ from the training data set. A total of three to five clusters from each of the input data X1 to X10 and a total of four clusters from the output data Y1 are generated. Table 8 shows the initial and reduced rules identified using RSPOP. RSPOP identified an initial 50 rules, which reduced to 29 rules after attribute and rule reduction. The feature selection method in Nakanishi et al. (1993) discards the inputs X1, X2, X3, X6, X7, X9, and X10, whereas RSPOP discards the inputs X1, X2, X3, X6, and X10 and partially discards X5, X7, X8, and X9 in some of the rules. Figure 6 shows the experimental results obtained from POPFNN-CRI(S) using the proposed RSPOP compared against POPFNN-CRI(S) using POP and other neuro-fuzzy systems.
Consolidated experimental results on the three data sets are given in Table 9.

5.1.4 Discussion. Table 9 shows the consolidated experimental results of examples 1 to 3 in terms of the square of the Pearson product-moment correlation value (R2) and MSE.
Table 6: Fuzzy Rules of Example 2 Identified Using RSPOP. Each rule k maps condition attributes (c1, c2, c3, c4, c5) to consequent d1; a 0 marks a condition removed by reduction.

Rules from RSPOP rule identification (24 rules):
R1: (1,4,4,2,1)→4   R2: (2,4,2,3,1)→2   R3: (1,4,4,3,1)→4   R4: (1,1,3,4,1)→4
R5: (1,4,4,4,1)→4   R6: (1,4,4,5,1)→4   R7: (3,4,3,6,1)→3   R8: (2,1,2,3,2)→2
R9: (2,2,3,3,2)→3   R10: (1,4,4,3,2)→4  R11: (1,3,4,4,2)→4  R12: (1,4,4,4,2)→4
R13: (2,1,3,5,2)→3  R14: (2,5,3,1,3)→3  R15: (3,5,3,2,3)→3  R16: (1,2,4,2,3)→4
R17: (4,1,1,3,3)→1  R18: (4,4,1,3,3)→1  R19: (1,4,4,3,3)→4  R20: (3,2,1,4,3)→1
R21: (4,3,1,4,3)→1  R22: (1,4,4,4,3)→4  R23: (4,3,1,5,3)→1  R24: (1,4,4,3,4)→4

Rules after RSPOP attribute and rule reduction (14 rules; c1, c2, and c5 removed):
R1: (0,0,4,2,0)→4   R2: (0,0,2,3,0)→2   R3: (0,0,4,3,0)→4   R4: (0,0,3,4,0)→4
R5: (0,0,4,4,0)→4   R6: (0,0,4,5,0)→4   R7: (0,0,3,6,0)→3   R8: (0,0,3,3,0)→3
R9: (0,0,3,5,0)→3   R10: (0,0,3,1,0)→3  R11: (0,0,3,2,0)→3  R12: (0,0,1,3,0)→1
R13: (0,0,1,4,0)→1  R14: (0,0,1,5,0)→1
Table 7: Gaussian Membership Functions of Example 3 Generated for POPFNN-CRI(S).

Label j | X1 v (w)       | X2 v (w)       | X3 v (w)        | X4 v (w)        | X5 v (w)
1       | −0.052 (0.070) | −0.029 (0.082) | −12.231 (4.696) | −6.541 (10.512) | −3.429 (0.625)
2       | 0.065 (0.069)  | 0.108 (0.082)  | −4.405 (4.696)  | 10.980 (8.707)  | −2.388 (0.625)
3       | 0.181 (0.069)  | 0.280 (0.059)  | 10.460 (5.580)  | 25.491 (8.707)  | −0.898 (0.462)
4       | 0.296 (0.069)  | 0.378 (0.059)  | 19.760 (4.175)  | —               | −0.127 (0.462)
5       | 0.412 (0.069)  | —              | 26.719 (4.175)  | —               | 0.922 (0.629)

Label j | X6 v (w)        | X7 v (w)       | X8 v (w)        | X9 v (w)        | X10 v (w)       | Y1 v (w)
1       | −13.671 (5.492) | −8.332 (2.184) | −17.637 (4.161) | −18.024 (4.585) | −19.222 (4.658) | −19.122 (13.256)
2       | −4.518 (3.085)  | −4.692 (2.137) | −10.702 (4.161) | −10.382 (3.441) | −11.460 (4.658) | 2.972 (12.372)
3       | 0.623 (3.085)   | −1.130 (2.137) | −3.322 (3.301)  | −4.647 (2.481)  | −2.962 (3.131)  | 23.592 (5.461)
4       | 7.815 (4.315)   | 3.489 (2.771)  | 2.180 (3.301)   | −0.512 (2.481)  | 2.256 (2.671)   | 32.693 (5.461)
5       | —               | —              | 7.818 (3.382)   | 5.248 (3.456)   | 6.708 (2.671)   | —
[Figure 5 plots data value against data instance for the series Actual, RSPOP-CRI, POP-CRI, Sugeno (R-G), Mamdani, Turksen (IVCRI), and DENFIS.]
Figure 5: Experimental results of RSPOP-CRI on test data of example 2 compared against POP-CRI and other neuro-fuzzy systems.
[Figure 6 plots data value against data instance for the series Actual, RSPOP-CRI, POP-CRI, Sugeno (R-G), Mamdani, Turksen (IVCRI), and ANFIS.]
Figure 6: Experimental results of RSPOP-CRI on test data of example 3 compared against POP-CRI and other neuro-fuzzy systems.
Table 8: Fuzzy Rules of Example 3 Identified Using RSPOP. Each rule k maps condition attributes (c1, ..., c10) to consequent d1; a 0 marks a condition removed by reduction.

Rules from RSPOP rule identification (rules 1–17 and 41–50 of the 50 identified shown):
R1: (4,1,3,1,2,1,3,3,1,1)→2
R2: (4,2,3,1,2,2,2,1,2,2)→2
R3: (4,2,3,2,3,2,3,3,2,2)→1
R4: (4,1,3,1,1,2,1,3,3,2)→2
R5: (4,1,4,1,2,4,4,2,3,3)→2
R6: (3,1,2,1,4,3,2,3,3,3)→2
R7: (1,1,1,1,3,3,3,3,3,3)→2
R8: (3,1,3,1,4,2,2,4,3,3)→2
R9: (5,2,4,1,4,2,3,4,3,3)→2
R10: (2,1,1,1,4,3,3,4,3,3)→2
R11: (5,1,5,1,5,3,4,4,3,3)→2
R12: (1,1,2,1,3,3,2,3,4,3)→2
R13: (4,3,3,1,4,3,2,3,4,3)→4
R14: (1,1,1,1,3,3,3,3,4,3)→2
R15: (2,1,2,1,4,3,3,3,4,3)→2
R16: (1,1,1,1,3,3,4,3,4,3)→2
R17: (1,1,1,1,5,3,3,4,4,3)→2
...
R41: (4,4,3,3,5,3,3,4,5,4)→1
R42: (5,2,5,1,5,3,4,4,5,4)→2
R43: (5,2,5,1,5,3,3,4,5,5)→2
R44: (4,4,3,2,5,4,3,4,5,5)→1
R45: (5,4,5,3,5,3,4,4,5,5)→1
R46: (4,4,3,2,5,3,3,5,5,5)→1
R47: (5,4,5,3,5,3,3,5,5,5)→1
R48: (1,2,2,1,5,3,4,5,5,5)→2
R49: (5,4,3,3,5,3,4,5,5,5)→1
R50: (5,3,4,2,5,4,4,5,5,5)→1
relation value (R2 ) and MSE. Comparing the modeling accuracy in terms of MSE, POP-CRI outperformed the rest on example 1, DENFIS outperformed the rest on example 2, and RSPOP-CRI outperformed the rest on example 3. From Table 9, ANFIS and DENFIS yielded significantly accurate performance on all three data sets. As these two neuro-fuzzy systems are based on the TSK model (Sugeno & Kang, 1988; Takagi & Sugeno, 1985) used for precise fuzzy modeling, these models generally yield decreased interpretability but increased accuracy (Casillas et al., 2003). In contrast, POPFNN-CRI(S) is based on the Mamdani model (Mamdani & Assilian, 1975) used for linguistic fuzzy modeling, which is focused on interpretability. The objective of the comparison with the recent TSK-based models is to obtain a quality check on the performance of RSPOP-CRI since interpretability is not directly comparable between these two models.
Table 8, continued. Rules after RSPOP attribute and rule reduction (29 rules; c1, c2, c3, c6, and c10 removed; c5, c7, c8, and c9 partially removed in some rules):
R1: (0,0,0,1,2,0,3,3,1,0)→2
R2: (0,0,0,1,2,0,2,1,2,0)→2
R3: (0,0,0,2,3,0,3,3,2,0)→1
R4: (0,0,0,1,1,0,1,3,3,0)→2
R5: (0,0,0,1,2,0,4,2,3,0)→2
R6: (0,0,0,1,4,0,2,0,3,0)→2
R7: (0,0,0,1,3,0,3,3,3,0)→2
R8: (0,0,0,1,4,0,3,4,3,0)→2
R9: (0,0,0,1,5,0,0,4,0,0)→2
R10: (0,0,0,1,0,0,0,3,4,0)→2
R11: (0,0,0,1,4,0,2,3,4,0)→4
R12: (0,0,0,1,4,0,3,0,4,0)→2
R13: (0,0,0,1,2,0,4,1,5,0)→2
R14: (0,0,0,1,4,0,4,5,3,0)→2
R15: (0,0,0,3,5,0,4,5,0,0)→1
R16: (0,0,0,3,5,0,0,4,0,0)→1
R17: (0,0,0,1,4,0,4,4,4,0)→2
R18: (0,0,0,3,4,0,4,4,4,0)→1
R19: (0,0,0,1,5,0,2,5,4,0)→2
R20: (0,0,0,2,5,0,4,5,4,0)→1
R21: (0,0,0,1,4,0,2,2,5,0)→1
R22: (0,0,0,1,4,0,3,3,5,0)→2
R23: (0,0,0,2,4,0,3,3,5,0)→3
R24: (0,0,0,2,4,0,3,4,5,0)→2
R25: (0,0,0,2,5,0,3,4,5,0)→1
R26: (0,0,0,2,5,0,3,5,5,0)→1
R27: (0,0,0,3,5,0,3,5,5,0)→1
R28: (0,0,0,1,5,0,4,5,5,0)→2
R29: (0,0,0,2,5,0,4,5,5,0)→1
Table 9: Consolidated Experimental Results on Nakanishi Data Sets.

System          | Ex. 1 R2 | Ex. 1 MSE | Ex. 2 R2 | Ex. 2 MSE      | Ex. 3 R2 | Ex. 3 MSE
POP-CRI         | 0.877    | 0.270     | 0.946    | 5.630 × 10^5   | 0.733    | 76.221
RSPOP-CRI       | 0.856    | 0.383     | 0.983    | 2.124 × 10^5   | 0.922    | 24.859
Sugeno (P-G)    | 0.845    | 0.467     | 0.990    | 1.931 × 10^6   | 0.700    | 168.90
Mamdani         | 0.490    | 0.862     | 0.937    | 6.580 × 10^5   | 0.865    | 40.84
Turksen (IVCRI) | 0.609    | 0.706     | 0.993    | 2.581 × 10^5   | 0.661    | 93.02
ANFIS           | 0.853    | 0.286     | 0.780    | 2.968 × 10^6   | 0.875    | 38.062
EFuNN           | 0.720    | 0.566     | 0.946    | 7.247 × 10^5   | 0.756    | 72.542
DENFIS          | 0.805    | 0.411     | 0.995    | 5.240 × 10^4   | 0.810    | 69.824
Table 10 shows a benchmark of the number of rules and the learning performance of RSPOP-CRI against POP-CRI on the Nakanishi data sets, performed on a Pentium 4 2.0 GHz Dell PC with 512 MB RAM. The table shows that as the dimensionality of examples 1 to 3 increases from 4 to 10, the number of fuzzy rules identified using RSPOP remains bounded from above by the number of tuples of the training data. In contrast, the number of fuzzy rules identified using POP increases exponentially from 192 to 3 million.

From the results on example 1 in Table 10, the number of fuzzy rules is notably reduced from 192 for POP-CRI to 17 for RSPOP-CRI, but the accuracy in terms of MSE is slightly compromised, from 0.270 to 0.383. A quality control check on the accuracy against other neuro-fuzzy systems in Table 9 showed exceptional performance in terms of MSE for POP-CRI, whereas RSPOP-CRI yielded performance comparable to ANFIS, DENFIS, and EFuNN. This shows that the interpretability of the fuzzy rules in POPFNN-CRI(S) identified using RSPOP improved, while accuracy is not significantly compromised on example 1.

From the results on example 2 in Table 10, the number of fuzzy rules identified is significantly reduced from 1,920 for POP-CRI to 14 for RSPOP-CRI, and the accuracy in terms of MSE improved marginally from 5.630 × 10^5 to 2.124 × 10^5. A quality control check on the accuracy against other neuro-fuzzy systems in Table 9 showed exceptional performance in terms of MSE for DENFIS, whereas RSPOP-CRI yielded performance comparable to EFuNN. This shows that the interpretability of the fuzzy rules in POPFNN-CRI(S) identified using RSPOP improved significantly, with a marginal improvement in accuracy on example 2.

From the results on example 3 in Table 10, the number of fuzzy rules identified is extensively reduced from 3 million for POP-CRI to 29 for RSPOP-CRI, and the accuracy in terms of MSE significantly improved from 76.2 to 24.9.
With the increase in the dimension of the input space from 5 in example 2 to 10 in example 3, the number of rules identified using POP increases exponentially from 1920 to 3 million. The POP algorithm, which uses an outer product form of Hebbian learning, has to compute all possible rules given in equation 2.2 during the learning process in equation 2.4. The exponential increase in the number of rules identified with the increase in the dimension of input space and/or the number of input labels has been reported in Quek and Zhou (2001). Although the LazyPOP algorithm was proposed in Quek and Zhou (2001) to reduce the number of rules identified, it relies on heuristic thresholds. Thus, the accuracy and interpretability of the rules identified are heavily dependent on the selection of these threshold values. In contrast, RSPOP enhances POP using the rough set reduction and identifies the reduced number of fuzzy rules without the use of heuristic thresholds. A quality control check on the accuracy with other neuro-fuzzy systems in Table 9 showed superior performance in terms of MSE for RSPOP-CRI. This shows that both the interpretability and the accuracy of POPFNN-CRI(S) using RSPOP to identify fuzzy rules improved significantly on example 3.
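The exponential growth reported for POP can be reproduced by counting label combinations: POP considers one candidate rule per combination of input labels, so the candidate count is the product of the per-input label counts. Using the cluster counts reported for the three examples (those for example 3 are read from Table 7), this gives exactly the 192, 1,920, and 3 million rules cited above:

```python
from math import prod

def possible_rules(labels_per_input):
    """Number of candidate rules POP must consider: one per combination
    of input labels, i.e. the product of the per-input label counts."""
    return prod(labels_per_input)

print(possible_rules([4, 3, 4, 4]))                      # example 1 -> 192
print(possible_rules([4, 5, 4, 6, 4]))                   # example 2 -> 1920
print(possible_rules([5, 4, 5, 3, 5, 4, 4, 5, 5, 5]))    # example 3 -> 3000000
```

This is why RSPOP's bound — the number of rules never exceeding the number of training tuples — matters increasingly as the input dimension grows.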
In addition, Table 10 shows that the time taken to train RSPOP-CRI compared to POP-CRI is reduced from 1,082,866 ms to 184 ms. This shows that the proposed RSPOP algorithm executes significantly faster than the POP algorithm on example 3. Therefore, these experimental results show that as the dimensionality of the data set increases, significant improvement in both interpretability and accuracy, as well as reduced computational complexity, is obtained for POPFNN-CRI(S) using the proposed RSPOP algorithm to identify fuzzy rules compared to the POP algorithm.

5.2 Traffic Flow Prediction. This section outlines the experiments performed to evaluate the effectiveness of the fuzzy rules identified using the proposed RSPOP algorithm in the POPFNN-CRI(S) architecture on the modeling of traffic flow data, where a precise mathematical model cannot be obtained. The raw traffic flow data for the experiments are obtained from Tan (1997). The raw data were collected at site 29, located at exit 15 along the eastbound Pan Island Expressway (PIE) in Singapore, using loop detectors embedded beneath the road surface. The inductive loop detectors were preinstalled by the Land Transport Authority of Singapore (LTA) in 1996 along major roads to facilitate traffic flow data collection. Figure 7 shows the location where the traffic flow data were collected. There are five lanes at the site: two exit lanes (lanes 4–5) and three straight lanes (lanes 1–3) for the main traffic. In the experiment, only the traffic flow data for the three straight lanes (lanes 1–3) are used. The traffic flow data set consists of four attributes: time and the traffic density of lanes 1–3. The POPFNN-CRI(S) architecture is used to model the traffic flow trend, and the trained neuro-fuzzy system is used to predict the traffic density at time t + τ, where τ = 5, 15, 30, 45, 60 minutes. Figure 8 shows the traffic flow density data for the three straight lanes spanning a period of six days: September 5–10, 1996.
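The prediction task pairs the current observation (time plus the densities of lanes 1–3) with the lane density observed τ minutes later. A minimal sketch of forming such training pairs (function and variable names are illustrative, not from the experimental code):

```python
def make_pairs(times, density, steps_ahead):
    """Pair each observation (time, lane densities at t) with the lane-1
    density observed steps_ahead samples later, i.e. at t + tau."""
    X, y = [], []
    for t in range(len(density) - steps_ahead):
        X.append((times[t],) + tuple(density[t]))   # time + densities of lanes 1-3
        y.append(density[t + steps_ahead][0])       # lane-1 density at t + tau
    return X, y

times = [0.0, 0.1, 0.2, 0.3]
density = [(1.0, 0.9, 0.8), (1.2, 1.0, 0.9), (1.5, 1.1, 1.0), (1.4, 1.2, 1.1)]
X, y = make_pairs(times, density, steps_ahead=1)
print(y)  # -> [1.2, 1.5, 1.4]
```

One such target series is built per lane and per horizon τ, giving the 15 prediction tasks (3 lanes × 5 horizons) consolidated in Figure 10.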
The data set is divided into three cross-validation groups of training and test sets, exactly as in Tung and Quek (2004), on the traffic flow prediction case study using the generic self-organizing fuzzy neural network (GenSoFNN) (Tung & Quek, 2002b). These three data set groups are named CV1, CV2, and CV3 and are extracted from the traffic flow data set as shown in Figure 8. The recall of POPFNN-CRI(S) using MLVQ to generate gaussian membership functions and RSPOP to identify fuzzy rules (abbreviated as RSPOP-CRI) with CV1, CV2, CV3 and the prediction results of CV2 and CV3, CV1 and CV3, CV1 and CV2 for τ = 5 minutes are presented in Figure 9. The square of the Pearson product-moment correlation value (R2) and the MSE are then computed and averaged to form the prediction results of lane 1 traffic flow density for τ = 5 minutes. A total of 15 prediction results from RSPOP-CRI are subsequently consolidated in Figure 10 for the prediction of lanes 1 to 3 for τ = 5, 15, 30, 45, 60 minutes. Figure 10 also presents the consolidated traffic flow prediction results of GenSoFNN (from Tung & Quek, 2004), POPFNN-CRI(S) with the same gaussian membership functions gen-
Table 10: Benchmark of POPFNN-CRI(S) Using POP Against RSPOP on Nakanishi Data Sets.

Example | Number of Attributes | Number of Training Tuples
1       | 4                    | 25
2       | 5                    | 35
3       | 10                   | 50

POP-CRI — Example | Rules Identified | Train Time (ms) | MSE of Test Data
1                 | 192              | 46              | 0.270
2                 | 1,920            | 485             | 5.630 × 10^5
3                 | 3,000,000        | 1,082,866       | 76.221

RSPOP-CRI — Example | Rules Identified | Train Time (ms) | Rules After Attribute Reduction | Rules After Rule Reduction | MSE of Test Data
1                   | 22               | 188             | 20                              | 17                         | 0.383
2                   | 24               | 202             | 14                              | 14                         | 2.124 × 10^5
3                   | 50               | 184             | 43                              | 29                         | 24.859

Table 11: Benchmark on the Average Number of Rules and Average MSE of Using RSPOP-CRI Against POP-CRI and Other Neuro-Fuzzy Systems on Traffic Flow Prediction.

Neuro-fuzzy System | Number of Rules | MSE
RSPOP-CRI          | 14.4            | 0.146
POP-CRI            | 40.0            | 0.173
GenSoFNN-CRI(S)    | 50.0            | 0.164
EFuNN              | 234.5           | 0.186
DENFIS             | 9.7             | 0.153
Figure 7: Location of site 29 along the Pan Island Expressway in Singapore, where traffic flow data are collected.
erated but using POP to identify the fuzzy rules (abbreviated as POP-CRI), evolving fuzzy neural networks (EFuNN; Kasabov, 2001), and the dynamic evolving neural-fuzzy inference system (DENFIS; Kasabov & Song, 2002). Table 11 shows a benchmark on the average number of rules and the average MSE of RSPOP-CRI compared to POP-CRI and other neuro-fuzzy systems. Figure 10 clearly shows a superior prediction accuracy of RSPOP-CRI in terms of both R2 and MSE compared against other neuro-fuzzy systems. Table 11 shows that on average, RSPOP identifies significantly fewer rules with increased accuracy compared to POP on the same POPFNN-CRI(S) architecture. Table 11 also shows that RSPOP-CRI uses an average of 14.4 fuzzy rules, fewer than other neuro-fuzzy systems except DENFIS, which uses a lower average of 9.7 fuzzy rules. However, DENFIS is based on the TSK model (Sugeno & Kang, 1988; Takagi & Sugeno, 1985) used for precise fuzzy modeling, which generally yields decreased interpretability but increased accuracy (Casillas et al., 2003). In contrast, POPFNN-CRI(S) is based on the Mamdani model (Mamdani & Assilian, 1975) used for linguistic fuzzy modeling, which is focused on interpretability. The objective of comparing RSPOP-CRI with DENFIS is to obtain a quality check on the prediction accuracy of RSPOP-CRI since interpretability is not directly comparable between these two models. The experimental result shows that the proposed RSPOP
[Figure 8 plots the normalized traffic density of lanes 1–3 against normalized time for September 5–10, 1996 (Thursday to Tuesday), with the cross-validation groups CV1, CV2, and CV3 marked.]
Figure 8: Traffic flow density of three straight lanes along PIE at site 29.
algorithm is able to identify effective fuzzy rules, which improves the accuracy of POPFNN-CRI(S) used for linguistic fuzzy modeling to the extent that it is comparable to DENFIS used for precise fuzzy modeling. Therefore, the experimental results once again show that significant improvement in both interpretability and accuracy is obtained for POPFNN-CRI(S) using the proposed RSPOP algorithm to identify fuzzy rules.

6 Conclusion

A novel rough set pseudo outer-product (RSPOP) algorithm is proposed in this article for the identification of fuzzy rules in pseudo outer-product-based fuzzy neural networks (POPFNN) (Ang et al., 2003; Zhou & Quek, 1996; Quek & Zhou, 1999). The proposed RSPOP algorithm addresses the problem of the exponential increase in the number of fuzzy rules identified, as well as the increased computational complexity, faced by the pseudo outer-product (POP) fuzzy rule identification algorithm (Quek & Zhou, 2001) on high-dimensional data. Instead of pruning learned fuzzy rules through the use of heuristic thresholds, as in existing rule identification algorithms, RSPOP integrates the sound concept of knowledge reduction from rough set theory with the POP algorithm. RSPOP not only performs feature selection through the reduction of redundant attributes but also extends the
[Figure 9 comprises six panels of normalized density against time instance: the recall of lane 1 (τ = 5) for CV1, CV2, and CV3 (Actual versus Recall), and the prediction of CV2 & CV3, CV1 & CV3, and CV1 & CV2 (Actual versus Predicted).]
Figure 9: Recall and prediction of lane 1 traffic density of RSPOP-CRI for τ = 5.
reduction to the fuzzy rules without redundant attributes. Because there are many possible reducts for a given rule set, an objective measure is developed based on the POPFNN architecture to identify the reducts that improve rather than deteriorate the inferred consequence after attribute and rule reduction. Using the rough set approach and the objective measure, the proposed RSPOP algorithm is able to identify a significantly reduced number of fuzzy rules compared to the POP algorithm, which improves the interpretability of POPFNN. However, as neuro-fuzzy systems generally
[Figure 10 comprises six panels: average R2 and average MSE against prediction interval τ for lanes 1–3, each comparing RSPOP-CRI, POP-CRI, GenSoFNN, EFuNN, and DENFIS.]
Figure 10: Consolidated experimental results of RSPOP-CRI on traffic flow prediction compared to POP-CRI and other neuro-fuzzy systems.
suffer from an interpretability versus accuracy dilemma, experiments are performed to evaluate both the interpretability and accuracy of using the proposed RSPOP algorithm. Two sets of experiments were performed on the pseudo outer-product-based fuzzy neural network using the compositional rule of inference and singleton fuzzifier (POPFNN-CRI(S)) architecture, using a modified variant of the learning vector quantization (MLVQ) algorithm to generate gaussian membership functions and the proposed RSPOP algorithm to identify fuzzy rules.

In the first set of experiments, the fuzzy rules identified using RSPOP are studied on the published data sets in the work of Nakanishi et al. (1993). Three experiments were performed using groups A and B for training and group C for testing from each data set, similar to the experiments conducted in Nakanishi et al. (1993). The fuzzy rules identified after each step of RSPOP are clearly tabulated for each experiment. A semantic interpretation of the gaussian membership functions and the fuzzy rules identified is provided using the first experimental results. Close examination of the reduced fuzzy rules obtained showed that some of the attributes selected after RSPOP attribute reduction are similar to the attributes selected in Nakanishi et al. (1993). The reduced fuzzy rules showed that the RSPOP rule reduction process is capable of partial attribute selection, whereas the variable selection process in Nakanishi et al. (1993) is capable only of crisp attribute selection. In addition, the RSPOP rule reduction process operates mainly on the identified fuzzy rules, not on the raw training data. This gives RSPOP a strong advantage on ultra-large data sets, since the identified fuzzy sets used by the rules present a horizontally reduced input (Lin & Cercone, 1997). RSPOP subsequently complements this horizontal reduction with vertical reduction using attribute reduction and rule reduction. Most significantly, experimental results showed that the number of fuzzy rules identified using the proposed RSPOP algorithm is bounded from above by the number of tuples of the training data. Thus, the number of rules identified using RSPOP is significantly reduced compared to POP. Furthermore, computational complexity is extensively reduced, and accuracy in terms of MSE is improved, as the dimensionality of the data set increases.
The modeling accuracy of POPFNN-CRI(S) using the proposed RSPOP algorithm is also compared to three other reasoning methods: Mamdani's version of Zadeh's CRI, Turksen's version of interval-valued CRI, and Sugeno's version of position and gradient type reasoning in Nakanishi et al. (1993). In addition, it is compared against other neuro-fuzzy systems: adaptive-network-based fuzzy inference systems (ANFIS; Jang, 1993), evolving fuzzy neural networks (EFuNN; Kasabov, 2001), and the dynamic evolving neural-fuzzy inference system (DENFIS; Kasabov & Song, 2002). ANFIS and DENFIS yielded notably good accuracy on all three data sets. As these two neuro-fuzzy systems are based on the TSK model (Sugeno & Kang, 1988; Takagi & Sugeno, 1985) used for precise fuzzy modeling, these models generally yield decreased interpretability but increased accuracy (Casillas et al., 2003). In contrast, POPFNN-CRI(S) is based on the Mamdani model (Mamdani & Assilian, 1975) used for linguistic fuzzy modeling, which is focused on interpretability. The objective of the comparison with the recent TSK-based models is to obtain a quality check on the accuracy performance of POPFNN-CRI(S) using RSPOP, since interpretability is not directly comparable between these two models. The consolidated experimental results showed that with an increase in the dimensionality of the data, the proposed RSPOP algorithm identifies a significantly reduced number of fuzzy rules, which improves interpretability and yields superior accuracy in the POPFNN-CRI(S) architecture compared to other neuro-fuzzy systems.

The second set of experiments was performed to evaluate the effectiveness of fuzzy rules identified in the POPFNN-CRI(S) architecture using RSPOP on a real-world application involving traffic flow prediction. Experimental results are compared to POPFNN-CRI(S) using the same gaussian membership functions generated but fuzzy rules identified using POP, the generic self-organizing fuzzy neural network (GenSoFNN) using BackPOLE for supervised learning, evolving fuzzy neural networks (EFuNN; Kasabov, 2001), and the dynamic evolving neural-fuzzy inference system (DENFIS; Kasabov & Song, 2002). Comparing the experimental results of POPFNN-CRI(S) using RSPOP against POPFNN-CRI(S) using POP, the attribute and rule reduction of the former not only reduced the number of fuzzy rules identified but also improved prediction accuracy. Comparing the results of POPFNN-CRI(S) using RSPOP against GenSoFNN, the former again yielded better prediction accuracy with fewer rules. This shows that the fuzzy rules identified by the proposed RSPOP algorithm are effective and require no additional supervised tuning. Experimental results also showed that POPFNN-CRI(S) using RSPOP yielded better prediction accuracy compared to EFuNN and DENFIS. However, the former used an average of 14.4 fuzzy rules, whereas DENFIS used only an average of 9.7 fuzzy rules. Although interpretability in terms of the number of fuzzy rules is not directly comparable between the neuro-fuzzy systems, experimental results showed that the proposed RSPOP algorithm is able to identify effective fuzzy rules, which improves the accuracy of POPFNN-CRI(S) used for linguistic fuzzy modeling to the extent that it is comparable to DENFIS used for precise fuzzy modeling.
Therefore, the experimental results once again show that significant improvement in both interpretability and accuracy is obtained for POPFNN-CRI(S) using the proposed RSPOP algorithm to identify fuzzy rules.

The potential of the proposed RSPOP algorithm discussed in this article is exciting because this algorithm extends the application of POPFNN to ultra-large and redundant data without an exponential increase in rules and computational complexity. The proposed RSPOP algorithm not only improves the interpretability of POPFNN by identifying significantly fewer rules; it also improves the accuracy of POPFNN. The strong attribute and rule-reduction properties of RSPOP also free the user of POPFNN from the tedious process of identifying the influential input attributes.

This research is in line with the research direction of the Centre for Computational Intelligence (formerly the Intelligent Systems Laboratory at Nanyang Technological University, Singapore). The Centre for Computational Intelligence undertakes active research in intelligent neuro-fuzzy systems for the modeling of complex, nonlinear, and dynamic problem domains. Examples of neural and neuro-fuzzy systems developed are the modified cerebellar articulation controller (MCMAC; Ang & Quek, 2000), the generic self-organizing fuzzy neural network (GenSoFNN; Tung & Quek, 2002b), and the pseudo outer-product-based fuzzy neural network (POPFNN; Zhou & Quek, 1996). These have been applied to novel and interesting applications such as automated driving (Pasquier, Quek, & Toh, 2001), signature forgery detection (Quek & Zhou, 2002), gear control for automotive continuously variable transmission (Ang, Quek, & Wahab, 2002), and fingerprint verification (Quek, Tan, & Sagar, 2001).

References

Ang, K. K., & Quek, C. (2000). Improved MCMAC with momentum, neighborhood, and averaged. IEEE Transactions on Systems, Man and Cybernetics, Part B, 30(3), 491–500.
Ang, K. K., Quek, C., & Pasquier, M. (2003). POPFNN-CRI(S): Pseudo outer product-based fuzzy neural network using the compositional rule of inference and singleton fuzzifier. IEEE Transactions on Systems, Man and Cybernetics, Part B, 33(6), 838–849.
Ang, K. K., Quek, C., & Wahab, A. (2002). MCMAC-CVT: A novel on-line associative memory based CVT transmission control system. Neural Networks, 15(2), 219–236.
Buckley, J. J., & Hayashi, Y. (1994). Fuzzy neural networks: A survey. Fuzzy Sets and Systems, 66(1), 1–13.
Buckley, J. J., & Hayashi, Y. (1995). Neural nets for fuzzy systems. Fuzzy Sets and Systems, 71(3), 265–276.
Casillas, J., Cordón, O., Herrera, F., & Magdalena, L. (2003). Interpretability issues in fuzzy modeling. Berlin: Springer-Verlag.
Castro, J. L. (1995). Fuzzy logic controllers are universal approximators. IEEE Transactions on Systems, Man and Cybernetics, 25(4), 629–635.
Chakraborty, D., & Pal, N. R. (2004). A neuro-fuzzy scheme for simultaneous feature selection and fuzzy rule-based classification. IEEE Transactions on Neural Networks, 15(1), 110–123.
Chiu, S. L. (1994). Fuzzy model identification based on cluster estimation. Journal of Intelligent and Fuzzy Systems, 2(3), 267–278.
Curry, B. (2003). Rough sets: Current and future developments. Expert Systems, 20(5), 247–250.
Guillaume, S. (2001). Designing fuzzy inference systems from data: An interpretability-oriented review. IEEE Transactions on Fuzzy Systems, 9(3), 426–443.
Han, J., & Kamber, M. (2001). Data mining: Concepts and techniques. San Francisco: Morgan Kaufmann.
Hayashi, I., Nomura, H., Yamasaki, H., & Wakami, N. (1992). Construction of fuzzy inference rules by NDF and NDFL. International Journal of Approximate Reasoning, 6(2), 241–266.
Hebb, D. O. (1949). The organization of behavior: A neuropsychological theory. New York: Wiley.
Ishibuchi, H., Tanaka, H., & Okada, H. (1994). Interpolation of fuzzy if-then rules by neural networks. International Journal of Approximate Reasoning, 10(1), 3–27.
Jang, J.-S. R. (1993). ANFIS: Adaptive-network-based fuzzy inference system. IEEE Transactions on Systems, Man and Cybernetics, 23(3), 665–685.
Jin, Y. (2000). Fuzzy modeling of high-dimensional systems: Complexity reduction. IEEE Transactions on Fuzzy Systems, 8(2), 212–221.
Johansen, T. A., & Babuška, R. (2003). Multiobjective identification of Takagi-Sugeno fuzzy models. IEEE Transactions on Fuzzy Systems, 11(6), 847–860.
Kasabov, N. (2001). Evolving fuzzy neural networks for supervised/unsupervised online. IEEE Transactions on Systems, Man and Cybernetics, Part B, 31(6), 902–918.
Kasabov, N. K., & Song, Q. (2002). DENFIS: Dynamic evolving neural-fuzzy inference system and its application for time-series prediction. IEEE Transactions on Fuzzy Systems, 10(2), 144–154.
Kaynak, O., Jezernik, K., & Szeghegyi, A. (2002). Complexity reduction of rule based models: A survey. In Fuzzy Systems, 2002. FUZZ-IEEE'02. Proceedings of the 2002 IEEE International Conference (Vol. 2, pp. 1216–1221). Piscataway, NJ: IEEE.
Kohonen, T. (1989). Self-organization and associative memory (3rd ed.). Berlin: Springer-Verlag.
Lin, C.-T., & Lee, C. S. G. (1991). Neural-network-based fuzzy logic control and decision system. IEEE Transactions on Computers, 40(12), 1320–1336.
Lin, C.-T., & Lee, C. S. G. (1996). Neural fuzzy systems: A neuro-fuzzy synergism to intelligent systems. Upper Saddle River, NJ: Prentice Hall.
Lin, T. Y., & Cercone, N. (1997). Rough sets and data mining: Analysis of imprecise data. Boston: Kluwer.
Lozowski, A., Cholewo, T. J., & Zurada, J. M. (1996). Crisp rule extraction from perceptron network classifiers. In Proceedings of International Conference on Neural Networks, vol. Plenary, Panel and Special Sessions (pp. 94–99). Washington, D.C.
Mamdani, E. H., & Assilian, S. (1975). An experiment in linguistic synthesis with a fuzzy logic controller. International Journal of Man-Machine Studies, 7(1), 1–13.
Mitra, S., & Hayashi, Y. (2000). Neuro-fuzzy rule generation: Survey in soft computing framework. IEEE Transactions on Neural Networks, 11(3), 748–768.
Moser, B. (1999). Sugeno controllers with a bounded number of rules are nowhere dense. Fuzzy Sets and Systems, 104(2), 269–277.
Nakanishi, H., Turksen, I. B., & Sugeno, M. (1993). A review and comparison of six reasoning methods. Fuzzy Sets and Systems, 57(3), 257–294.
Nauck, D. D. (2003). Measuring interpretability in rule-based classification systems. In Fuzzy Systems, 2003. FUZZ '03. 12th IEEE International Conference (Vol. 1, pp. 196–201). Piscataway, NJ: IEEE.
Nauck, D., Klawonn, F., & Kruse, R. (1997). Foundations of neuro-fuzzy systems. New York: Wiley.
Pal, N. R., & Pal, T. (1999). On rule pruning using fuzzy neural networks. Fuzzy Sets and Systems, 106(3), 335–347.
Pasquier, M., Quek, C., & Toh, M. (2001). Fuzzylot: A novel self-organising fuzzy-neural rule-based pilot system for automated vehicles. Neural Networks, 14(8), 1099–1112.
Pawlak, Z. (1991). Rough sets: Theoretical aspects of reasoning about data. Boston: Kluwer.
Quek, C., Tan, K. B., & Sagar, V. K. (2001). Pseudo-outer product based fuzzy neural network fingerprint verification system. Neural Networks, 14(3), 305–323.
Quek, C., & Zhou, R. W. (1999). POPFNN-AAR(S): A pseudo outer-product based fuzzy neural network. IEEE Transactions on Systems, Man and Cybernetics, Part B, 29(6), 859–870.
Quek, C., & Zhou, R. W. (2001). The POP learning algorithms: Reducing work in identifying fuzzy rules. Neural Networks, 14(10), 1431–1445.
Quek, C., & Zhou, R. W. (2002). Antiforgery: A novel pseudo-outer product based fuzzy neural network driven signature verification system. Pattern Recognition Letters, 23(14), 1795–1816.
Shann, J. J., & Fu, H. C. (1995). A fuzzy neural network for rule acquiring on fuzzy control systems. Fuzzy Sets and Systems, 71(3), 345–357.
Shen, Q., & Chouchoulas, A. (1999). Combining rough sets and data-driven fuzzy learning for generation of classification rules. Pattern Recognition, 32(12), 2073–2076.
Shen, Q., & Chouchoulas, A. (2002). A rough-fuzzy approach for generating classification rules. Pattern Recognition, 35(11), 2425–2438.
Sugeno, M., & Kang, G. T. (1988). Structure identification of fuzzy model. Fuzzy Sets and Systems, 28(1), 15–33.
Sugeno, M., & Yasukawa, T. (1993). A fuzzy-logic-based approach to qualitative modeling. IEEE Transactions on Fuzzy Systems, 1(1), 7.
Swiniarski, R. W., & Skowron, A. (2003). Rough set methods in feature selection and recognition. Pattern Recognition Letters, 24(6), 833–849.
Takagi, T., & Sugeno, M. (1985). Fuzzy identification of systems and its applications to modeling and control. IEEE Transactions on Systems, Man and Cybernetics, 15(1), 116–132.
Tan, G. K. (1997). Feasibility of predicting congestion states with neural network models. Unpublished final year project report, Nanyang Technological University, Singapore.
Tikk, D., & Baranyi, P. (2003). Exact trade-off between approximation accuracy and interpretability: Solving the saturation problem for certain FRBSs. In J. Casillas, O. Cordón, F. Herrera, & L. Magdalena (Eds.), Interpretability issues in fuzzy modeling (pp. 587–604). Berlin: Springer-Verlag.
Tikk, D., Kóczy, L. T., & Gedeon, T. D. (2003). A survey on universal approximation and its limits in soft computing techniques. International Journal of Approximate Reasoning, 33(2), 185–202.
Tung, W. L., & Quek, C. (2002a). DIC: A novel discrete incremental clustering technique for the derivation of fuzzy membership functions. In Proceedings of the 7th Pacific Rim International Conference on Artificial Intelligence (pp. 178–187). Berlin: Springer-Verlag.
Tung, W. L., & Quek, C. (2002b). GenSoFNN: A generic self-organizing fuzzy neural network. IEEE Transactions on Neural Networks, 13(5), 1075–1086.
Tung, W. L., & Quek, C. (2004). Supervised learning in neural fuzzy systems (Part 2)—BackPole: A back propagation algorithm based on objective learning errors. Manuscript submitted for publication.
Yager, R. R. (1994). Modeling and formulating fuzzy knowledge bases using neural networks. Neural Networks, 7(8), 1273–1283.
Yen, J., Wang, L., & Gillespie, C. W. (1998). Improving the interpretability of TSK fuzzy models by combining global learning and local learning. IEEE Transactions on Fuzzy Systems, 6(4), 530–537.
Ying, H., Ding, Y., Li, S., & Shao, S. (1999). Comparison of necessary conditions for typical Takagi-Sugeno and Mamdani fuzzy systems as universal approximators. IEEE Transactions on Systems, Man and Cybernetics, Part A, 29(5), 508–514.
Zadeh, L. A. (1975a). The concept of a linguistic variable and its application to approximate reasoning—I. Information Sciences, 8(3), 199–249.
Zadeh, L. A. (1975b). The concept of a linguistic variable and its application to approximate reasoning—II. Information Sciences, 8(4), 301–357.
Zadeh, L. A. (1975c). The concept of a linguistic variable and its application to approximate reasoning—III. Information Sciences, 9(1), 43–80.
Zadeh, L. A. (1994). Fuzzy logic, neural networks, and soft computing. Communications of the ACM, 37(3).
Zhou, R. W., & Quek, C. (1996). POPFNN: A pseudo outer-product based fuzzy neural network. Neural Networks, 9(9), 1569–1581.

Received September 4, 2003; accepted June 2, 2004.
REVIEW
Communicated by Satinder Singh
Temporal Sequence Learning, Prediction, and Control: A Review of Different Models and Their Relation to Biological Mechanisms
Florentin Wörgötter
[email protected] Bernd Porr
[email protected] Department of Psychology, University of Stirling, Stirling FK9 4LA, Scotland
In this review, we compare methods for temporal sequence learning (TSL) across the disciplines of machine control, classical conditioning, neuronal models for TSL, and spike-timing-dependent plasticity (STDP). This review introduces the most influential models and focuses on two questions: To what degree are reward-based (e.g., TD learning) and correlation-based (Hebbian) learning related? And how do the different models correspond to possibly underlying biological mechanisms of synaptic plasticity? We first compare the different models in an open-loop condition, where behavioral feedback does not alter the learning. Here we observe that reward-based and correlation-based learning are indeed very similar. Machine control is then used to introduce the problem of closed-loop control (e.g., actor-critic architectures). Here the problem of evaluative (rewards) versus nonevaluative (correlations) feedback from the environment will be discussed, showing that both learning approaches are fundamentally different in the closed-loop condition. In trying to answer the second question, we compare neuronal versions of the different learning architectures to the anatomy of the involved brain structures (basal ganglia, thalamus, and cortex) and the molecular biophysics of glutamatergic and dopaminergic synapses. Finally, we discuss the different algorithms used to model STDP and compare them to reward-based learning rules. Certain similarities are found in spite of the strongly different timescales. Here we focus on the biophysics of the different calcium-release mechanisms known to be involved in STDP.

1 Introduction

Flexible reaction in response to events requires foresight. This holds for humans and animals but also for robots or other agents that interact with their environment.
Neural Computation 17, 245–319 (2005)
© 2005 Massachusetts Institute of Technology

Thus, predicting the future has been a central objective not only for soothsayers in ancient times and their more modern equivalents—the stock market analysts—but for every agent that needs to survive in a
sometimes hostile environment. Accurate, and thus useful, predictions can be made only if they are based on prior knowledge, which must be obtained by analyzing the relations between past events. Many events encountered by an agent are linked in a temporally causal way through the laws of physics and the causal structure of the world. Examples from the living and the technical world, some more sensor related and others more related to motor actions, make this clear:

1. Smell precedes taste when foraging, and the sound of moving prey may precede its smell.
2. The sequence “warm,” “hot,” “hotter” should be followed by “very hot” or by “pain.”
3. Leaning sideways will lead to a reflex-like antagonistic motor action in order to prevent falling.
4. The motor action of feeding will lead to a taste sensation.
5. The bell precedes the food in Pavlov’s classical conditioning experiments.
6. The position of a robot arm is preceded by the force patterns that converge onto it.

In these examples we have deliberately mixed sensor-sensor events (e.g., items 1, 2) with motor-motor (item 3) and sensor-motor events (item 4). We will see that theoretical treatment needs to take care of these distinctions. At the same time, we have also introduced unimodal (temperature over time in item 2) and multimodal events (all others), which will also require different treatment. The list of such causally related events or actions, which will occur during the lifetime of an agent, is endless. All of these events have in common that they are temporally (auto- or cross-) correlated with each other such that they form a temporal sequence. Agents that are able to act in response to earlier events in such a sequence chain have, without doubt, an evolutionary advantage. As a consequence, learning to correctly interpret temporal sequences and deduce appropriate actions is a major incentive for any agent fostering its survival.
Actually, the problem with which we are faced is twofold: learn to predict an event, and learn to perform an anticipatory action. These have been called the prediction problem and the control problem (Sutton, 1999).

1.1 Structure of This Review. The goal of this review is threefold. (1) We use machine control to introduce the basic reinforcement learning terminology. (2) We then compare the different neuronal models for reward-based TD learning, which are used in so-called actor-critic architectures of animal control, with each other and with the anatomy and physiology of the basal ganglia. Here we will strongly focus on the biophysics of dopaminergic synapses and try to provide a summary of the most essential synaptic mechanisms that have been identified so far. (3) Finally, we compare reward-based learning mechanisms (e.g., TD learning) with correlation-based Hebbian learning, trying to relate them to the basic mechanisms of long-term potentiation (LTP) and long-term depression (LTD) as found in spike-timing-dependent plasticity (STDP).

Figure 1 shows the organization of this review. This figure depicts the most important links between the different subfields and algorithms; the most influential ones are shown in red. The article proceeds from left (machine learning) to right (synaptic plasticity). The flow in the sections (direction of the arrows) is always from top to bottom, except for all “green” components, which are concerned with biophysical aspects. Here we proceed upward.

Figure 1: Cross links between the different fields (see the text). Small equals signs denote that these algorithms produce identical results after convergence.

For specific reviews on the different fields, we refer readers to reinforcement learning (Kaelbling, Littman, & Moore, 1996; Sutton & Barto, 1998), animal conditioning (Sutton & Barto, 1990; Balkenius & Moren, 1998), the dopamine reward system of the brain (Schultz, 1998, 2002; Schultz & Dickinson, 2000), and data and models of synaptic plasticity (Martinez & Derrick, 1996; Malenka & Nicoll, 1999; Bennett, 2000; Bi & Poo, 2001; van Hemmen, 2001; Bi, 2002).
2 The Basic Concepts of Reinforcement Learning

It is fair to say that currently, almost all methods for predictive control in robots rely on reinforcement learning (RL), and strong indications exist in the literature that this type of learning plays a central role in animals too. A major thread that is followed throughout this article concerns the distinction between reward-based versus correlation-based learning methods, or between evaluative versus nonevaluative feedback (see Figures 7 and 9). RL, at least in its original formulation, is strictly reward based. Hence, an early conclusion would be that animals rely heavily on reward-based learning. Synaptic plasticity, on the other hand, is correlation based (Hebbian). How can this apparent conflict be resolved? How can reward-based mechanisms be reformulated such that they can be captured by correlation-based learning rules? And what are the problems when trying to do this? These are the questions we address in this review.

The central message of this article is that reward-based and correlation-based methods are very similar, if not identical, in the open-loop condition, where the actions of the learner will not influence the learning, while they are clearly different in the closed-loop condition, hence during the normally existing behavioral feedback.

The algorithms for reinforcement learning are treated to a great extent in the literature (Sutton & Barto, 1998) and shall not be described here, but we must describe the basic assumptions of RL to be able to compare it to animal learning. To this end, we will only discuss systems with a finite number of discrete states, commonly known as finite Markov decision problems (MDP).¹ We assume that an RL agent is able to visit these states and that these states will convey information about the decision problem. RL further assumes that in visiting a state, a numerical reward will be collected, where negative numbers may represent punishments.
Each state has a changeable value attached to it. From every state, there are subsequent states that can be reached by means of actions. The value of a given state is basically defined by the averaged future reward, which can be accumulated by starting actions from this particular state. Here, we could look at all or just at some of the follow-up actions that are possible when starting from the given state. The different algorithms for RL take different approaches toward reward averaging, which shall not be discussed here. Actions will follow a policy, which can also change. The goal of RL is to maximize the expected cumulative reward (the “return”) by subsequently visiting a subset of states in the MDP. This, for example, can be done by assuming a given (unchangeable) policy. Often, however, a subgoal of RL is to try to achieve this in an optimal way by finding the best action policy to travel through state-space. In this context, RL methods are mainly employed to address two related problems: the prediction and the control problem.

1. Prediction only. RL is used to learn the value function for the policy followed. At the end of learning, this value function describes for every visited state how much future reward we can expect when performing actions starting at this state.

2. Control. By means of RL, we wish to find that particular set of policies that maximizes the reward when traveling through state-space. This way, we have at the end obtained an optimal policy that allows for action planning and optimal control.

Several algorithms have been designed to calculate or approximate value functions or to find optimal action policies, most notably dynamic programming (Bellman, 1957), Monte Carlo prediction and control, TD learning (Sutton, 1988), SARSA (Rummery, 1995; Sutton, 1996), and Q learning (Watkins, 1989; Watkins & Dayan, 1992). They are reviewed in Sutton and Barto (1998). Most of the above-named algorithms for RL rely on the method of temporal differences (TD methods). The formalism of this method is described in the appendix and its neuronal version later in the main text. Using the TD method, the rewards, which determine learning, enter the formalism in an additive way (see equation A.3). Thus, TD methods are in the first instance strictly noncorrelative, as opposed to Hebb rules, where a multiplicative correlation of pre- and postsynaptic activity drives the learning.

¹ Thus, RL also assumes that such systems “follow the Markov property.” Essentially this means that it is unimportant along which path a certain state has been reached. Once there, the state itself contains all relevant information for future calculations. Many times the Markov property cannot be guaranteed in real-world decision problems, which poses a practical problem when wanting to employ RL methods. Also, we note that conventional RL needs to be augmented by additional mechanisms if one wants to apply it to more complex (e.g., time- and space-continuous) systems. These aspects shall not be discussed here, but see Reynolds (2002) and Santos and Touzet (1999a, 1999b).
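To make the additive role of the reward in the TD update concrete, here is a small tabular TD(0) prediction sketch in Python. The chain MDP, the random-walk policy, and all constants are illustrative assumptions of ours, not part of the reviewed formalism:

```python
import random

def td0_value_prediction(n_states=5, episodes=2000, alpha=0.1, gamma=0.9, seed=0):
    """Tabular TD(0) prediction on a toy chain MDP under a random-walk policy.

    States are 0..n_states-1. Stepping off the right end yields reward +1
    and terminates; stepping off the left end yields reward 0 and terminates.
    The reward enters the update additively:
        V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s))
    """
    rng = random.Random(seed)
    V = [0.0] * n_states
    for _ in range(episodes):
        s = n_states // 2                       # start in the middle
        while True:
            s_next = s + rng.choice((-1, 1))    # random-walk policy
            if s_next < 0:                      # fell off the left end
                r, v_next, done = 0.0, 0.0, True
            elif s_next >= n_states:            # reached the rewarded right end
                r, v_next, done = 1.0, 0.0, True
            else:
                r, v_next, done = 0.0, V[s_next], False
            V[s] += alpha * (r + gamma * v_next - V[s])   # TD(0) update
            if done:
                break
            s = s_next
    return V

values = td0_value_prediction()
# Estimates grow toward the rewarded right end of the chain.
```

Note that the reward r appears only as an additive term inside the TD error; no multiplicative correlation of two activity signals is involved, which is exactly the contrast with Hebbian rules drawn in the text.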
Thus, at first it seems hard to introduce a correlative relation in RL to make it compatible with synaptic plasticity, hence with Hebbian learning. Two observations can be made to mitigate this situation. First, we note that the backward TD(λ) method (see the appendix) contains so-called eligibility traces x, which enter the algorithm in a multiplicative way (see equation A.11) and can thus be used to define a correlation-based process. The concept of such traces shall be explained in a more neuronal context in greater detail in section 3. Here it suffices to say that in this way, TD learning becomes formally related to Hebbian learning.

There is, however, a more interesting, but also more problematic, way to introduce a correlation-based process that is relevant for biological or biomimetic agents: one could try to define the rewards (or punishments) by means of signals derived from sensor inputs. Naively, pain is a punishment, while pleasure is a reward. Thus, rewards can be correlated to sensor events.
This seems to be a simple and obvious way to introduce a correlation-based process mainly used in so-called actor-critic models of machine or animal control, which are introduced later (in section 4.1). Actor-critic architectures are closed-loop control structures, where the actions of the agent (the animal) will influence its own (sensor) inputs. Thus, before we can discuss those, we must first describe the different algorithms in their open-loop versions.

3 Neuronal Architectures for Prediction and Control—Open Loop Condition

The goal of the next sections is to transfer these concepts of RL, for example, TD learning, to neural networks related to different brain structures. At the same time, we will discuss other algorithms that are less directly related to the RL formalism.

3.1 Relating RL to Classical Conditioning. Classical conditioning represents in its simplest form a learning paradigm where the pairing of two subsequent stimuli is learned such that the presentation of the first stimulus is taken as a predictor of the second one. In the descriptions above, we have discussed the prediction problem but have treated only the (action-driven) transition between states. Only a vague indication has been given so far about how to use temporally correlated signals for learning. Thus, the goal of this section is to show how to augment these aspects such that we can use correlations for learning. This way, we will see that reinforcement learning and (differential) Hebbian learning are related to each other.

3.1.1 Early Neural Differential Hebbian Architectures. In the traditional example of Pavlov’s dog (Pavlov, 1927), we have the unconditioned stimulus (US), food, which is preceded by the conditioned stimulus (CS), bell. The CS predicts the US, and after learning, the dog starts salivating in response to the CS. This represents an open-loop paradigm: the action of salivating will not influence the presentation of the stimuli.
This is different from instrumental conditioning, which represents a closed-loop paradigm discussed later in the context of actor-critic architectures. A model of classical conditioning should produce an output that, before learning, responds to the US only; after learning, the output should occur earlier in response to the CS. Furthermore, we note that the US and CS are temporally correlated. To capture this, we need a correlative property in reinforcement learning that in its machine-learning formalism (see the appendix) is not immediately visible, because machine learning relies entirely on the transition between states by means of actions, and correlations do not play any role.
Temporal Sequence, Prediction, and Control
Figure 2: Simple correlative temporal learning mechanism applying an eligibility trace (e-trace) at input x1 (CS) to ensure signal overlap at the moment when x0 (US) occurs; ⊗ denotes the correlation.
This, however, can be achieved by using eligibility traces.2 To this end, we need to restructure the formalism in a neuronal way, where stimuli converge via synapses at a neuron. In the context of classical conditioning, these architectures are generally called stimulus-substitution architectures, because after learning the first, earlier stimulus in effect substitutes for the second, later one. The goal is to generate an output signal at this neuron that will, at the end of learning, directly respond to the CS. This way, the output can be used as a predictor signal for other processes, such as for control. This should be achieved by strengthening the synapse that transmits the CS during learning. Figure 2 shows such a structure at the beginning of learning. Note that this model level is still quite far removed from the biophysical models of synaptic plasticity, which we discuss later. The US converges on the neuron with a strong, driving synapse ω0. To keep things consistent with the descriptions above, one could call the US "reward." The CS precedes the US and converges with an initially weak synapse ω1. However, the CS elicits a decaying trace (e-trace) at its own synapse. This biophysically unspecified trace is meant to capture the idea that the CS synapse remains eligible for modification for some time after the CS has ended. We can now correlate the output v, which at the beginning of learning mirrors the US, with this eligibility trace and use the result of this correlation, Δω1, to change the weight of the CS synapse. This idea was first formalized by Sutton and
2 Historically eligibility traces were first developed in the context of classical conditioning models (Hull, 1939, 1943; Klopf, 1972, 1982) and only later were introduced in the context of machine learning (Sutton, 1988; Singh & Sutton, 1996). Our review emphasizes the structural and functional similarities of the different algorithms; therefore, we do not follow the historical route.
F. Wörgötter and B. Porr
Barto (1981) in their classical modeling study:3

ω1(t + 1) = ω1(t) + γ [v(t) − v̄(t)] x̄(t),   (3.1)

where they introduced two eligibility traces, x̄(t) at the input and v̄(t) at the output, given by

x̄(t + 1) = α x̄(t) + x(t),   (3.2)

v̄(t + 1) = β v̄(t) + (1 − β) v(t),   (3.3)

with control parameters α and β. Mainly they discuss the case of β = 0, where v̄(t) = v(t − 1), which turns their rule into

ω1(t + 1) = ω1(t) + γ [v(t) − v(t − 1)] x̄(t).   (3.4)
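The rule of equations 3.1 to 3.4 can be sketched in a few lines. This is a minimal illustration for the β = 0 case; the pulse timings, trial length, and parameter values are invented for the example and are not taken from the original paper.

```python
# Sketch of the Sutton and Barto (1981) rule, eqs. 3.1-3.4, for beta = 0.
# All timings and parameter values are illustrative choices.
def sutton_barto_1981(steps=60, alpha=0.7, gamma=0.1, isi=5):
    w0, w1 = 1.0, 0.0        # US synapse fixed and strong; CS synapse learns
    xbar = 0.0               # eligibility trace of the CS input (eq. 3.2)
    v_prev = 0.0             # v(t - 1), i.e., vbar for beta = 0 (eq. 3.3)
    for t in range(steps):
        x1 = 1.0 if t == 10 else 0.0         # CS pulse
        x0 = 1.0 if t == 10 + isi else 0.0   # US ("reward") pulse, isi later
        v = w0 * x0 + w1 * x1                # output driven by both inputs
        w1 += gamma * (v - v_prev) * xbar    # eq. 3.4
        xbar = alpha * xbar + x1             # decaying input trace, eq. 3.2
        v_prev = v
    return w1

print(sutton_barto_1981())   # positive: CS-before-US pairing strengthens w1
```

A single CS-US pairing leaves a small net positive weight change, and shorter interstimulus intervals (within the range of the trace) produce larger changes, in line with the role of the decaying trace.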
Before learning, this neuron will respond only to the US, while after learning, it will respond to the CS as well. In a later review, Sutton and Barto (1990) discuss several problems that their older model faces. For example, for short or negative interstimulus intervals (ISIs) between CS and US, the model predicts strong inhibitory conditioning (see Figure 8 in Sutton & Barto, 1990). Some reports exist that show weak inhibitory conditioning, but mostly weak excitatory conditioning seems to be found in these cases as well (Prokasy, Hall, & Fawcett, 1962; Mackintosh, 1974, 1983; Gormezano, Kehoe, & Marshall, 1983).4

3.1.2 Isotropic Sequence Order Learning. Some of the problems of the Sutton and Barto (1981) model come from the fact that input lines are not treated identically, which leads to an asymmetrical behavior of this model. In a more recent approach, we have achieved a symmetrical treatment by employing a specific differential Hebbian learning rule at all synaptic weights (Porr & Wörgötter, 2002, 2003a, 2003b). The main distinguishing features of isotropic sequence order learning (ISO learning) are that (1) all input lines are treated equally, so there is no a priori built-in distinction between CS and US, and all lines learn; (2) eligibility traces are created by bandpass filtering the inputs; (3) learning is purely correlation based, and synapses can grow or shrink depending on

3 Here we briefly mention that we deliberately have not discussed some older work, like the Rescorla-Wagner rule (Rescorla & Wagner, 1972, Fig. 1) or the δ rule of Widrow and Hoff (1960), which is at the root of all these algorithms. Indeed, analytical proofs exist that the results obtained with the δ rule are identical to those obtained with the Rescorla-Wagner rule (Sutton & Barto, 1981) and the TD(1) algorithm (Sutton, 1988), denoted by the small equals signs in Figure 1.
4 More problems are discussed in Sutton and Barto (1990), for example, with respect to the so-called delay-conditioning paradigm, which can be solved by some specific modifications of the Sutton and Barto (1981) model. We refer readers to Sutton and Barto (1990) for these specific issues.
Figure 3: Isotropic sequence order learning. (A) Structure of the algorithm for N + 1 inputs. For notation, see the text. A central property of ISO learning is that all weights can change. (B) Weight-change curve calculated analytically for two inputs with identical resonator characteristics (h). The optimal temporal difference for learning is denoted as Topt. (C) Linear development of ω1 for two active inputs (x0, x1). At time-step 40,000, input x0 is switched off, and as a consequence of the orthogonality property of ISO learning, ω1 stops growing. (D) Development of 10 weights ω1i, i = 0, . . . , 9, in a robot experiment (Porr & Wörgötter, 2003a). All weights are driven by input x1 but are connected to different resonators hi1, which create a serial compound representation of x1 (see Figure 10A). The robot's task was obstacle avoidance. At around t = 150 s, the robot has successfully mastered it, and the input x0, which corresponds to a touch sensor, is not triggered again. As a consequence, we observe that the weights ω1i stop changing (compare to C).
the temporal sequence of their inputs; and (4) inputs can take any form, analog or pulse coded. As a consequence, this approach is linear. Figure 3A shows the structure of the algorithm. The system consists of N + 1 linear filters h receiving inputs x and producing outputs x̄:

x̄i = xi ∗ hi,   (3.5)
where the asterisk denotes a convolution. The transfer functions h shall be those of bandpass filters. They are specified by

h(t) = (1/b) e^{at} sin(bt),   (3.6)

with

a := Re(p) = −π f / Q,  b := Im(p) = √((2π f)² − a²).   (3.7)
f is the frequency of the oscillation and Q the damping characteristic. To get an idea of what these filters do, we consider pulse inputs. In this case, the filtered signals will consist of damped oscillations (Grossberg & Schmajuk, 1989; Grossberg, 1995; Grossberg & Merrill, 1996), which span some temporal interval until they fade. Thus, bandpass filtering essentially amounts to applying an eligibility trace to all inputs. The filtered signals connect with corresponding weights ω to one output unit v. The output v(t) is given as

v(t) = Σ_{i=0}^{N} ωi x̄i.   (3.8)
Learning takes place according to a differential Hebbian learning rule,

dωi/dt = µ x̄i v′,  µ ≪ 1,   (3.9)

where v′ is the temporal derivative of v. Note that µ is very small. (See Porr & Wörgötter, 2003a, for a complete description of the ISO learning algorithm and its properties.) First, we note that the system is linear, and weight changes can be calculated analytically (see Figure 3B), as shown in Porr and Wörgötter (2003a). If we consider just one input, then we find that, after filtering, it is orthogonal to the derivative of its output. Intuitively, this can be understood by looking at two pulse inputs x0, x1. In this case, both trace signals x̄0, x̄1 are damped sine waves, and the derivative of the output is thus a sum of damped cosine waves. Thus, if one input becomes zero, the derivative of the output will be orthogonal to the remaining input. Mathematically, it can be shown that this holds for more than two inputs too (Porr & Wörgötter, 2003a). Thus, inputs will not influence their own synapses, and learning is strictly heterosynaptic. As a consequence, a very nice feature emerges for pairs of synapses: weight change will stop as soon as one input becomes silent (see Figure 3C). This leads to an automatic self-stabilizing property for the network in control applications (Figure 3D; see also section 4.3.1).
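Equations 3.5 to 3.9 can be sketched directly. The following is a minimal illustration for two pulse inputs; the filter parameters, pulse times, and learning rate are invented for the example. The last line also probes the orthogonality property: with the second input silent, the weight no longer changes.

```python
import numpy as np

# Minimal ISO-learning sketch (eqs. 3.5-3.9). Both inputs pass through the same
# damped-oscillator bandpass filter h (eqs. 3.6-3.7), which plays the role of
# the eligibility trace; the weight changes with the product of the filtered
# input and the derivative of the output. All parameter values are illustrative.
f, Q = 0.01, 0.6
a = -np.pi * f / Q                        # eq. 3.7
b = np.sqrt((2 * np.pi * f) ** 2 - a ** 2)
t = np.arange(500)
h = np.exp(a * t) * np.sin(b * t) / b     # eq. 3.6

def trial(w1, T=25, mu=1e-4, us=True):
    x1 = np.zeros(600); x1[100] = 1.0     # CS pulse
    x0 = np.zeros(600)
    if us:
        x0[100 + T] = 1.0                 # US pulse, T steps after the CS
    xb1 = np.convolve(x1, h)[:600]        # eq. 3.5, filtered inputs
    xb0 = np.convolve(x0, h)[:600]
    v = 1.0 * xb0 + w1 * xb1              # eq. 3.8, with w0 = 1 fixed
    dv = np.gradient(v)                   # v', the temporal derivative
    return w1 + mu * np.sum(xb1 * dv)     # eq. 3.9, accumulated over the trial

w1 = 0.0
for _ in range(20):
    w1 = trial(w1)                # w1 grows while CS and US are paired
print(w1)
print(trial(w1, us=False) - w1)   # ~0: without the US, xb1 is orthogonal to v'
```

With the US present, the weight grows from trial to trial; with the US removed, the remaining input is (numerically) orthogonal to the output derivative and the weight freezes, mirroring Figure 3C.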
3.1.3 The Basic Neural TD Formalism and Its Implementation. Currently, the most influential formalism in the field of neuronal temporal sequence learning algorithms is the TD formalism. To define it in a neuronal way, we replace the "states" from traditional RL with "time-steps" and assume that rewards can be retrieved at each such time-step. Furthermore, we assume that distant rewards will count less than immediate rewards, introducing a discount factor γ. Then we define the total reward (called the "return") as the discounted sum of all expected future rewards (similar to the formalism in the appendix):

R(t) = r(t + 1) + γ r(t + 2) + γ² r(t + 3) + · · · ,   (3.10)
and this can be rewritten as R(t) = r(t + 1) + γ R(t + 1).
(3.11)
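A quick numeric check of this recursion (the reward sequence below is arbitrary, chosen only for illustration):

```python
# Check that the discounted sum of eq. 3.10 satisfies the recursion of
# eq. 3.11: R(t) = r(t+1) + gamma * R(t+1). Rewards are arbitrary values.
gamma = 0.9
r = [0.0, 1.0, 0.5, 2.0, 0.0, 3.0]   # r[t] is the reward at time-step t

def R(t):
    # R(t) = r(t+1) + gamma*r(t+2) + gamma^2*r(t+3) + ...  (eq. 3.10)
    return sum(gamma ** k * r[t + 1 + k] for k in range(len(r) - t - 1))

print(R(0), r[1] + gamma * R(1))   # both sides of eq. 3.11 agree
```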
Now we define a neuron v such that it will function as a prediction neuron. To this end, we assume that the output v is able to predict the reward. As a consequence, we would hope that R(t + 1) ≈ v(t + 1) and that R(t) ≈ v(t), which yields (compare equation A.5)

v(t) ≈ r(t + 1) + γ v(t + 1).   (3.12)
Since this is only approximately true (until convergence), we can define an error (compare equation A.9) with δ(t) = r(t + 1) + γ v(t + 1) − v(t),
(3.13)
realizing that this error function contains a derivative-like term, v(t + 1) − v(t), which is shifted one step into the future. This results from the fact that updating the value of a state requires visiting at least the next following state. In an artificial neural network such as that depicted in Figure 4, this does not produce problems, but in a rigorous neuronal implementation, one should be careful to avoid this acausality, for example, by applying appropriate delays or a more realistic eligibility trace (see below). Now we can update the synaptic weights with

ωi ← ωi + α δ(t) xi(t),
(3.14)
where xi (t) is the eligibility trace associated with stimulus xi . In an older review, Sutton and Barto (1990) discuss to what degree this specific TD model can explain the experimental observations in classical conditioning, and they show that it surpasses their older approach in many aspects. This shall not be discussed here, though, because we wish to concentrate next on the neuronal aspects of TD learning—its network implementations, its possible neuronal counterparts in the brain, and its synaptic biophysics.
Figure 4: Basic neural implementation of the TD rule. (A) The first three learning steps (#1, #2, #3) performed with the circuit drawn in the shaded inset, which represents what is probably the most basic version of a neuronal TD implementation using a serial compound stimulus representation. The symbol ṽ′ represents a forward-shifted difference operation: ṽ′(t) = v(t + 1) − v(t). Other symbols are explained in the text. (B, C) Same circuit but using x0 as reward signal. (C) The acausal forward shift of the difference operation is avoided by introducing unit delays τ into the learning circuits. This allows calculating the derivative by means of an inhibitory interneuron (black).
The circuit diagram in Figure 4 shows a simple implementation of the TD rule in a neuronal model. It is constructed to relate as closely as possible to the basic neuronal TD formalism introduced above, keeping the number of necessary parameters minimal. The first neuronal models with an architecture similar to Figure 4 were devised by Montague, Dayan, Person, and Sejnowski (1995). The goal of our implementation is to arrive at an output v that reacts after learning to the onset of the CS, denoted as xn, and maintains its activity until the reward terminates. Such an ideally rectangular signal is reminiscent of the response of so-called reward-expectation neurons, which will be shown in section 3.3. To achieve this goal, we represent the CS internally by a chain of n + 1 delayed pulses xi, with a unit delay τ equal to the pulse width. We construct this chain such that the signal x0 coincides with the reward. In some sense, the pulse chain xi represents a special (distributed) kind of eligibility trace; it is often called a serial compound representation (Sutton & Barto, 1990), first used in a neural network architecture by Montague, Dayan, and Sejnowski (1996). The pulse protocols #1, #2, and #3 in Figure 4 show the first three trials for this algorithm, starting with all weights at zero. Thus, in the first trial (#1), the output v is zero, and a positive prediction error δ occurs together with the reward. The weight change is calculated using equation 3.14 with Δωi = δ xi. Thus, in the first trial, only the multiplication of x0 with δ will yield a result different from zero, and only this weight changes. In the next trial, v reacts together with x0; the derivative is shown already time-shifted by one step forward, as demanded above. As a consequence, δ also moves forward (Montague et al., 1996), and now the correlation between x1 and δ yields 1. This process continues in trial #3 (where we show only a few traces), and so on, until the last weight ωn grows.
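The trial-by-trial process just described can be simulated in a few lines. Following the circuit of Figure 4, the difference term is taken as δ(t) = v(t + 1) − v(t) + r(t), with the reward aligned to x0; the chain length, trial length, and pulse times below are illustrative choices.

```python
import numpy as np

# Sketch of the serial-compound TD circuit of Figure 4 (eqs. 3.13 and 3.14).
# The CS is represented by a chain of delayed unit pulses x_n ... x_0, with
# x_0 coinciding with the reward. All sizes and timings are illustrative.
n = 5                          # chain length minus one
T = 30                         # time-steps per trial
alpha, gamma = 1.0, 1.0        # learning rate; undiscounted case
t_r = 20                       # reward time-step; x_0 coincides with it

def run_trial(w):
    x = np.zeros((n + 1, T))
    for i in range(n + 1):
        x[i, t_r - i] = 1.0    # x_i fires i steps before the reward
    r = np.zeros(T); r[t_r] = 1.0
    v = w @ x                  # output v(t), weighted sum of the pulses
    delta = np.zeros(T)
    delta[:-1] = r[:-1] + gamma * v[1:] - v[:-1]    # prediction error
    w = w + alpha * (x * delta).sum(axis=1)         # eq. 3.14, batch per trial
    return w, v

w = np.zeros(n + 1)
for _ in range(n + 2):
    w, v = run_trial(w)
print(w)                       # weights fill in backward, one per trial
```

Each trial recruits the next-earlier weight (first ω0, then ω1, and so on), until all weights equal 1 and the prediction error vanishes wherever an input is active.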
As a result, v becomes a rectangular pulse starting at xn and ending at the end of the reward pulse. Essentially, we have implemented a backward TD(1) algorithm (see the appendix) this way. It is obvious that the special treatment of the reward as an extra line leading into the learning subcircuitry is not really necessary. Figure 4B shows that the input signal x0 can replace the reward signal when setting ω0 = 1 from the beginning. This shows that reward-based architectures are in some cases equivalent to stimulus-substitution architectures, and we are again approaching the original architecture of the old Sutton and Barto (1981) model. Note that the circuit diagram shown in Figures 4A and 4B cannot be implemented with analog (neuronal) hardware because it contains the forward shift of the derivative term denoted by ṽ′. Figure 4C shows how to modify the circuit in order to ensure a causal structure. This requires adding the unit delay τ in the learning subcircuits to all input lines (and the reward, if an extra reward line exists). As a result, the δ signal will appear with the same delay. This modification, however, also allows us to introduce a small circuit for calculating the difference term by means of an inhibitory interneuron. The architecture in Figure 4C is presumably the simplest way to implement
Figure 5: (A, B) Comparison between the two basic models of Sutton and Barto and (C) ISO learning. Inputs = x, eligibility trace = E, reward = r, resonators = h. The symbol v′ denotes the difference operation between subsequent output values in A and B and the differentiation in C. The small amplifier symbol represents a changeable synaptic weight ω.
a TD rule neuronally without violating causality.

3.2 Comparing Correlation-Based and TD Methods. Let us compare the formalism of the old Sutton and Barto (1981) model,

ω1(t + 1) = ω1(t) + γ [v(t) − v(t − 1)] x̄(t),   (3.15)
with the neuronal TD procedure given by ωi ← ωi + α [r(t + 1) + γ v(t + 1) − v(t)] xi (t).
(3.16)
Learning in Sutton and Barto (1981) is correlative, as in Hebbian learning, but it depends on the difference of the output, v′ ≈ v(t) − v(t − 1) (see Figure 5A, differential Hebbian learning).5 Furthermore, since v(t) = Σ_{i≥0} ωi xi

5 Other early differential Hebbian models have been devised by Kosco (1986) and Klopf (1986), who employ a derivative at the CS input to account for its possible inner transient representation.
in equation 3.15, we notice that the "reward" x0 does indeed occur in the stimulus substitution model of Sutton and Barto, but not as an independent term as in the reward-based TD procedure. Reward-based architectures, like TD learning, allow treating the reward as an independent entity, which arises in addition to the sensor signals x (see Figure 5B). Thus, a reward signal could also be an intrinsically generated signal in the brain, which is not necessarily, or only indirectly, related to sensor inputs, as discussed above. In view of neuronal responses recorded in the basal ganglia, this concept has strong appeal. This was one of the reasons that the TD formalism has been adopted for classical conditioning. How does ISO learning (see Figure 5C) compare to these methods? ISO learning uses different eligibility traces than Sutton and Barto (1981) and Sutton (1988) and employs them at all input pathways. Hence, they will influence learning via the output v (more specifically, via v′). In the models of Sutton and Barto (1981) and Sutton (1988), an eligibility trace is applied only at input x1 and influences only the learning circuit but not the output v. Thus, for more than two inputs, we find for ISO learning that v = Σ ωi x̄i, while for TD learning, we have v = Σ ωi xi. If we assume, as before, that x0 is identical to the reward signal and handle the notation concerning the derivatives in a somewhat sloppy way, we can rewrite the weight change in TD learning (see equation 3.16) as

dωk/dt = (x0 + v′) x̄k = (x0 + Σ_{i>0} ωi x′i) x̄k   (3.17)

and that of ISO learning as

dωk/dt = v′ x̄k = (Σ_{i≥0} ωi x̄′i) x̄k.   (3.18)
As opposed to ISO learning, the derivative in TD learning is not applied to the reward, and the output is an unfiltered (no traces) sum of the weighted inputs. These differences prevent TD learning from being orthogonal between inputs and outputs.

3.2.1 Conditions of Convergence. Furthermore, we note that the conditions of convergence differ between ISO and TD learning. Trivially, weight growth of ω1 stops in both algorithms when x1 = 0. Otherwise, weight growth stops in TD learning when r(t + 1) + γ v(t + 1) − v(t) = 0, a condition that requires the output to take a certain value (output condition). The orthogonality between input and output in ISO learning, on the other hand, leads to the situation that dω1/dt = x̄1 v′ = 0 if x0 = 0, which is an input condition. It is interesting to discuss how this would affect animal learning in closed-loop situations with behavioral feedback. Animals cannot measure their "output." All they can do is sense the consequences of an
action, hence sense the consequences of some produced output (behavior) after it has been transmitted back to the animal via the environment (see Figure 7). Thus, immediate output control can be performed only by an external observer, which might, however, misjudge the validity of an action, leading to bad convergence properties of such an algorithm. Alternatively, augmented output control could be performed by the learner after environmental filtering via its input sensors. This, however, might also go wrong if the filtering is not benign. Thus, there is a clear difference between input and output control, as discussed below (see section 4).

3.3 Neuronal Reward Systems. Before we try to embed the different algorithms into "behaving systems," hence into closed-loop architectures, we digress briefly to show why the reward paradigm appears to be so strong in animal learning. So far, we are still at the level of artificial neural network implementations of the TD rule, and we have mentioned only in passing that a neurophysiological motivation for this exists: the dopaminergic reward system. The computational focus of this review does not allow for an in-depth discussion of this topic, but some of the most important facts should be described before we discuss the corresponding modeling approaches in section 4.2. Different response types are found in the mammalian brain that seem to be related to reward processing. We associate a few of them rather directly with the nomenclature of TD learning, noting that this is an oversimplification that is necessary to fit them into the perspective of a theoretician. (For more detailed discussions, see Schultz, 1998, 2002, and Schultz & Dickinson, 2000.) Most important, one finds:

• Prediction-error neurons. These are dopamine neurons (DA neurons) in the pars compacta of the substantia nigra and the medially adjoining ventral tegmental area.
These neurons seem to essentially capture the properties of the δ signal in TD learning (Miller, Sanghera, & German, 1981; Schultz, 1986; Mirenowicz & Schultz, 1994; Hollerman & Schultz, 1998; for reviews, see Schultz, 1998, 2002; Schultz & Dickinson, 2000; but see Redgrave, Prescott, & Gurney, 1999). In addition, they respond to novel, salient stimuli that have attentional and possibly rewarding properties. Before learning, these neurons respond to an unpredicted reward (see Figure 6A). During learning, this response diminishes (see Figure 6D; Ljungberg, Apicella, & Schultz, 1992; Hollerman & Schultz, 1998), and the neurons will now increase their firing in response to the reward-predicting stimulus (see Figure 6B). The more predictable the reward becomes, the less strongly the neuron fires when it appears. The δ signal in TD learning essentially corresponds to δ = reward occurred − reward predicted. Thus, these neurons respond with an inhibitory transient as soon as a predicted reward fails to be delivered ("omission," Figure 6C). Furthermore, these neurons
Figure 6: Neuronal responses in the basal ganglia concerned with reward processing. (A) Response of a dopamine neuron in monkey to the presentation of a reward (r, drop of liquid) without preceding CS. (B) Response of the same neuron after learning that the CS will predict the reward. The neuron now responds to the CS but not to the reward. (C) If the expected reward fails to be delivered, the neuron is inhibited (A–C recompiled from Schultz et al., 1997). (D) Reduction of the novelty response during learning. From top to bottom, the number of trials increases in chunks of five, equivalent to a growing learning experience (D recompiled from Hollerman & Schultz, 1998). (E–J) Response transfer to the earliest reward-predicting stimulus (recompiled from Schultz, 1998). (E) Control situation: presentation of a visual stimulus (light, L) does not lead to a response. (F) Novelty response, situation similar to A. (G) Response to a reward-predicting trigger (Tr) stimulus (similar to B, left part of diagram) and (H) failure to respond to a correctly predicted reward (similar to B, right part of diagram). (I) Response to a newly learned instruction (In) stimulus, which precedes the trigger. (J) In the same way, no response to the correctly predicted reward. (K) Response of a putamen neuron, which gradually increases and maintains its firing after a trigger until the reward is delivered (K recompiled from Hollerman et al., 1998). (L) Population diagram of 68 striatal neurons that show a reward-expectation type of response (L recompiled from Suri & Schultz, 2001).
show response transfer properties: when an earlier reward-predicting stimulus is introduced, the response of the neuron during learning will move forward in time and begin to coincide with the new, earlier stimulus (see Figures 6E–J).

• Reward-expectation neurons. These are found in the striatum, orbitofrontal cortex, and amygdala (Hollerman, Tremblay, & Schultz, 1998; Tremblay, Hollerman, & Schultz, 1998; Tremblay & Schultz, 1999).6 They respond with a prolonged, maintained discharge following a trigger stimulus until the reward is delivered (see Figures 6K and 6L). In this respect, they are similar to the "neuron" in Figure 4. Delayed delivery prolongs the response, while earlier delivery shortens it. Early during learning, these neurons will always start to fire following the trigger stimulus, apparently naively expecting a reward in every trial. Only with experience do responses become specific to the actual reward-predicting trials. In addition, it has been suggested that these neurons fire mainly as a consequence of the motivational value of the reward and less strongly following its actual physical properties. Hence, in trials where rewards are compared, they seem to estimate the relative value of the different rewards with respect to each other instead of reacting to all of them in an absolute manner.

• Goal-directed neurons. These are found in the striatum, the supplementary motor area, and the dorsolateral premotor cortex (Kurata & Wise, 1988; Schultz & Dickinson, 2000). These neurons show an enhanced response prior to an internally planned motor action toward external rewards but in the absence of a triggering stimulus. Some neurons continue to fire until the reward is retrieved; others stop as soon as or just before the motor action is initiated.

Using the terminology from above, one can interpret the first two types of neurons as being concerned with the prediction problem, while the third type seems to be involved in addressing the control problem.
These single-cell data have more recently been augmented by a substantial number of fMRI studies of the human brain, which also support the idea that the ventral striatum and other parts of the brain (e.g., amygdala, nucleus accumbens, orbitofrontal cortex) could be involved in the processing of reward- or expectation-related activity (Schoenbaum, Chiba, & Gallagher, 1998; Nobre, Coull, Frith, & Mesulam, 1999; Delgado, Nystrom, Fissell, Noll, & Fiez, 2000; Elliott, Friston, & Dolan, 2000; Berns, McClure, Pagoni, & Montague, 2001; Breiter, Aharon, Kahneman, Dale, & Shizgal, 2001; Knutson, Fong, Adams, Varner, & Hommer, 2001; O'Doherty, Rolls, Francis, Bowtell, & McGlone, 2001; O'Doherty, Deichman, Critchley, & Dolan, 2002; Pagoni, Zink, Montague, & Berns, 2002), possibly related to TD models (O'Doherty, Dayan, Friston, Critchley, & Dolan, 2003). However, the functional connectivity and the action of dopamine are far more complex than suggested by the above paragraphs. Dopamine neurons ramify very broadly, essentially all over the striatum. As a consequence, every striatal neuron (Lynd-Balta & Haber, 1994; Groves et al., 1995) and large numbers of neurons in superficial and deep layers of the frontal cortex (Berger, Trottier, Verney, Gaspar, & Alvarez, 1988; Williams & Goldman-Rakic, 1993) are contacted by each DA neuron. In addition, the responses of dopaminergic neurons vary to a large degree, and only a small number resemble those discussed here. Thus, the above interpretations of how the different neuron types are concerned with prediction and control may be too coarse, and the DA system may to a large degree also carry gating, motivational, novelty, saliency (Zink, Pagnoni, Martin, Dhamala, & Berns, 2003), or other context-dependent signals. On the other hand, it has been found that dopamine-deficient mutant mice can easily perform reward learning (Cannon & Palmiter, 2003). Evidence that N-methyl-D-aspartate (NMDA) receptors and not dopamine (D2) receptors seem to be involved in specific reward processing (Hauber, Bohn, & Giertler, 2000) points in the same direction. (See also the discussion of the synaptic biophysics of the dopaminergic system in section 5.) The fMRI studies also cannot help resolve these issues, because they have shown that a variety of areas can be involved in reward processing, and some of them may be strongly modulated by attentional effects influencing learning as well (Dayan, Kakade, & Montague, 2000). In addition, the restricted spatiotemporal resolution of fMRI makes it of little help in actually designing neuronal models.

6 These neurons are named according to Schultz and Dickinson (2000). Sometimes they are also called "reward-prediction neurons" (Suri & Schultz, 2001).
This complexity cannot be exhaustively discussed in this article, and we refer readers to the literature (Schultz, 1998; Berridge & Robinson, 1998; Schultz & Dickinson, 2000; Dayan & Balleine, 2002).

4 Closed-Loop Architectures

4.1 Evaluative Feedback by Means of Actor-Critic Architectures. The actions of animals or robots will normally influence their own sensor inputs. Thus, to be able to address the control problem, the algorithms discussed above have to be embedded in closed-loop structures. To this end, so-called actor-critic models, which are strongly related to control theory, have been widely employed (Witten, 1977; Barto, Sutton, & Anderson, 1983; Sutton, 1984; Barto, 1995). Figure 7A shows a conventional feedback control system. A controller provides control signals to a controlled system, which is influenced by disturbances. Feedback allows the controller to adjust its signals. In addition, a set point is defined. In equilibrium (without disturbance), the feedback signal X0 will take the negative value of the set point, which represents the "desired state" of the complete system. In the simplest case (set point = 0), this is zero too. The set point can
Figure 7: Actor-critic architecture (modified from Barto, 1995). (A) Conventional feedback loop controller. (B) Actor-critic control system, where a critic influences action selection by means of a reinforcement signal.
be associated with the control goal of the system; reaching it by means of the feedback could be interpreted as the system having attained homeostasis. Figure 7B shows how to extend this system into an actor-critic architecture. The critic produces evaluative, reinforcement feedback for the actor by observing the consequences of its actions. The critic takes the form of a TD error, which indicates whether things have gone better or worse than expected with the preceding action. Thus, this TD error can be used to evaluate the preceding action: if the error is positive, the tendency to select this action should be strengthened, or else lessened. Thus, actor and critic are adaptive through reinforcement learning. This relates these techniques to advanced feedforward control and feedforward compensation techniques. However, the set point is in this more general context replaced by context information. This indicates that the control goal can now be somewhat softened, which makes these architectures go beyond conventional, advanced model-free feedforward controllers. Many different ways exist to implement actor-critic architectures (see Sutton & Barto, 1998, for an example). They have become especially influential in discussions of animal control, and we note that these specific architectures (see Figure 7) represent regulation problems, to which animal control also belongs (Porr & Wörgötter, 2003c). Other actor-critic architectures can also be designed but shall not be discussed here.
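The interplay between critic and actor in Figure 7B can be sketched in tabular form. The following is a minimal illustration, not one of the neuronal models discussed here; the three-state chain task, learning rates, and random seed are invented purely for the example.

```python
import numpy as np

# Minimal tabular actor-critic sketch of the scheme in Figure 7B: the critic
# maintains state values and emits a TD error that both updates the values
# and reinforces the preceding action. The task (a three-state chain with a
# rewarded goal) and all parameters are illustrative inventions.
rng = np.random.default_rng(0)
n_states, gamma = 3, 0.9
V = np.zeros(n_states + 1)        # critic: state values (terminal state = 0)
pref = np.zeros((n_states, 2))    # actor: action preferences per state

def softmax(p):
    e = np.exp(p - p.max())
    return e / e.sum()

for episode in range(2000):
    s = 0
    while s < n_states:           # episode ends at the terminal state
        a = rng.choice(2, p=softmax(pref[s]))
        s_next = s + 1 if a == 0 else max(s - 1, 0)   # action 0 moves right
        r = 1.0 if s_next == n_states else 0.0        # reward only at the goal
        delta = r + gamma * V[s_next] - V[s]          # critic's TD error
        V[s] += 0.1 * delta                           # critic update
        pref[s, a] += 0.1 * delta    # actor: strengthen action if delta > 0
        s = s_next

print(softmax(pref[0]))   # the goal-directed action dominates after learning
```

The same TD error drives both updates: positive errors strengthen the tendency to repeat the preceding action, negative errors weaken it, exactly the evaluative role assigned to the critic in the text.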
Temporal Sequence, Prediction, and Control
Two aspects are especially important for the discussion later: actor-critic architectures rely on the return maximization principle, which is common to all reinforcement learning paradigms, trying to maximize the expected return by choosing the best actions. Furthermore, they use evaluative feedback from the environment; hence, feedback signals that come from the environment are not value free but are instead labeled reward (positive) or punishment (negative). In sections 7.1 and 7.4, we discuss how these two assumptions may pose problems when considering autonomous creatures.

4.2 Neuronal Actor-Critic Architectures. In general, all neuronal models for prediction and control that have been described in the literature so far follow an actor-critic architecture (Barto, 1995) and focus on the interactions between the basal ganglia and the cortex (Houk, Adams, & Barto, 1995), sometimes including other brain structures as well. Most of the time, the critic (predictor) is implemented in these models in great detail, whereas the actor (controller) is described only rather generally in the earlier studies, with detail added only in recent models. We first compare the different models of the critic and then the actors.

4.2.1 The Critic. The implementation of a TD critic shown in Figure 4C is called a reciprocal architecture because it assumes a reciprocal connection from the central summation neuron, via the neuron that calculates δ, back to the synaptic modification circuit of the central summation neuron (see also Figures 8B and 8C). These types of models capture some of the basic properties of the observed neuronal responses. For example, the δ signal resembles the response of prediction-error neurons, showing the properties of response transfer and omission, while the v signal is to some degree similar to that of the reward-expectation neurons. Figure 8 shows a simplified circuit diagram of the basal ganglia, together with its most important inputs and outputs.
We will now compare the existing models to this diagram. (For an in-depth treatment of this topic, see Daw, 2003.)

• Parallel reciprocal architectures; Houk's model. The first attempt to match the abstract architecture of a critic to the structure of cortex and basal ganglia was made by Houk et al. (1995; see Figure 8B). So-called striosomal modules fulfill the functions of the adaptive critic. (Striosomal modules consist of the striatal striosomes, the subthalamic nucleus, and the DA neurons of the substantia nigra pars compacta; they are also called the limbic striatum.) The prediction-error (δ) characteristics (see equation 3.13) of the DA neurons of the critic are generated by (1) equating the reward r with excitatory input from the lateral hypothalamus, (2) equating the term v(t) with indirect excitation at the DA neurons, which is initiated from striatal striosomes and channeled through the subthalamic nucleus onto the DA neurons, and (3) equating the term
Figure 8: Matching TD learning to the basal ganglia. (A) Schematic wiring diagram of the basal ganglia showing its main inputs and outputs. VP = ventral pallidum, SNr = substantia nigra pars reticulata, SNc = substantia nigra pars compacta, GPi = globus pallidus pars interna, GPe = globus pallidus pars externa, VTA = ventral tegmental area, RRA = retrorubral area, STN = subthalamic nucleus. (B–F) Simplified circuit diagrams redrawn from the approaches of different groups adopting the same structure to make them comparable. Cortex = C, striatum = S, DA = dopamine system, PFC = prefrontal cortex, VS = ventral striatum, GP = globus pallidus, r = reward. B and C represent parallel-reciprocal architectures where both input streams to the DA system arise in parallel from the striatum, which in turn receives DA signals in a reciprocal way. Accordingly, D is a divergent (nonparallel) reciprocal architecture where the input to the DA system originates in the cortex via two separate striato-nigral pathways. The architecture in E is parallel, nonreciprocal, where the cortex is not explicitly mentioned, and that in F is divergent, nonreciprocal. Dashed lines with bullets denote a dopamine synapse. They end close to another (glutamatergic) synapse, which is the one that is modified.
v(t − 1) with direct, long-lasting inhibition from striatal striosomes onto the DA neurons. This principle, depicted in Figure 8B, is a variation of Figure 4C, where we used an interneuron to create the subtractive term; the long-lasting inhibitory component used by Houk et al. (1995) to calculate the subtractive v(t − 1) term subserves the same purpose. Is this model supported by anatomy, and will it produce the correct neuronal responses? First, this comes down to the central question of whether a reciprocal architecture is supported by the connectivity between the striatum and the DA system. Many models, like Houk's, make the even more specific assumption that the striosomes contain the main part of the reward-predicting neurons in the striatum, with reciprocal connections (Houk et al., 1995; Brown, Bullock, & Grossberg, 1999; Contreras-Vidal & Schultz, 1999). This assumption is supported by the work of Gerfen (Gerfen, 1984, 1985, 1992; Gerfen, Herkenham, & Thibault, 1987). These studies in rats have shown that there is a reciprocal connection between the striosomes of the dorsal striatum and a small group of DA neurons of the SNc and SNr. In other species, anatomical evidence also supports the notion that at least weak reciprocal connections between the two structures exist (Haber, Fudge, & McFarland, 2000; Joel & Weiner, 2000). There is, however, little support for the notion that this reciprocity is subcompartmentalized, because it does not seem to be the case that mainly the striosomal neurons take part in a reciprocal connection pattern (Joel & Weiner, 2000). Rather, it seems that such connections can exist for striosomal and matrisomal neurons of the striatum in a similar way.

• Comparing Houk's model to idealized parallel reciprocal architectures. Figure 8C shows the idealized reciprocal architecture required to implement a TD rule by means of interactions between cortex (C) and striatum (S).
This architecture assumes a direct excitatory pathway and an indirect inhibitory pathway. In reality, the situation is reversed (Bunney, Chiodo, & Grace, 1991; Pucak & Grace, 1994; Haber et al., 2000; Joel & Weiner, 2000), and we observe as a second problem that the direct action of striatal activity onto DA neurons is inhibitory: excitation arises only indirectly as the consequence of disynaptic disinhibition (ventral striatal inhibition of GABAergic neurons in the ventral pallidum, which projects to most of the DA system; see Figure 8A). Thus, we would obtain a sign-inverted situation, with −v(t) and +v(t − 1), which is not in accordance with the TD rule. In conclusion, there is no support for implementing a reciprocal critic architecture relying on striosomal neurons only. However, even when relying on the entire population of striatal neurons (including the matrisomal neurons), we are still faced with the sign-inversion problem (Joel, Niv, & Ruppin, 2002). To avoid this, models must make specific
assumptions about the time courses of the involved signals. One possible solution to this problem might lie in the different, sometimes very slow, time courses of the chemical compounds involved in the second-messenger chains leading to dopamine-related synaptic modification (Houk et al., 1995). We discuss this in section 5. From a functional point of view, we find that the long-lasting inhibitory component used to calculate the subtractive v(t − 1) term in Houk's approach leads to the situation that this model cannot account for the precise timing of the depression during omission of a predicted reward (see Figure 6). Other timing problems arise as a consequence of the fact that this model does not implement any kind of serial compound stimulus representation. This was first done by Montague et al. (1995, 1996) in the context of a reciprocal architecture and subsequently in several models by Suri and coworkers (Suri & Schultz, 1998, 1999, 2001; Suri, Bargas, & Arbib, 2001). Suri et al. do not make an explicit link to anatomy but seem to imply that they are essentially following the approach of Houk. On closer inspection, however, their models are more strongly related to the idealized reciprocal architecture (see Figure 8C) than to Houk's model (see Figure 8B).

• Divergent reciprocal architectures. So far, we have discussed what we would call parallel-reciprocal architectures (see the legend of Figures 8B and 8C). These models assume that two parallel pathways exist from the striatum to the DA system. An alternative source for these two streams is the limbic prefrontal cortex (PFC). Indeed, evidence exists that the PFC projects directly to the DA system (reviewed in Overton & Clark, 1997) and that an additional projection exists to the ventral striatum (Groenewegen, Berendse, Wolters, & Lohman, 1990; Parent, 1990). Via this pathway, delayed inhibition could be provided to the DA system.
This is supported by findings that responses in the ventral striatum show reward-expectation activity (Schultz, Apicella, Scarnati, & Ljungberg, 1992). A model that uses such a divergent-reciprocal architecture was devised by Brown, Bullock, and Grossberg (1999; see Figure 8D). This model uses a special kind of serial compound stimulus representation by assuming a spectral timing approach (Grossberg & Schmajuk, 1989; Grossberg, 1995; Grossberg & Merrill, 1996): a set of bandpass-filtered pulses (resembling an EPSP in shape) with different onset times that cover the whole interstimulus interval between CS and US. In their model, these pulses represent the suprathreshold internal Ca2+ concentration. The cortex provides excitatory input to the striosomes, which project inhibitorily to the DA system (see Figure 8D). At the same time, the cortex also projects to the ventral striatum. This signal is transferred via several stages to the DA system, where it finally exerts an excitatory action. Cortico-striatal synapses are modified in
the striosomes as well as in the ventral striatum. This architecture still fulfills the basic properties of a TD architecture; the nested, multistage computation and the synaptic modification rules used, however, no longer provide a direct link to the TD rule. The model reproduces most of the known experimental data. This is mainly a virtue of the different timing properties along the two legs of the divergent input pathways to the DA system. This additional degree of freedom avoids several of the timing problems that had been found in the older models (see the discussion in Brown et al., 1999).

• Parallel nonreciprocal architectures. Berns and Sejnowski (1998) have devised a parallel nonreciprocal architecture (see Figure 8E) that in essence resembles the model of Houk et al. (see Figure 8B) but implements several pathways, following the known anatomical structures more accurately than any of the other models. This architecture is called nonreciprocal because the weight modification does not take place in the striatum. In the model of Berns and Sejnowski (1998), the weights of the STN→GP connection as well as those of the striato-nigral connection (S→DA) are modified, for which there is currently no direct evidence. The learning rule used is a three-factor Hebbian learning rule: two factors come from the pre- and postsynaptic activity of the concerned connection; the third is a so-called error term e, calculated from the difference between the activities of the GP→DA and the S→DA connections. This error term is used as the reinforcement signal. Thus, this model does not use any primary reinforcement (reward r), and cortical input is also not explicitly mentioned.

• Divergent nonreciprocal architectures. The model of Contreras-Vidal and Schultz (1999; see Figure 8F) deviates most strongly from the traditional TD rule. As in Berns and Sejnowski (1998), it assumes that the striato-nigral and not the cortico-striatal synapses are modified by means of the DA activity.
In general, their model focuses on the action of the prefrontal cortex, which provides indirect input to the striosomes and from there to the DA system. Thus, this model represents a divergent nonreciprocal architecture. The model is fairly opaque, and it seems as if this input line leads to net excitation at the DA system, which is opposite to the assumption of the other models. Accordingly, the second pathway from the prefrontal cortex to the DA system acts inhibitorily in their scheme (divergent architecture). In addition, they have implemented an adaptive resonance network (ART-2; Carpenter & Grossberg, 1987) and use it to model an attentional and orienting subsystem.

• Problems with serial compound representations. All of the more recent models use some kind of serial compound stimulus representation to cover the long temporal intervals between the stimuli. The problem of
a straightforward serial compound representation, like the one used in Figure 4, is that it predicts a gradual shift of the δ signal forward in time during learning (see steps 1–3 in Figure 4; Schultz, Dayan, & Montague, 1997). This, however, is in general not observed in real recordings, where the response of the DA neurons remains in place and gradually diminishes during learning (see Figure 6D) while the response to the predictive stimulus increases. Furthermore, the model in Figure 4 will also not produce novelty responses to new and salient stimuli, which are very often observed in DA neurons. Accordingly, more recent models were designed to solve these problems by means of a more elaborate serial compound stimulus representation. For example, Suri and Schultz (1998, 1999) use a representation where only one delay of 100 ms is used between the first and all other xi, but where all xi, i > 1, are stretched in time with increasing duration and decreasing amplitude. With this architecture, they were able to reproduce the properties of real DA neurons (prediction-error neurons) more accurately, including novelty responses, while avoiding the forward shift of the δ signal. In this study, they found that only the one specific weight grows that represents the temporal interval between prediction and reward. As a consequence, these models can learn only a single interstimulus interval correctly; with other interstimulus intervals, the model will produce incorrect responses. In their later models, Suri and Schultz (2001) and Suri et al. (2001) return more closely to the architecture in Figure 4C. In addition to the serial compound representation of the stimulus, they also employ an analog eligibility trace at every xi, which they report speeds up learning and also leads to the specific shapes of the output responses.
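The behavior of a straightforward (complete) serial compound representation under the TD rule can be reproduced in a few lines. This is a sketch with illustrative parameters (10 delay-line features bridging CS and US, γ = 1, δ(t) = r(t) + v(t) − v(t − 1)); it demonstrates the two textbook properties mentioned above: after learning, the δ response has transferred from the reward time to the CS, and omitting the reward produces a precisely timed depression.

```python
import numpy as np

T, CS, US, n = 20, 2, 12, 10   # trial length, CS and reward times, no. of features
alpha = 0.5
w = np.zeros(n)                # one weight per serial compound feature

def feature(t):
    """Tapped delay line: feature i is active only at time CS + i."""
    i = t - CS
    return i if 0 <= i < n else None

def run_trial(w, reward=True, learn=True):
    delta = np.zeros(T)
    for t in range(1, T):
        r = 1.0 if (reward and t == US) else 0.0
        v_now = w[feature(t)] if feature(t) is not None else 0.0
        v_prev = w[feature(t - 1)] if feature(t - 1) is not None else 0.0
        delta[t] = r + v_now - v_prev          # TD error, gamma = 1
        if learn and feature(t - 1) is not None:
            w[feature(t - 1)] += alpha * delta[t]
    return delta

for _ in range(500):
    run_trial(w)

d = run_trial(w, learn=False)                      # delta now peaks at the CS
d_omit = run_trial(w, reward=False, learn=False)   # dip at the omitted reward time
```

During training, the δ peak moves gradually backward from the US toward the CS (the shift that is not seen in DA recordings); at convergence, d[CS] ≈ 1, d[US] ≈ 0, and d_omit[US] ≈ −1.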
Suri and colleagues do not show responses of prediction-error DA neurons (δ signals) during learning, and the complexity of this model is such that it is not immediately evident whether it will again produce some kind of forward shift of the δ signal. One study (Suri & Schultz, 2001) focuses on the responses of reward-expectation neurons in the putamen and the orbitofrontal cortex. The other study implements a so-called extended TD model, which not only receives external stimulus input but is also internally driven by thalamic activity and by action-related signals. Thus, Suri et al. (2001) make serious attempts to implement an actor. This aspect is discussed in the next section. There is evidence that neurons in the striatum produce varying response latencies and durations (Schultz, 1998, 2002; Schultz & Dickinson, 2000) that could support the idea of a serial compound representation. Furthermore, the different models discussed so far can reproduce a wealth of physiological findings. Still, the valid parameter ranges of the existing models appear rather narrow, and the models have to be specifically tuned to reproduce the different response properties of the neurons. In addition, the wiring pattern for the serial compound representation has to be rather specifically set up, and the unknown delay between CS and US requires a large number of neurons in the serial compound representation of the xi. In general, one finds only a few shared principles in the different models. This concerns structure, where sometimes strongly differing anatomical assumptions are found between the models, as well as function, where different learning rules are used. Reciprocal architectures (see Figures 8B, 8C, and 8D), which can be used to implement a neuronal version of the TD rule, are possibly supported by the connectivity of the basal ganglia. The assumption that such reciprocal connections are restricted to the striosomes (which are used for the critic) does not seem to be valid, though. Idealized reciprocal architectures (see Figure 8C), which are essentially used in the models of Montague and Suri, are not directly supported by anatomy because of the sign-inversion problem discussed above. From a functional point of view, accurate reproduction of the experimental data requires a fairly complex serial compound representation of the stimulus. This assumption, too, may prove to be too inflexible. Divergent architectures (see Figures 8D and 8F) and nonreciprocal architectures (see Figures 8E and 8F) are not necessarily related to the TD rule anymore. The divergent architectures especially offer additional degrees of freedom in adjusting the timing along the different input lines to the DA system. A serial compound representation is also needed in these models, and the question arises whether the spectral timing approach of Brown et al. (1999) would be more flexible than the more conventional setup used in the other models.

4.2.2 The Actor. In general, less attention has been paid so far to the implementation of the actor.
This may have to do with the fact that there are only a few single-cell recordings available that could be matched to the activity of actor cells (goal-direction neurons; see section 3.3), as opposed to the critic, for which a wealth of data exists. The performance of the actor is thus judged by observing the resulting actions ("behavior"). The study of Houk et al. (1995) was also the first to suggest how to design an actor. The matrix modules (consisting of the striatal matrix, subthalamic nucleus, globus pallidus, thalamus, and frontal cortex; they are called the sensorimotor striatum) integrate the information from the striosomal modules (see above) and from cortical inputs in order to create output to the frontal cortex, relayed through the thalamus. In contrast to their detailed discussion of the critic, Houk et al. (1995) provide only a rather general scheme for the implementation of the actor. According to their model, matrix modules generate signals that command actions or represent plans that organize other systems to generate the actual commands. Montague and colleagues (Montague et al., 1995, 1996) implement decision units as the actor in their models. This is not meant to be related to basal ganglia anatomy; rather, it represents an abstract binary decision network. The TD error δ is used to influence the decision probability and to adjust the weights at the decision units appropriately during learning. This architecture allows only for binary decisions. In their 1999 model, Suri and Schultz also model a control task with an actor-critic architecture. In this model, the actor consists of one layer of neurons, each of which represents a certain action. It learns stimulus-action pairs based on the prediction-error signal provided by the critic. A winner-take-all rule, implemented by means of lateral inhibition between the units, ensures that only one action is selected at a given time. In combination with the more sophisticated critic (discussed above), this still rather simple actor model was able to solve several relatively complex behavioral tasks. All of these actor models were still rather simplistic. In a recent model, Suri et al. (2001) made serious attempts to implement a more detailed actor, trying to match it to anatomy as well. As in Houk et al. (1995), the actor uses the striatal matrisomes. These provide direct inhibition to the GPi/SNr complex (compare Figure 8A) and indirect excitation via the GPe and the STN. The GPi/SNr complex projects inhibitorily to the thalamus and from there excitatorily to the cortex, which elicits the actions. The DA system provides input to the matrisomal neurons of the striatum. It exerts an effect on the membrane potential of these cells but also on the weight ω of the corticostriatal synapses, essentially with ∆ω = DA × Pre × Post, comprising a three-factor learning rule that relies on the activity of the DA system (DA) as well as the pre- and postsynaptic activity at the striatal cell (Miller et al., 1981; Schultz, 1998). The membrane activity of these cells follows the physiological observation that they can be in a depolarized "up" and a hyperpolarized "down" state. (We refer readers to section 5 for a more detailed description of these states.)
Here, it seems that this property can improve reaction times as well as learning, compared to a model version without such membrane states. As in their older model, each actor neuron represents a single action, and actions are again selected by a winner-take-all mechanism, such that the winning unit's inhibition disinhibits the thalamus. As a consequence, the number of possible actions is limited to the number of actor neurons. The use of internal signals in the extended TD model of the critic of Suri et al. (2001) means that this model can now also perform some kind of planning. Planning refers to the behavioral strategy by which an action can be selected without primary reinforcement, purely as the consequence of having memorized previously experienced situation-action pairs. For example, if a reward has always occurred at the right side of a T-maze, the animal after learning will turn right without any external stimulation. Suri et al. (2001) achieve this by means of internal signals that enter the extended TD model and from there the actor neurons. One could say that these internal signals form an internal, explicit representation of a chain of events that can be used for learning and acting.
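The two core ingredients of such an actor, winner-take-all selection through lateral inhibition and the three-factor rule ∆ω = DA × Pre × Post, can be sketched as follows. This is a deliberately simplified stand-in, not the full Suri et al. model; the dimensions and the learning rate μ are illustrative.

```python
import numpy as np

def winner_take_all(activity):
    """Caricature of lateral inhibition: only the most active unit stays on."""
    out = np.zeros_like(activity)
    out[np.argmax(activity)] = 1.0
    return out

def three_factor_update(w, pre, post, da, mu=0.1):
    """Delta-omega = mu * DA * pre * post: a synapse changes only where
    pre- and postsynaptic activity coincide AND dopamine is present."""
    return w + mu * da * np.outer(post, pre)

# usage sketch: a cortical input pattern drives striatal actor units via w
w = np.full((3, 4), 0.25)              # 3 actor units, 4 cortical inputs
pre = np.array([1.0, 0.0, 1.0, 0.0])   # active cortical inputs
post = winner_take_all(w @ pre)        # lateral inhibition selects one action
w_new = three_factor_update(w, pre, post, da=1.0)
```

With da = 0 the weights are untouched regardless of activity, which is the defining property of the three-factor rule: the dopamine signal gates the Hebbian term.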
Figure 9: Correlation-based control architecture, where the control signal is derived from the correlations between two temporally related input signals.
The performance of this model is rich and reproduces many existing data sets. However, its architecture is fairly advanced, and as a consequence it contains many free parameters, few of which have direct physiological support. The assumption that striosomes form part of the critic while matrisomes belong to the actor is not supported by the anatomy of the basal ganglia, as discussed above. It is therefore questionable to employ two different learning rules (TD rule versus three-factor rule) at these two sets of neurons. From a functional point of view, up- and down-membrane states are also found across all medium spiny striatal neurons (for a review, see Nicola, Surmeier, & Malenka, 2000) and not only in the matrix neurons. From a more conceptual point of view, two aspects may be problematic in the currently existing actor models. First, the number of possible actions is matched to the number of actor neurons. Thus, these neurons represent some kind of "grandmother cell" concept, where only a rather limited number of discrete decisions is possible. This may work in restricted-choice lab situations; the complexity of real-world situations, however, requires a different type of action network. Second, in general, the basal ganglia include many indirect pathways, which often amount to disinhibition. Disinhibition, however, normally acts permissively or in a facilitatory manner and cannot directly be equated with excitation. Specific motor commands, on the other hand, require the concerted action of many specific and sometimes temporally rather finely tuned excitations. We believe that the release of inhibition at the thalamus performed by the actor pathway in, for example, the model of Suri et al. (2001) can at best gate such motor actions. Thus, action selection by means of such a gating process seems too crude a model for animal (motor) control.

4.3 Nonevaluative Correlation-Based Control.
We have stated that actor-critic architectures follow the return maximization principle by means of evaluative feedback from the environment. Figure 9 suggests a schematic architecture that accommodates a different approach: nonevaluative feedback and disturbance minimization. This architecture utilizes the basic feed-
back loop controller from Figure 7A, but it assumes that the environment will, in a temporal sequence learning situation, provide temporally correlated signals about upcoming events, like those mentioned in the introduction (e.g., smell predicts taste). This architecture follows the learning goal: learn to keep the later signal (x0), whatever it is, minimal by employing the earlier signal (x1) to elicit an appropriate action. In conventional actor-critic architectures, the critic provides an evaluation of the action (e.g., "good" or "bad"), and this evaluation influences future action selections. Such evaluations are, however, always subjective. In correlation-based control, the situation is fundamentally different. Here, the system relies on the objective difference between "early" and "late," which arises from the structure of the input signals. Evaluations do not take place at this point. Instead, the "re-"action of the feedback control loop in response to x0, be it an attraction or a repulsion reaction, will be shifted forward in time so that it now occurs earlier, in response to x1. Thus, in this system, evaluations do not take place during learning; instead, they are implicitly built into the (sign of the) reaction behavior of the inner x0 loop: repulsion or attraction. As a consequence, critic and actor are not necessarily separate entities anymore and can be merged into the same architectural building block. Furthermore, we note that this control strategy does not seek to maximize returns. On a complex decision topology, such maxima (or even near-maximal regions) may be hard to find by means of RL. Disturbance minimization, on the other hand, offers the advantage that the associated regions will almost always be much bigger, promising better convergence properties.

4.3.1 ISO Control: Merging Critic and Actor. In this section, we describe how ISO learning can be used to implement the correlation-based control introduced in an abstract way in the previous section.
The central assumption of ISO control is that any control system should start with a stable negative feedback loop (see Figure 9), for example, a reflex loop. Feedback controllers, however, suffer from a major disadvantage: they will always react only after a disturbance has taken place (see the inner loop in Figure 10A). Thus, the desired state (e.g., x0 = 0; see also Figure 9) cannot be maintained all the time. In other words, disturbances will not yet be minimal when employing feedback control. ISO control can improve on this if a temporal correlation exists between the primary disturbance and some other earlier-occurring signal (denoted by the delay τ between the inner and the outer loop in Figure 10). The ISO learning algorithm allows for learning this correlation, and as a result, the primary reflex reaction will be “shifted forward” in time, now occurring earlier, that is, before the primary reflex would have been triggered. Thus, if learning is successful, the primary reflex will be fully avoided, and disturbances are now minimal (ideally zero). Note that in the architecture shown in Figure 10A, critic and actor are no longer separate (compare Figure 7B with Figure 9). There is indeed recent experimental evidence that neurons exist where behavior and reward (actor and critic) are
Figure 10: Applying ISO learning in a control task. (A) This architecture is reminiscent of an actor-critic architecture (see Figures 7 and 9), but here the system does not use evaluative feedback ("rewards") from the environment. Instead, it relies only on correlations between the inputs. Hence, it is assumed that the "organism" receives temporally correlated inputs, where x1 arrives earlier (e.g., a signal from a range finder reflected from an obstacle) and x0 arrives later (e.g., a signal from a touch sensor triggered at the moment of touching the obstacle). P0, P1 denote environmental transfer functions, D a signal ("disturbance") that arrives at the inputs: undelayed at x1 and with delay τ at x0. Other symbols are as in Figure 3. Here we have also implemented a filter bank of 10 filters with different frequencies, all driven by the same input x1, creating something like a serial compound representation. This was done purely to speed up learning and to create smoother output signals. Note that a filter bank approach does not destroy orthogonality, and weights will still self-stabilize (see Figure 3D). After successful learning, the output V will fully compensate the disturbance D at the summation node of the inner loop, leading to x0 = 0, which is equivalent to a functional elimination of the inner loop. The system has learned the inverse controller of the inner loop (Porr, von Ferber, & Wörgötter, 2003). (B, C) Trajectory of a real robot early (B) and late (C) during learning in an arena with three obstacles (boxes). Collisions are denoted by the small circles (forward = solid, backward = dashed). Only forward collisions can be used for learning. In such an environment, the robot never needed more than 12 forward collisions to learn the task. This way, it is as fast as the best RL algorithms, which require sophisticated credit structuring and temporal assignment mechanisms (Touzet, 1999; Touzet, pers. communication, August 2003).
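The learning scheme of Figure 10A can be condensed to a few lines. This is only a sketch: a single leaky trace stands in for the resonator filter bank, all parameters (τ, μ, pulse times) are illustrative, and unlike ISO learning's bandpass filters, this stand-in does not guarantee self-stabilizing weights. The differential Hebbian rule changes the predictive weight in proportion to its filtered input times the temporal derivative of the output, dω1/dt = μ u1 dv/dt.

```python
import numpy as np

T, tau, mu = 200, 20.0, 0.05

def trace(x):
    """Leaky integrator standing in for the bandpass filters h (an assumption)."""
    u = np.zeros(len(x))
    for t in range(1, len(x)):
        u[t] = u[t - 1] * (1.0 - 1.0 / tau) + x[t]
    return u

x1 = np.zeros(T); x1[20] = 1.0   # early signal (e.g., range finder)
x0 = np.zeros(T); x0[40] = 1.0   # late signal (e.g., touch sensor, reflex trigger)
u1, u0 = trace(x1), trace(x0)

w0, w1 = 1.0, 0.0                # reflex weight fixed at 1; predictive weight learns
for trial in range(20):
    v_prev = 0.0
    for t in range(T):
        v = w0 * u0[t] + w1 * u1[t]
        w1 += mu * u1[t] * (v - v_prev)   # d omega1/dt = mu * u1 * dv/dt
        v_prev = v
```

Because the trace of x1 is still positive when the reflex input x0 drives v upward, the correlation is positive and w1 grows; the early signal thereby acquires the ability to drive the output before the reflex fires.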
more closely linked (Hollerman & Schultz, 1998; Kawagoe, Takikawa, & Hikosaka, 1998; Hassani, Cromwell, & Schultz, 2001). This principle has been employed in several real-robot experiments (see Figures 10B and 10C; http://www.cn.stir.ac.uk/predictor). We simulate touch and range-finder signals. Before learning, the simulated robot will perform a built-in retraction reaction when touching an obstacle (primary reflex reaction). All weights ωk are initially zero except the weights that belong to the touch sensor inputs, which we set to one. Thus, at this stage, the output is just the signal v = x̄0, where x̄0 is the bandpass-filtered touch sensor input, x̄0 = h0 ∗ x0. This signal is sent sign-inverted (negative feedback), but otherwise unaltered, to the motors,7 which leads to a retraction reaction. The range finders provide the necessary earlier signal because they respond before the touch sensor is triggered. ISO learning learns this correlation. After learning, the output is v = Σk ωk x̄k with k ≥ 1, because the touch sensors (x0) are no longer triggered. This signal will now, in the same way as before but earlier, lead to a retraction reaction, and the primary reflex will be avoided. Interestingly, the principle of disturbance minimization by reflex avoidance can be employed in the same way to learn a food-retrieval task (see Porr & Wörgötter, 2003b). Here, a behavior emerges that looks like reward retrieval (or return maximization) but actually follows the disturbance-minimization principle.

5 Biophysics of Synaptic Plasticity in the Striatum

The more recent of the discussed models of the basal ganglia begin to make rather specific assumptions about cellular and subcellular mechanisms. The striatum is currently the best-understood substrate concerning the interactions between glutamatergic (Glu) and dopaminergic (DA) synapses. Such interactions also take place in many other brain structures but shall not be discussed here.
7 This description is slightly simplified, because we employ steering and accelerating control, and thus two sets of neurons. The correct cross-wiring is described in Porr and Wörgötter (2003a). Of importance here is that ISO control essentially works without any signal postprocessing or conditioning.

Furthermore, we will restrict the discussion to facts that are now fairly generally acknowledged. This is meant to facilitate the computational perspective and should allow us to design restricted but at least valid models without too many open degrees of freedom.

5.1 Basic Observations. Figure 11 gives a summary of the different types of synaptic modifications at corticostriatal glutamatergic synapses targeting a spiny projection neuron in the striatum.

Temporal Sequence, Prediction, and Control

[Figure 11 schematic: the nigrostriatal ("DA") pathway releases DA and the corticostriatal ("pre") pathway releases Glu onto a medium-sized spiny projection neuron in the striatum ("post").]

Row  pre  post  DA   Result
1    x    0     0    0
2    0    x     0    0
3    0    0     x    0
4    x    x     0    LTD (dl), LTP (dm)
5    x    0     x    0 *
6    0    x     x    0
7    x    x     x    LTD (DA tonic), LTP (DA phasic)

Figure 11: Effects of the different activation protocols on the plasticity of corticostriatal synapses. The small x in the table denotes an active influence. Only paired pre- and postsynaptic activation, without (row 4) or with (row 7) DA activation, will lead to synaptic plasticity. For further explanations, see the text. References for the table: Row 1: Calabresi et al., 1992a; Choi & Lovinger, 1997; Calabresi, Centonze, Gubellini, Marfia, & Bernardi, 1999. Row 2: Calabresi et al., 1992b; Choi & Lovinger, 1997; Calabresi et al., 1999. Row 3: Calabresi et al., 1987, 1992b; Umemiya & Raymond, 1997. Row 4 (LTD): Calabresi et al., 1992b; Lovinger, Tyler, & Merritt, 1993; Walsh, 1993; Wickens et al., 1996. Row 4 (LTP and mixed effects): Charpier & Deniau, 1997; Charpier, Mahon, & Deniau, 1999; Akopian, Musleh, Smith, & Walsh, 2000; Partridge et al., 2000; Spencer & Murphy, 2000. Row 5 (nucleus accumbens): Pennartz, Ameerun, Groenewegen, & Lopes da Silva, 1993. Row 5 (striatum): Calabresi et al., 1999. Row 6: Calabresi et al., 1999. Row 7 (LTP with Mg2+ removed): Calabresi et al., 1992b; Walsh & Dunia, 1993. Row 7 (LTD under normal conditions, i.e., with Mg2+): Calabresi et al., 1992a, 1992b; Tang et al., 2001. Row 7 (LTP with pulsed DA application): Wickens et al., 1996.

Three possible influences can in principle affect the synapse: (1) presynaptic stimulation of the corticostriatal pathway, (2) postsynaptic activation of the projection neuron, and (3) activation of the nigrostriatal, dopaminergic pathway. By itself, none of the three influences can affect the Glu synapse (rows 1-3 of the table; for literature references, see the figure legend). If, however, presynaptic corticostriatal stimulation is paired with postsynaptic activity, early studies unequivocally reported that LTD occurs at the Glu synapse. More recently, these observations have been augmented by findings showing that LTP occurs mainly in the dorsomedial (dm) part of the striatum, whereas LTD is found in the dorsolateral (dl) part, following the gradient of D2-like receptor density (Joyce & Marshall, 1987; Russell, Allin, Lamm, & Taljaard, 1992). In addition, the expression of LTP or LTD will also depend on the applied stimulation protocol (for a review, see Reynolds & Wickens, 2002). In general, the tendency for LTD seems to be stronger than that for LTP, as indicated by the different font sizes in the original figure. Failing to pair pre- and postsynaptic activation, on the other
hand, will induce neither LTP nor LTD, regardless of the activation state of the dopaminergic pathway. The most robust protocol to induce LTP or LTD consists of the stimulation of all three influences. It was found that tonic, long-lasting activation of the DA pathway will induce LTD (Calabresi, Maj, Mercuri, & Bernardi, 1992a; Calabresi, Maj, Pisani, Mercuri, & Bernardi, 1992b; Tang, Low, Grandy, & Lovinger, 2001), whereas phasic, pulse-like activation, which leads to a less strong D1-like receptor desensitization (Memo, Lovenberg, & Hanbauer, 1982), will lead to LTP (Wickens, Begg, & Arbuthnott, 1996). Accordingly, this has been called a "three-factor" learning rule (Miller et al., 1981; Schultz, 1998). This terminology, however, may be misleading. Dopamine is a potent modulator of synaptic plasticity, but it arguably should not enter the equation in a multiplicative way: the conjoint action of pre- and postsynaptic influences (row 4 of the table) suggests that two factors already suffice for synaptic modification, and in this case we would ideally set dopamine = 0, which under a multiplicative rule would abolish all plasticity. It is, however, currently a matter of debate whether the dopamine concentration indeed approaches zero under these experimental conditions, because traces of DA can be detected by high-pressure liquid chromatography (HPLC) following high-frequency stimulation of the corticostriatal pathway (Calabresi et al., 1995). Thus, it is still conceivable that under physiological conditions, the lack of a DA signal would prevent LTP or LTD altogether, because chronic, near-total depletion of DA prevents LTD (Calabresi et al., 1992a) and LTP (Centonze et al., 1999) (chronic denervation protocols). As a consequence, the validity of a three-factor rule cannot be conclusively ruled out.

5.2 Intracellular Processes. Figure 12 shows a summary of the intracellular actions that take place at DA and Glu synapses as far as they are understood today.
The diagram is to some degree simplified; more detailed accounts can be found in Greengard, Allen, and Nairn (1999) and Centonze, Picconi, Gubellini, Bernardi, and Calabresi (2001). Both LTP and LTD are induced in the striatum by repetitive stimulation of the corticostriatal fibers, an event that produces massive release of both glutamate and dopamine. In the striatum, dopamine acts mainly via two receptor subtypes (D1-like and D2-like; Sibley & Monsma, 1992). We consider here only the action of glutamate onto α-amino-3-hydroxy-5-methyl-4-isoxazolepropionate (AMPA) and N-methyl-D-aspartate (NMDA) receptors. The more modulatory action of metabotropic glutamate receptors shall not be discussed. Synaptic strength, measured, for example, by the size and slope of EPSPs, can partly be associated with the number of phosphorylated AMPA and NMDA receptors (p-AMPA and p-NMDA) that are integrated into the membrane, while dephosphorylated receptors are not active. In general, one observes that LTP occurs only when NMDA channels are active (Calabresi, Pisani, Mercuri, & Bernardi, 1996; Yamamoto et al., 1999; Partridge, Tang, & Lovinger, 2000), which in experimental in vitro conditions can be achieved by removing Mg2+ from the medium (Calabresi et al., 1992c; Calabresi, Pisani, Mercuri,
[Figure 12 (panels A-C) diagrams the signaling at a striatal spiny neuron: glutamate acting on AMPA and NMDA receptors, dopamine acting on D1-like and D2-like receptors (panel A also showing the NO/cGMP/PKG route), with the intracellular players Ca2+, CaM, CaMKII, cAMP, PKA, PP-1, PP-2B, L-type Ca2+ channels, and DARPP32 phosphorylated at Thr34 (T34) or Thr75 (T75).]

Figure 12: Biophysics of striatal synapses (see the text).
& Bernardi, 1996; Centonze et al., 1999). In vivo, this would require a depolarized state of the neuron, where the Mg2+ block at the NMDA channels is lessened or removed. LTD, on the other hand, is independent of the NMDA channels (Calabresi et al., 1996; Yamamoto et al., 1999; Partridge et al., 2000) and can thus also take place in vivo during less depolarized membrane states. Basically, we distinguish three pathways that can lead to the modification of the synaptic strength of the Glu synapse (see Figure 12A).
1. Glu-NMDA-Ca2+-CaMKII pathway (right side of the diagram). This is the traditional pathway involved in the generation of LTP. During elevated calcium levels, the increased activity of Calcium/Calmodulin-dependent Protein Kinase II (CaMKII) leads to an increase in the phosphorylated receptors at the membrane. A counteraction arises, however, from Protein Phosphatase-2B (Calcineurin, PP-2B), which is stimulated by Ca2+ and leads to a dephosphorylation of DARPP32 (dopamine and cyclic adenosine 3'-5'-monophosphate-regulated phosphoprotein, 32 kDa) (King et al., 1984; Nishi, Snyder, & Greengard, 1997). This acts by dephosphorylating the receptors along the second pathway, discussed next. The complete cascade involved in the first pathway is more complex than drawn here (see section 6.1 for a more detailed discussion of the literature).

2. DA-D1-PKA-pDARPP32 pathway (outer left part of the diagram). The action of dopamine onto D1-like receptors leads to an elevated level of cyclic AMP (cAMP) and increased Protein Kinase A (PKA) stimulation (Stoof & Kebabian, 1981). PKA can act directly by phosphorylating AMPA channels (Roche, O'Brien, Mammen, Bernhardt, & Huganir, 1996; Tingley et al., 1997). At the same time, PKA shifts the equilibrium of DARPP32 toward its active phosphorylated form (pDARPP32) by phosphorylating its threonine-34 residue (T34 in the diagram: DA induced, Nishi et al., 1997; adenosine induced, Svenningsson et al., 1998), which inhibits the action of PP-1 (Protein Phosphatase-1; Hemmings, Greengard, Tung, & Cohen, 1984). The protein PP-1 normally acts by dephosphorylating Glu receptors. Thus, for this pathway, we find an inhibition (by p-DARPP32) of dephosphorylation (by PP-1). In a side path, PKA also helps phosphorylate L-type Ca2+ channels (Surmeier, Bargas, Hemmings, Nairn, & Greengard, 1995), which adds to the calcium pool.

3. DA/NO-D2-PKG-pDARPP32 pathway (inner left part of the diagram).
The action of dopamine onto D2-like receptors seems less well understood. Recent observations from Calabresi et al. (2000) suggest that at least a twofold action is possible. It seems that dopamine can act directly on the D2-like receptors, which leads to an inhibition of PKA (Centonze et al., 2001). However, at the same time, there exists an indirect pathway via nitric oxide (NO) synthase-positive interneurons (Morris et al., 1997). At these neurons, dopamine acts on D1-like receptors, which leads to the release of NO at their terminals.8 NO enhances the level of cyclic guanosine monophosphate (cGMP) (Altar, Boyar, & Kim, 1990) and stimulates Protein Kinase G (PKG). PKG increases the level of phosphorylated DARPP32, this time, however, phosphorylating Thr34 and Thr75 (T75 in the diagram; Calabresi et al., 2000). This type of phosphorylated DARPP32 acts inhibitorily on PKA (Bibb et al., 1999) but cannot efficiently inhibit PP-1. As a consequence, dephosphorylation can take place more easily at the Glu receptors.

8 This explanation is currently used to explain why a conjoint stimulation of D1-like and D2-like receptors (however, at different neuron subtypes) leads to LTD, while the stimulation of D1-like receptors only will lead to LTP (Centonze et al., 2001).

In summary, pathway 1 acts actively phosphorylating, pathway 2 acts mainly "permissive," preventing dephosphorylation of the Glu receptors, and pathway 3 facilitates dephosphorylation. The question arises under which membrane-potential conditions these pathways are likely to be triggered. It is known that in vivo striatal projection neurons fluctuate between a hyperpolarized "down-state" and a depolarized "up-state" (for a review, see Nicola et al., 2000). The down-state corresponds approximately to the resting membrane level in a slice. Dopamine exerts different effects in the down- and the up-state. In the down-state, dopamine increases, via D1-like receptors, the activity of rectifying potassium currents (Pacheco-Cano, Bargas, Hernandez-Lopez, Tapia, & Galarraga, 1996) and thereby counteracts possible depolarizing influences. Near the up-state, one still observes reduced excitability from the action of the D1 receptors, which leads to a decrease of Na+ as well as N- and P-type Ca2+ currents (Calabresi, Mercuri, Stanzione, Stefani, & Bernardi, 1987; Surmeier & Kitai, 1993; Surmeier et al., 1995). If sustained depolarization (e.g., from the corticostriatal pathway) drives the neuron into the up-state, dopamine will act differently. In this case, it influences, again via the D1-like receptors, an L-type Ca2+ current (Hernandez-Lopez, Bargas, Surmeier, Reyes, & Galarraga, 1997) and stabilizes the depolarization this way.
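As a crude summary of the three pathways, one can write the net effect on Glu-receptor phosphorylation as kinase drive minus PP-1 activity, with PP-1 inhibited by the PKA branch but not by the PKG branch. All variables and rate constants in this sketch are hypothetical; it is a caricature of the verbal description above, not a biophysical model:

```python
def net_phosphorylation_rate(camkii, pka, pkg, k_kin=1.0, k_pp1=1.0):
    """Caricature of pathways 1-3 (all names and constants hypothetical).

    Pathway 1: CaMKII actively phosphorylates Glu receptors.
    Pathway 2: PKA -> pDARPP32(Thr34), which inhibits PP-1 ("permissive").
    Pathway 3: PKG suppresses the PKA branch, so PP-1 stays active
               and dephosphorylation proceeds more easily.
    """
    darpp32_thr34 = pka * (1.0 - pkg)             # active (Thr34) form
    pp1_activity = k_pp1 * (1.0 - darpp32_thr34)  # PP-1 left uninhibited
    return k_kin * camkii - pp1_activity          # net phosphorylation drive

print(net_phosphorylation_rate(camkii=1.0, pka=1.0, pkg=0.0))  # > 0: LTP-like
print(net_phosphorylation_rate(camkii=0.2, pka=0.0, pkg=1.0))  # < 0: LTD-like
```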
Thus, one could say that in both states dopamine introduces a hysteresis into the membrane-potential behavior, trying to keep the potential at its current level. This principle has recently been modeled in some detail by Gruber, Solla, Surmeier, and Houk (2003), who report such a hysteresis effect in a membrane model that includes some aspects of the D1 cascade and L-type Ca2+ currents. Note that such a hysteresis will also act as a thresholding mechanism, filtering weak inputs and letting stronger ones through to the pallidum (Yim & Mogenson, 1982; Brown & Arbuthnott, 1983; Toan & Schultz, 1985). Above, we stated that in vivo LTP requires an elevated membrane-potential level to remove the Mg2+ block from the NMDA channels. Figure 12B shows in a schematic way how this would affect the concentrations of the different compounds. Active NMDA channels will lead to a strongly elevated level of Ca2+, and this will lead, via CaMKII, to an increased tendency to phosphorylate Glu receptors (pathway 1). Thus, plain NMDA influence, without the action of dopamine, can already lead to LTP. This conforms with the observation that pre- and postsynaptic stimulation without DA can induce LTP. It seems, however, that the depolarization level reached by this
protocol is often too weak to allow for a large enough Ca2+ influx. Thus, LTD often occurs, which is normally associated with lower levels of Ca2+ (Malenka & Nicoll, 1999). However, if at the same time D1-like receptors are activated, the depolarization gets stabilized (up-state), and pathway 2 becomes active. The increased action of L-type Ca2+ channels by PKA will increase the calcium level, and the active form of p-DARPP32 is substantially enhanced. This removes the possible dephosphorylation via PP-1 and enhances the LTP effect (preventing LTD). A balancing effect arises, though, from the increased level of Ca2+, which stimulates PP-2B, which in turn reduces p-DARPP32 (glutamate-mediated dephosphorylation of p-DARPP32; King et al., 1984; Nishi et al., 1997). During a less depolarized membrane state (see Figure 12C), pathway 1 is largely inactive, because less Ca2+ can enter via the NMDA channels. This leads to a strongly reduced tendency of Glu-receptor phosphorylation. On the other hand, the lack of Ca2+ also leads to a reduced stimulation of PP-2B. The final balance of these two opposing effects, however, shifts the DARPP32 equilibrium toward its dephosphorylated, inactive form. The action of dopamine in conjunction with LTD is less well understood. If D2-like receptors get stimulated via the indirect NO pathway, DARPP32 gets activated via PKG. In vitro studies suggest that PKA and PKG phosphorylate threonine-34 of DARPP32 in a similar way, but that PKG in addition phosphorylates Thr75 (Calabresi et al., 2000). This difference may explain why p-DARPP32 activated by PKG is less efficient in inhibiting PP-1, but final, conclusive experimental evidence for this is still missing. At the moment, however, it seems safe to say that, taken together, the expected level of active p-DARPP32 (Thr34) should be substantially lower than in the depolarized state discussed above.
As a consequence, PP-1 does not get inhibited much, and it can exert its dephosphorylating action. In parallel, the PKA pathway is inhibited by the action of the D2-like receptor cascade. Thus, the direct phosphorylating action of PKA also does not take place. As the final result, we expect a reduction of phosphorylated Glu receptors, which leads to LTD.

5.3 Reassessment of the Different Learning Rules in View of Synaptic Biophysics. The basic neuronal formalism for the TD rule has been given in equation 3.14 as

    ωi ← ωi + α δ(t) xi(t),    (5.1)

which states that a multiplication of the prediction error signal with the (trace of the) predictive stimulus should drive the synaptic weight change. This rule would in principle permit learning by the correlation of the nigrostriatal DA input with the corticostriatal Glu input only, without requiring postsynaptic activity. This, however, has not been found (see the case marked by an asterisk in Figure 11). The fact that the δ signal of the TD rule is in turn derived from the neuron's output may, however, implicitly account for this, because
[Figure 13 appears here; its labels include a trace (or other temporal representation) of x1, the δ and DA signals, the plastic weight ω1, the reflex input x0 ("reward") with fixed weight ω0 = 1, the correlation x, BP, and the output v(t).]

Figure 13: Schematic diagram of the modulatory action of dopamine inputs (DA), calculated by a reciprocal TD architecture, on the synaptic weight modification of a corticostriatal synapse ω1 (gray box). The main mechanism for the weight change is assumed to be Hebbian, by means of the correlation x between the trace of a presynaptic signal (x1) and a postsynaptic signal (BP) in the form of a backpropagating spike, which arises in response to the input x0.
unavoidably there will be postsynaptic activity at the striatal neuron as soon as this circuit becomes activated. The discussion of the biophysical mechanisms above strongly supports the view that the "traditional" Hebbian correlation between input (presynaptic) signal and output (postsynaptic) signal should be fundamental to the weight change, while the DA signal performs a strong modulatory action. Currently it is widely believed that Hebbian correlation between input and output, which can lead to synaptic plasticity, is mediated by backpropagating action potentials (BP spikes) or dendritic spikes (DS spikes), which are actively or passively transmitted to the regarded synapse (Stuart & Sakmann, 1994; Buzsaki, Penttonen, Nadasdy, & Bragin, 1996; Hoffman, Magee, Colbert, & Johnston, 1997; Migliore, Hoffman, Magee, & Johnston, 1999; for a review, see Linden, 1999; but see Goldberg, Holthoff, & Yuste, 2002). The underlying biophysical mechanisms will be discussed later (section 6.1). In this simplified connection scheme, it is, however, conceivable that the activation of the circuit that calculates the δ signal will be associated with a backpropagating spike that travels to the synapse and provides the necessary postsynaptic depolarization. This would lead to a three-factor rule combining input, output, and the DA signal, where the DA signal and the postsynaptic activation are causally coupled (Miller et al., 1981; Schultz, 1998; Suri et al., 2001). There are, however, still some problems remaining. For example, if the result holds that pure pre-post correlation without a DA signal will drive LTP and LTD, then we should not use the DA input in a multiplicative way at all. In addition, the problem occurs that we have to deal with the effects of all causally connected
events whenever presynaptic activity will drive the cell into (postsynaptic) firing. This occurs in a rather complex way in all neuronal TD models, which use a serial compound stimulus representation. Thus, the question of how all these inputs will drive the cell before and during learning, generating BP spikes, which by themselves will influence learning (together with DA), needs to be carefully addressed when trying to model temporal sequence learning by a rule that combines inputs and outputs causally. While the answer to this is still unknown, it is nonetheless clear that a plain three-factor rule is too simple to account for this. Figure 13 tries to capture these aspects in a schematic way. The circuit on the right side would be suited to calculate the δ signal by computing v(t) − v(t − 1), using delayed inhibition via an interneuron. Synaptic modification of the x1 synapse requires the correlation between pre- and postsynaptic activity (see the gray box with x), and the postsynaptic activity is supposed to be transmitted to the synapse via a BP spike (dashed line). The DA signal acts as a modulator on this correlation, but we do not rule out the possibility that it could have a gating (multiplicative) influence, preventing LTP when it is absent. In comparison to TD, ISO learning does not attempt to model DA responses. In section 6.2.3, we will show that its algorithmic structure is more strongly related to models of spike-timing-dependent plasticity, which correlate presynaptic signals with postsynaptic signals only (see below).

6 Fast Differential Hebb—Relations to Spike-Timing-Dependent Plasticity

The discussion has shown that there are certain—at least formal—similarities between the TD rule, the three-factor rule, and conventional correlation-based learning. Specifically, differential Hebb rules in general lead to weight growth for causally related temporal sequences, while weight shrinkage occurs when the signals are coupled in the reverse (anticausal) order.
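This bimodal sign dependence can be reproduced by a minimal discrete-time differential Hebbian rule, Δω = μ x̄1(t) Δv(t); the pulse shapes, time constants, and lags below are illustrative choices, not parameters from any of the cited models:

```python
import numpy as np

def trace(onset, T=400, tau=20.0):
    """Low-pass-filtered unit pulse (an alpha-function-like trace)."""
    t = np.arange(T, dtype=float)
    s = np.clip(t - onset, 0.0, None)     # time since pulse onset
    return s / tau * np.exp(-s / tau)

def weight_change(lag, mu=0.1, T=400):
    """Total differential-Hebbian change: sum of mu * u1(t) * v'(t)."""
    u1 = trace(100, T)                    # plastic input trace
    v = trace(100 + lag, T)               # output, driven by a second input
    dv = np.diff(v, prepend=0.0)          # discrete derivative of the output
    return mu * float(np.sum(u1 * dv))

print(weight_change(+30) > 0)   # causal order: u1 precedes v -> weight growth
print(weight_change(-30) < 0)   # anticausal order -> weight shrinkage
```

The sign flips because the input trace overlaps the rising phase of the output for causal pairings, but only its falling phase for anticausal ones.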
Such a bimodal characteristic was first observed in the classical model of Sutton and Barto (1981), where inhibitory conditioning was found for negative interstimulus intervals (ISIs) between CS and US (an unwanted effect in this context). On much shorter timescales, effects have been observed at real neurons, where the sequence of pre- and postsynaptic activation influences synaptic modification. A different set of differential Hebbian learning rules has been developed to explain these effects. We will discuss some of these algorithms in this section. We will also treat the biophysics of spike-timing-dependent plasticity in order to put the learning rules into their biophysical context. 6.1 Spike-Timing-Dependent Plasticity: A Short Account. 6.1.1 Basic Observations. Hebbian (correlation-based) learning requires that pre- and postsynaptic spikes arrive within a certain small time window, which leads to an increase of the synaptic weight (Hebb, 1949). Originally
[Figure 14 appears here; axes: synaptic change (%) versus T (ms), over roughly −80 to +80 ms.]

Figure 14: Experimental data and exponential curve fits for spike-timing-dependent plasticity in a hippocampal glutamatergic neuron after repetitive correlated firing (pulses at 1 Hz). The right part of the diagram shows LTP, the left part LTD. LTP occurs when the presynaptic activation precedes postsynaptic firing (defined as T > 0); for LTD, the order of pre- and postsynaptic activation is reversed (defined as T < 0). Recompiled from Bi and Poo (2001).
it had been supposed that the temporal order of both signals is irrelevant (Bliss & Lomo, 1970, 1973; Bliss & Gardner-Medwin, 1973). However, rather early, first indications arose that temporal order is indeed important (Levy & Steward, 1983; Gustafsson, Wigstrom, Abraham, & Huang, 1987; Debanne, Gahwiler, & Thompson, 1994). This notion has been extended into a clear picture of spike-timing-dependent plasticity (STDP) by the experiments of Markram, Lübke, Frotscher, and Sakmann (1997) as well as Magee and Johnston (1997). Markram et al. (1997) used whole-cell recordings from two interconnected layer V cortical pyramidal neurons, and they found that repetitive costimulation of the two cells, with postsynaptic spikes following presynaptic spikes by 10 ms, induced LTP, while reversing the pulse protocol would induce LTD. Magee and Johnston (1997) found in the hippocampus that subthreshold synaptic inputs into a CA1 pyramidal cell could amplify backpropagating spikes when paired with postsynaptic stimulation. This results in Ca2+ influx and LTP. Subsequently, a large number of studies have confirmed that the expression of LTP or LTD in many cells depends on the order of pre- and postsynaptic stimulation (Bell, Han, Sugawara, & Grant, 1997; Bi & Poo, 1998; Debanne, Gahwiler, & Thompson, 1998; Zhang, Tao, Holt, Harris, & Poo, 1998; Egger, Feldmeyer, & Sakmann, 1999; Feldman, 2000; Nishiyama, Hong, Mikoshiba, Poo, & Kato, 2000). Figure 14 shows a typical example of how a synapse in the hippocampus changes when applying a protocol of pairing single pre- and postsynaptic depolarizations (Bi & Poo, 2001). The curve shows that this synapse decreases in strength when the postsynaptic signal precedes the presynaptic signal (defined here as T < 0), while it grows if the temporal order is reversed (thus, T > 0) (Markram et al., 1997; Magee & Johnston, 1997; Bi & Poo, 2001). T denotes the temporal interval between post- and presynaptic signals (T := tpost − tpre). This fairly symmetrical diagram (see Figure 14) represents in some sense the most generic type of STDP. However, several other timing-dependent synaptic modification curves have also been found, which shall only be briefly mentioned here (for a review, see Bi, 2002; Roberts & Bell, 2002). For example, Feldman (2000) found that in layer II/III pyramidal cells, the LTD part of the curve is strongly extended. In the cerebellum (electric fish), inverse STDP curves have been observed, where T > 0 induces LTD and T < 0 LTP (Bell et al., 1997; Han, Grant, & Bell, 2000). Other cell types (e.g., spiny stellate cells in layer IV of the cortex; Egger et al., 1999) seem to follow a more traditional Hebbian learning characteristic. In addition, one observes that individual STDP curves can show a high degree of variability, even within the same cell class, and that the zero crossing between the LTP and LTD parts of the curve does not necessarily occur at T = 0.

6.1.2 Synaptic Biophysics of STDP. Currently it is widely believed that Hebbian correlation between input and output, which can lead to synaptic plasticity, is mediated by backpropagating action potentials (BP spikes), which are actively or passively transmitted to the regarded synapse (Stuart & Sakmann, 1994; Buzsaki et al., 1996; Hoffman et al., 1997; Migliore et al., 1999) by means of passive and active properties of dendrites (Lasser-Ross & Ross, 1992; Regehr, Konnerth, & Armstrong, 1992; Johnston, Magee, Colbert, & Christie, 1996).
At distal dendrites, where a BP spike might already have faded, dendritic spikes can serve the same purpose (Stuart, Spruston, Sakmann, & Häusser, 1997; Golding, Kath, & Spruston, 2001; Saudargiene, Porr, & Wörgötter, 2004; Mehta, 2004). It is generally believed that a transient increase in intracellular Ca2+ is of crucial importance for STDP, as in most other forms of synaptic plasticity (Artola & Singer, 1993; Zucker, 1999). In hippocampal slices, increased Ca2+ influx is correlated with the induction of LTP (Magee & Johnston, 1997). In addition, STDP is found to depend critically on NMDA receptors, which are highly permeable to Ca2+ (Magee & Johnston, 1997; Markram et al., 1997; Bi & Poo, 1998; Debanne et al., 1998; Zhang et al., 1998; Feldman, 2000; Nishiyama et al., 2000). A simple model of the mechanisms that underlie LTP and LTD can be summarized as follows: high-level intracellular Ca2+ elevation activates the calcium/calmodulin-dependent kinase II (CaMKII; see Figure 12) as well as other protein kinases and leads to subsequent LTP, whereas moderate-level Ca2+ elevation activates phosphatases (e.g., calcineurin) and results in LTD (Lisman, 1989; Malenka et al., 1989; Malinow, Schulman, & Tsien, 1989; Artola & Singer, 1993; Mulkey, Endo, Shenolikar,
& Malenka, 1994). In a first approximation, this model can also be applied to STDP: when presynaptic stimulation immediately precedes a postsynaptic spike, the backpropagating postsynaptic spike can remove the Mg2+ block at the NMDA receptors (Mayer, Westbrook, & Guthrie, 1984; Nowak, Bregestovski, Ascher, Hebert, & Prochiantz, 1984), thereby causing high-level Ca2+ influx into the synapse (Magee & Johnston, 1997; Koester & Sakmann, 1998), whereas when presynaptic stimulation follows the postsynaptic spike, only low-level Ca2+ influx may occur through voltage-gated calcium channels and the partially blocked NMDA receptors. This model, however, poses a problem at large values of T. Experimentally, one still observes weak LTP, but the model would suggest that Ca2+ influx should be low for large T, leading instead to LTD. Thus, one would expect to find a second LTD window at larger T. However, such an additional LTD window was not observed in most studies of STDP (Bi & Poo, 1998; Debanne et al., 1998; Zhang et al., 1998; Feldman, 2000; Sjöström, Turrigiano, & Nelson, 2001; Yao & Dan, 2001; Froemke & Dan, 2002), except for hippocampal slices, where spike timings of 20 ms indeed resulted in LTD (Nishiyama et al., 2000). A possible explanation for the missing secondary LTD window would be that not only the level but also the transient of Ca2+ determines whether LTP or LTD will be induced. Indeed, LTP or LTD induced by postsynaptic photolysis of caged Ca2+ depends critically not only on the level but also on the time course of the light-induced Ca2+ transient (Yang, Tang, & Zucker, 1999; Zucker, 1999). In contrast, when the process of intracellular Ca2+ elevation is slow, the condition for the subsequent enzymatic reaction may be closer to a steady-state situation; therefore, the overall level of Ca2+ could become the only crucial parameter in determining the resultant synaptic modification.
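The level-dependent part of this picture is often written as a two-threshold rule on the calcium concentration, as in the Shouval et al. model shown in Figure 15; the threshold values below are arbitrary placeholders, not fitted parameters:

```python
def omega(ca, theta_d=0.35, theta_p=0.55):
    """Two-threshold calcium rule (cf. thresholds "D" and "P" in Figure 15D).

    Below theta_d: no change; between the thresholds: LTD (moderate Ca2+);
    above theta_p: LTP (high Ca2+). Threshold values are placeholders.
    """
    if ca < theta_d:
        return 0.0
    if ca < theta_p:
        return -1.0   # LTD
    return 1.0        # LTP

print([omega(c) for c in (0.1, 0.45, 0.9)])  # [0.0, -1.0, 1.0]
```

A rule of this form captures the level dependence only; the transient (time-course) dependence discussed above would require an additional dynamical variable.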
Currently, there is no direct evidence for this hypothesis, but the complexity of the postsynaptic density argues for a differential action when comparing a steady-state situation with a situation of a steep Ca2+ gradient. For example, Ca2+ influxes from different channels appear to preferentially activate different kinase signaling pathways (Deisseroth, Heist, & Tsien, 1998; Graef et al., 1999; Dolmetsch, Pajvani, Fife, Spotts, & Greenberg, 2001; West et al., 2001). Furthermore, it is known that CaMKII can respond differently to Ca2+ oscillations at different frequencies (De Koninck & Schulman, 1998). Additional complexity comes from the dynamics of intracellular Ca2+ stores (Berridge, 1998; Svoboda & Mainen, 1999; Rose & Konnerth, 2001), mainly consisting of the lumen of the endoplasmic reticulum, which stores Ca2+ at a high concentration. These stores can be accessed through ryanodine receptors as well as inositol 1,4,5-trisphosphate (IP3) receptors (Berridge, 1998), which are both Ca2+ sensitive. In addition, IP3 receptors are also activated by metabotropic glutamate receptors. Note, however, that Ca2+ release from intracellular stores is generally slow but long lasting and more global as compared to Ca2+ influx through synaptic channels. The role of Ca2+ stores in STDP is unclear, but it has been shown that in CA1 of the hippocampus, inhibiting store release blocks the induction of LTD and sometimes reverses it to LTP (Futatsugi et al., 1999; Nishiyama et al., 2000). Such effects, however, may be highly sensitive to the experimental protocol and cell type. In summary, it seems that a rapid change to a high-level Ca2+ concentration at the postsynaptic density, probably through activated NMDA receptors, is most likely responsible for inducing LTP. Slow and prolonged Ca2+ increase, on the other hand, possibly from different sources, will lead to LTD.

Figure 15: Theoretical results for STDP. (A) Analytically calculated final weight distributions after reaching equilibrium, from the model of van Rossum et al. (2000). The unimodal curve is obtained from their original model, assuming a multiplicative weight normalization mechanism. The bimodal curve is obtained when potentiation and depression do not depend on the weight (additive model with hard boundaries). (B-D) Ca2+ concentration (B, C) and weight-change curve (D) obtained by Shouval et al. (2002). The insets in B and C show the timing of the backpropagating spike ("needle") relative to the NMDA potential (elongated curve). Only for positive T (T = 10 ms, C) is a strong rise in Ca2+ observed, while it remains moderate for T < 0 (T = −10 ms, B). The inset in D shows the shape of Shouval's function: LTP is assumed to occur if the Ca2+ concentration passes threshold "P," while LTD occurs if it stays below "P" but above threshold "D." The weight-change curve (D) has the characteristic shape, but an additional negative peak is found for large T, which may not correspond to experimental findings, as discussed in section 6.1.

6.2 Models of Fast Differential Hebbian Learning. During the past few years, a wide variety of different models for STDP has been designed,
Temporal Sequence, Prediction, and Control
289
which can roughly be subdivided into two groups of different biophysical complexity: spike-based and rate-based. 6.2.1 Network Models of STDP. The first group of models, which began to emerge around 1996 (Gerstner, Kempter, van Hemmen, & Wagner, 1996) and thus anticipated the experimental findings published in 1997 (Markram et al., 1997; Magee & Johnston, 1997), is relatively abstract and does not implement any biophysical mechanism to generate the weight change curve. Instead, these models assume a certain shape of the weight change curve as the learning rule (Gerstner et al., 1996; Song, Miller, & Abbott, 2000; Rubin, Lee, & Sompolinsky, 2001; Kempter, Leibold, Wagner, & van Hemmen, 2001b; Gerstner & Kistler, 2002), most often an exponential fit to both parts of the data sets in Figure 14. These models look at general network properties that arise from learning. It has long been known that plain Hebbian learning without additional mechanisms will lead to the divergence of all synaptic weights in a network, because even random correlations will lead to weight growth. Thus, network stability has been one of the central issues in theories of correlation-based learning, and several add-on mechanisms have been discussed in the literature to solve this problem. In view of this problem, the property that LTP and LTD can occur at the same synapse, depending on the pre-post correlations, attracted the attention of theoreticians. Indeed, it was found that STDP leads to self-stabilization of weights and rates in the network as soon as the LTD side dominates over the LTP side in the weight change curve (Song et al., 2000; Kempter, Gerstner, & van Hemmen, 2001a). As a second problem of high theoretical interest, networks need to operate competitively. Normally it does not make sense to drive all synapses into similar states by means of learning, because the network would then no longer contain inner structure.
Instead, most real networks in the brain are highly structured, forming maps or other subdivisions. Thus, the problem of synaptic competition also needs to be solved by the applied learning mechanisms. Accordingly, network STDP models have more recently also been successfully applied to generate (i.e., to develop) some physiological properties, such as map structures (Song & Abbott, 2001; Leibold, Kempter, & van Hemmen, 2001; Kempter et al., 2001b; Leibold & van Hemmen, 2002), direction selectivity (Buchs & Senn, 2002; Senn & Buchs, 2003), or temporal receptive fields (Leibold & van Hemmen, 2001). In addition, it was found that such networks can store patterns (Abbott & Blum, 1996; Seung, 1998; Abbott & Song, 1999; Fusi, 2002). In general, one can subdivide the different network models of STDP into those that use an additive versus a multiplicative learning rule. Additive models assume that changes in synaptic weights do not scale with synaptic strength, and hard boundaries are imposed to prevent divergence (Abbott & Blum, 1996; Gerstner et al., 1996; Eurich, Pawelzik, Ernst, Cowan, & Milton, 1999; Kempter, Gerstner, & van Hemmen, 1999; Roberts, 1999; Song et al.,
2000; Levy, Horn, Meilijson, & Ruppin, 2001; Cateau, Kitano, & Fukai, 2002). These models exhibit strong competition but produce unstable dynamics, most often resulting in binary synaptic distributions (see Figure 15A). As an additional problem, one finds that these models often subdivide the neuronal population not according to the actual input correlations but rather as a consequence of the self-amplification of small, initially existing deviations from symmetry. This effect also results from the strong competition taking place in these networks. Multiplicative models, on the other hand, assume a linear attenuation of the weight change as soon as the upper or lower boundaries are approached (Kistler & van Hemmen, 2000; van Rossum, Bi, & Turrigiano, 2000; Rubin, Lee, & Sompolinsky, 2001). Such a mechanism seems to correspond more closely to the experimental data, because Bi and Poo (1998) have reported that the growth of synaptic weights becomes smaller if the weight is already strong. The smoothing effects of such an attenuation lead to stable network dynamics, but because of the reduced competition, all synapses are driven toward a similar equilibrium value (see Figure 15A). In a recent model devised by Gütig, Aharonov, Rotter, and Sompolinsky (2003), it has been suggested to introduce a nonlinearity into the STDP rule to control the attenuation of the weight change. As a consequence, this model sits between the hard-threshold additive approaches and the linear-attenuation multiplicative models, and the authors show that competition and stability are obtained across a much larger parameter space. Rate-based models rely on the average firing rate of a neuron, not on its precise firing times (Sutton & Barto, 1981; Kosko, 1986; Klopf, 1986, 1988; Kempter et al., 2001b; Kistler, 2002; Porr & Wörgötter, 2002, 2003a). As such, they are at first sight unrelated to the spike-based models.
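The contrast between the additive and multiplicative rule families discussed above can be sketched in a few lines. The double-exponential window and all parameter values below are illustrative assumptions, not fits to any of the cited models; the LTD amplitude is made slightly dominant, as required for self-stabilization.

```python
import numpy as np

# Double-exponential STDP window (an exponential fit to both branches of
# the weight change curve, cf. Figure 14). Amplitudes and time constants
# are illustrative, with the LTD side slightly dominant.
def stdp_window(dt, a_plus=0.005, a_minus=0.00525, tau=20.0):
    """Raw weight change for a spike interval dt = t_post - t_pre (ms)."""
    return np.where(dt > 0, a_plus * np.exp(-dt / tau),
                    -a_minus * np.exp(dt / tau))

def update_additive(w, dt, w_max=1.0):
    """Additive rule: the change does not scale with w;
    hard boundaries [0, w_max] prevent divergence."""
    return np.clip(w + stdp_window(dt), 0.0, w_max)

def update_soft_bounds(w, dt, w_max=1.0):
    """One common multiplicative (soft-bound) variant: the change is
    attenuated linearly as w approaches either boundary."""
    dw = stdp_window(dt)
    return w + np.where(dw > 0, dw * (w_max - w), dw * w)
```

Driving a population of synapses with random pre-post intervals pushes additive weights toward the hard boundaries (a bimodal distribution), whereas the soft-bound weights settle around an intermediate equilibrium, in line with Figure 15A.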
The earlier models, which appeared before 1990, used an implicit functional description (an equation), which relied on the derivative of the signal(s) to define their learning rule. The same "differential Hebbian" relation was reintroduced by Gerstner et al. in 1996 in the context of a spike-based model and by means of a parametric description (a bimodal curve) of the learning rule. More recently, physiological results became available showing that spike timing as well as rate can influence synaptic plasticity (Sjöström et al., 2001; Froemke & Dan, 2002). Thus, the question arises of how to relate the two model classes. Roberts (1999) as well as Xie and Seung (2000) have observed that it is possible to derive a rate-based model from a spiking model, but the opposite cannot be achieved without additional assumptions. Gerstner and Kistler (Gerstner & Kistler, 2002; Kistler, 2002) provide a relatively complex unifying approach to this end, using their spike-response model of firing, which, however, makes several oversimplifying assumptions with respect to the biophysics of neurons. 6.2.2 Critique of the Network STDP Models. The central advantage of network STDP models lies in the fact that they can be treated mathematically
to a large extent. Thus, theoretical statements, for example, about network stability and competition, could be obtained, albeit at the cost of biophysical realism. In network models of STDP, the learning rule (the assumed weight change curve) remains the same everywhere, regardless of the local properties of the cell. The complexity of the biophysical mechanisms that control the Ca2+ influence at any given synapse does not support this assumption. It seems much more likely that different shapes of weight change curves can exist, for example, along a dendrite or at spines. As a consequence, all network models of STDP are rather homogeneous in their learning behavior. Furthermore, these models do not attempt to implement different cell types, for example, inhibitory cells. This, however, may be problematic in view of the problems of stability and competition, and the question arises to what degree these problems would still surface in more realistic networks. In addition, these models normally assume that only pairwise interactions exist between pre- and postsynaptic signals (adiabatic condition). Thus, bursts of input spikes, common to most brain areas, are not considered, although they could substantially influence local Ca2+ gradients. Similarly, triplets (or multiplets) of spikes, which would normally create complex sequences of pre- and postsynaptic events (Froemke & Dan, 2002), are also not considered in these models. 6.2.3 Biophysical Models of STDP and Their Relation to Differential Hebb. These models are no longer concerned with the effect of STDP on network behavior; instead, they want to explain how the characteristic shape of the STDP curves is generated by means of intracellular mechanisms. Some use kinetic models (Senn, Markram, & Tsodyks, 2000; Castellani, Quinlan, Cooper, & Shouval, 2001).
Others follow a state-variable approach, trying to find a set of appropriate differential equations to describe STDP (Rao & Sejnowski, 2001; Abarbanel, Huerta, & Rabinovich, 2002; Karmarkar & Buonomano, 2002; Karmarkar, Najarian, & Buonomano, 2002; Shouval, Bear, & Cooper, 2002; Abarbanel, Gibb, Huerta, & Rabinovich, 2003). The kinetic models implement a rather high degree of biophysical detail, sometimes including calcium, transmitter, and enzyme kinetics. The power of such models lies in the chance to understand and predict intra- or subcellular mechanisms, for example, the aspect of AMPA receptor phosphorylation (Castellani et al., 2001), which is known to centrally influence synaptic strength (Malenka & Nicoll, 1999; Lüscher & Frerking, 2001; Song & Huganir, 2002). Furthermore, both approaches (Senn et al., 2000; Castellani et al., 2001) can show a relation between the designed model and Hebbian as well as BCM rules (Bienenstock, Cooper, & Munro, 1982), which propose a sliding synaptic modification threshold. (For an elaboration on the mathematical similarity between STDP and BCM, see Izhikevich & Desai, 2003.)
The approaches of Rao and Sejnowski (2001), Shouval et al. (2002), and Karmarkar and coworkers (Karmarkar & Buonomano, 2002; Karmarkar, Najarian, & Buonomano, 2002) follow a state-variable approach. Rao and Sejnowski (2001) implemented a rule based on activity differences, which makes their approach more closely related to differential Hebbian learning than to TD learning (Dayan, 2002). The other two models investigate the effects of different calcium concentration levels by assuming certain (e.g., exponential) functional characteristics to govern their changes. This allows them to address the question of how different calcium levels will lead to LTD or LTP (Nishiyama et al., 2000), and one of the models (Karmarkar & Buonomano, 2002) proposes to employ two different coincidence detector mechanisms to this end. The model of Shouval et al. (2002) (see Figures 15B–15D) implicitly assumes a differential Hebbian characteristic through the bimodal shape of the function (see Figure 15D, inset) that they use to capture the calcium influence. This group discussed, among other aspects, the role of the shape of the BP spike, and they concluded that a slow after-depolarization potential (more commonly known as repolarization) must exist in order to generate STDP. Thus, in their study, the shape of the BP spike influences the shape of the weight change curve. In general, they find that the LTP part of the curve is stronger than the LTD part. This observation would prevent self-stabilization of the activity in network models (Song et al., 2000; Kempter et al., 2001a), which require a larger LTD part to achieve this effect. Interestingly, however, Shouval et al. find a second LTD part for larger positive values of T, which could perhaps be used to counteract such an activity amplification. In the hippocampus, there is currently conflicting evidence as to whether such a second LTD part exists for large T (Pike, Meredith, Olding, & Paulsen, 1999; Nishiyama et al., 2000).
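The two-threshold calcium-control idea behind Shouval's function (thresholds "D" and "P" in the Figure 15D inset) can be sketched as a simple function of the Ca2+ level. The sigmoidal construction, threshold positions, and steepness below are illustrative assumptions, not the published parameterization.

```python
import numpy as np

# Sketch of a calcium-control function with an LTD threshold theta_d ("D")
# and an LTP threshold theta_p ("P"): below D nothing happens, between D
# and P the weight decreases, above P it increases. All numbers are
# illustrative, not Shouval et al.'s parameterization.
def omega(ca, theta_d=0.35, theta_p=0.55, beta=80.0):
    """Sign and magnitude of the weight change as a function of Ca2+."""
    sig = lambda x: 1.0 / (1.0 + np.exp(-beta * x))
    return sig(ca - theta_p) - 0.5 * (sig(ca - theta_d) - sig(ca - theta_p))
```

In this picture, a rapid NMDA-driven Ca2+ transient that crosses P yields LTP, while a slow, prolonged elevation that stays between D and P yields LTD, matching the summary in section 6.1.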
ISO learning generically produces a bimodal weight change curve (see Figure 3B). Thus, at the right timescale, it should be possible to use this algorithm also in the context of STDP, provided there is a justifiable match between the components of ISO learning and the biophysical synaptic mechanisms. Figure 16 shows how to associate the components of a membrane model with those of ISO learning. We assume a plastic NMDA synapse ω1, which receives the presynaptic activation.9 An NMDA potential ĝN from this synapse is shown in the inset. This signal is matched to x̄1 = x1 ∗ h1 from ISO learning, where the asterisk denotes a convolution. The postsynaptic signal x0 arises from a BP spike. For simplicity, the BP spike is also modeled as a conductance change, integrated into the membrane potential V. Thus, the ISO learning rule (see equation 3.9) turns into

dω1/dt = µ x̄1 v′ = µ ĝN V′,    (6.1)

where V′ is the derivative of the membrane potential. It is known that the shape of a BP spike changes with the parameters of the dendrite, getting flatter and more spread out at distal dendrites. In particular, at distal dendrites, BP spikes may no longer play a large role (Stuart et al., 1997; Golding et al., 2001), and local signals, such as dendritic spikes, will provide the depolarizing signal (Schiller, Schiller, Stuart, & Sakmann, 1997; Häusser, Spruston, & Stuart, 2000; Golding, Staff, & Spruston, 2002; Saudargiene et al., 2004; Mehta, 2004). The ISO model captures these effects because it can model arbitrary depolarizing signals. Figure 16B demonstrates this aspect for the special case of an attenuated BP spike. Figure 16C shows that the weight change curves obtained with these different BP spikes differ significantly in shape. This emphasizes the fact that different STDP characteristics are to be expected at different locations on the dendrite or at spines, and this would lead to different computational properties. The differential Hebbian rule we employed leads to the observed results as the consequence of the fact that the derivative of any generic (unimodal) postsynaptic membrane signal (like a BP spike) will lead to a bimodal curve.

9 The more realistic case of a mixed AMPA/NMDA synapse is discussed in the original article (Saudargiene et al., 2004).

Figure 16: Applying the ISO learning algorithm to STDP. (A) Schematic diagram of the model; the membrane potential obeys dV/dt = Σi gi (Ei − V). The inset shows a comparison between an NMDA potential ĝN and a resonator response h1 from the original ISO learning algorithm. (B) BP spikes of different shapes. (C) Weight change curves obtained when using the BP spikes from B as the postsynaptic signals.
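Equation 6.1 can be integrated numerically to reproduce this qualitative behavior. In the sketch below, the NMDA trace ĝN is modeled as a fast-rising, slowly decaying double exponential and the BP spike as a narrow alpha-function pulse; all shapes and time constants are illustrative assumptions rather than the parameters of Saudargiene et al. (2004).

```python
import numpy as np

dt = 0.1                            # integration step (ms)
t = np.arange(0.0, 400.0, dt)       # time axis (ms)

def nmda_trace(s, tau_r=1.0, tau_d=40.0):
    """Fast-rising, slowly decaying NMDA-like conductance trace g_N."""
    return np.where(s >= 0, np.exp(-s / tau_d) - np.exp(-s / tau_r), 0.0)

def bp_spike(s, tau=3.0):
    """Narrow unimodal alpha-function pulse standing in for the BP spike."""
    return np.where(s >= 0, (s / tau) * np.exp(1.0 - s / tau), 0.0)

def weight_change(T, mu=1.0):
    """Total weight change from equation 6.1 for a BP spike that follows
    the presynaptic event by T ms (T < 0: post before pre)."""
    g_n = nmda_trace(t - 150.0)          # presynaptic trace, onset at 150 ms
    v = bp_spike(t - 150.0 - T)          # postsynaptic depolarization
    return mu * np.sum(g_n * np.gradient(v, dt)) * dt
```

Sweeping T over, say, −80 to +80 ms traces out a bimodal weight change curve: negative (LTD) when the BP spike precedes the presynaptic event and positive (LTP) when it follows, with the magnitude decaying for large |T|.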
The relative temporal location of the presynaptic depolarization signal with respect to the positive (or negative) hump of this bimodal curve will then determine whether the product in the learning rule is positive (weight growth) or negative (weight shrinkage). It also follows that differential Hebbian learning turns into plain Hebbian learning, without any change to the learning rule, if the rise of the BP spike is shallow (see Figure 5 in Saudargiene et al., 2004). The ISO model of STDP is a spike-based model derived from a rate-based model (ISO learning). Central to the transfer between the two domains was the realization that at any synapse, pulse coding turns into analog (rate) coding as a consequence of the filtering properties of the membrane, especially the transmitter-receptor interactions. This shows that at the level of a synapse, the distinction between spike-based and rate-based STDP models may no longer be of great importance. 6.2.4 Critique of the Biophysical STDP Models. As opposed to network models, the biophysical ones in principle have the potential to explain the subcellular mechanisms that underlie STDP in more detail. However, at the moment, the level of biophysical realism is still not very high in these models. The model of Senn et al. (2000) is restricted to the kinetics of transmitter release without implementing an explicit Ca2+ mechanism. The others implicitly or explicitly assume that low levels of Ca2+ will lead to LTD, while high levels lead to LTP.
The fact that the Ca2+ gradient also seems to be important is not addressed in most models, and more complex reaction chains are also not yet implemented in the context of STDP.10 The state-variable models (Shouval et al., 2002; Karmarkar & Buonomano, 2002; Karmarkar et al., 2002) were designed to produce a zero crossing (the transition between LTD and LTP) at T = 0. This, however, is not always the case in the measured weight change curves, which show transitions between more LTP- and more LTD-dominated shapes depending on the cell type and the stimulation protocol (Roberts & Bell, 2002). An important advantage over network models is that biophysical models permit different STDP characteristics at different synaptic sites on the same cell. In all of these models, the weight change curve depends on subcellular parameters and on the shape of the potential changes at the synapses (Rao & Sejnowski, 2001; Saudargiene et al., 2004). The question of differently shaped weight change curves is of relevance for aspects of dendritic synaptic development and, consequently, also for dendritic computations. Therefore, some groups have started to address this problem by explicitly parameterizing weight change curves along dendritic structures (Panchev, Wermter, & Chen, 2002; Sterratt & van Ooyen, 2002).
10 The model of Castellani et al. (2001) focuses on bidirectional synaptic plasticity but does not directly relate this to STDP.
6.3 The Time-Gap Problem. In the preceding sections, we have mainly discussed how spike-timing-dependent plasticity can be modeled with correlation-based differential Hebbian learning rules. We started this discussion with a more general account of differential Hebbian learning, first developed by Sutton and Barto (1981) as well as Klopf (1986, 1988) and Kosko (1986). These rules, recently augmented by ISO learning (Porr & Wörgötter, 2002, 2003a), have been used with rather long temporal intervals between CS and US (seconds). The same rules, however, can almost without modification also be used to account for STDP, albeit on a much faster timescale (milliseconds). This is currently probably one of the biggest computational problems in this whole field: How would it be possible to bridge the gap between the timescales of correlation-based synaptic plasticity and those of behavioral learning? It is by now widely accepted that LTP and LTD are strongly involved in processes of learning and memory (Martin, Grimwood, & Morris, 2000). At the moment, there is no straightforward way that would lead from an STDP model to, for example, a model of classical conditioning, and only a few attempts exist in this direction (Mehta, Lee, & Wilson, 2002; Sato & Yamaguchi, 2003; Melamed, Gerstner, Maass, Tsodyks, & Markram, 2004). Maybe the serial compound stimulus representations introduced in section 3.1.3 could be "recycled" in this context, but such a representation seems to be a rather crude approximation of the neuronal interactions in the areas involved. Reverberating synfire chains (Abeles, 1991) would be another possible mechanism to bridge the time gap (Kitano, Okamoto, & Fukai, 2003), but this concept is also heavily debated. Thus, the relationship between fast and slow differential Hebb rules may turn out to be a mere structural one. 7 Discussion: Are Reward-Based and Correlation-Based Learning Indeed Similar?
Part of the description above has shown that neuronal TD models are related to correlation-based approaches when the reward input is treated no differently from any of the other inputs (see Figure 4C). Thus, at least in the open-loop condition, reward-based learning can be equated with correlation-based rules, and the two appear similar. The difference between the two approaches, however, is more subtle and surfaces only when considering closed-loop situations. Reward-based learning requires evaluative feedback from the environment, where someone (an external observer) or some process (the "critic") defines what is rewarding and what is punishing for the learner. Correlation-based learning does not need evaluative feedback. All feedback signals from the environment that reach the learner are totally value free, and only the temporal correlations between the different input signals drive the learning. This difference has been known since the early 1980s (Klopf, 1982, 1988), and reinforcement learning was introduced in order to address some of
the problems that exist with correlation-based methods. First and foremost, it was observed that it is hard to define the boundary conditions for correlation-based methods. Wrong boundary conditions will lead to wrong convergence of learning. In addition, it was deemed impossible to learn conflict solving by means of correlation-based methods (McFarland, 1989). Reinforcement learning seems better suited to address these problems, and hopes were high that RL techniques could be used to address complex learning tasks. Alas, many robotics researchers can now confirm that RL will not solve their learning problems, and the question arises: Why is this the case? 7.1 Evaluative Versus Nonevaluative Feedback: The Credit Structuring Problem. Evaluative feedback, however, is not unproblematic. Since RL methods rely on rewards, we must find a way to define them. So far we have somewhat naively assumed that we know how much reward can be associated with a given state or situation. This, however, requires a process that places the rewards appropriately (the credit structuring problem). 7.1.1 External Evaluative Feedback. Rewards would have to be placed such that learning is efficient; it is not good enough to reward just some final learning outcome (which is normally much easier). For example, the goal "learn to win in chess," hence placing a reward only at the end of a successful game, will lead to unacceptably long convergence times, if learning converges at all.11 Faster convergence can be achieved by associating appropriate rewards with ideally many intermediate configurations of the board.12 This, however, requires an in-depth analysis of the structure of the problem before credit structuring can take place. This analysis may in the end be almost as complex as the learning process itself. Thus, credit structuring is a major challenge, especially in time- and space-continuous control tasks, to which all animal behavioral tasks belong.
In these cases, there are an infinite number of state-action pairs with which rewards or punishments could be associated beforehand, by defining a mapping function from the state-action space to the reward-punishment space: a so-called reinforcement function (Santos & Touzet, 1999a, 1999b). Such procedures for credit structuring can be called external evaluative feedback: an external structure, a "teacher," explicitly provides the rewards. Obviously this situation does not normally apply to animals. 11 TD-Gammon (Tesauro, 1992) is commonly cited as one of the major successes of TD learning. One should, however, keep in mind that it took millions of iterations of self-play to converge. Furthermore, it has been discussed that the structure of backgammon may be especially well suited for TD learning (Pollack & Blair, 1997). 12 In general, it is observed that RL methods often converge very slowly on fine-grained or time-space-continuous problems. Therefore, efforts have been made to accelerate convergence (e.g., Wiering & Schmidhuber, 1998; see Reynolds, 2002, for a discussion).
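A reinforcement function in the above sense can be sketched as a designer-supplied map from state-action pairs to scalar rewards. The grid world, goal state, and reward values below are purely hypothetical illustrations of external credit structuring, not anything from the cited work.

```python
from typing import Callable, Tuple

State = Tuple[int, int]   # e.g., a position in a hypothetical grid world
Action = int              # e.g., one of four possible moves

def make_reinforcement_function(goal: State,
                                step_reward: float = -0.01,
                                goal_reward: float = 1.0
                                ) -> Callable[[State, Action], float]:
    """External evaluative feedback: before learning starts, the designer
    must decide which of the many (state, action) pairs deserve credit."""
    def r(state: State, action: Action) -> float:
        return goal_reward if state == goal else step_reward
    return r
```

The hard part, as argued above, is not writing such a function but structuring it so that learning converges quickly; a flat step_reward gives the learner no guidance anywhere between start and goal.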
7.1.2 Internal Evaluative Feedback. A better strategy may be to assume that the agent/algorithm performs credit structuring on its own. This can be achieved, for example, by associating only very general aversive or attractive properties (like good or bad taste of food, pain or pleasure) with states and letting the agent "experience" them via some sensor inputs. We would call such a procedure internal evaluative feedback: the agent itself provides the value of the rewards. The agent consists of a critic who criticizes the actions of the actor, as discussed in section 4.1. There is, however, a hitch. How does the critic know how to criticize? At least for a constructed artificial agent, someone (the designer) must have provided a frame to allow the critic to do its job. Thus, we still need an external structure (the designer) that explicitly defines the reference frame, the reinforcement function, for what should be rewarding (or punishing). In simple situations, this seems easy; for example, food is rewarding, pain is not. However, as soon as one would like the agent to learn a complex control task with multiple and often conflicting demands (think of robot soccer), setting a "good" frame set to drive the learning becomes nontrivial. We believe this problem is related to the famous frame problem of Dennett (1984), a connection initially pointed out by Klopf (1988) in the context of classical conditioning. In brief, Dennett claims that it is impossible to define a complete frame set (a world model) to guide the behavior of an autonomous agent or an AI system, because the world model of the designer will never match the world as experienced (as "sensed") by the agent. Dennett's arguments concerned behavioral control. However, in trying to control and guide learning (by means of a critic), it may be equally impossible to define such a frame set for the reward structure as soon as the task of the critic is complex enough.
We would like to call this the second-order frame problem and would argue that the strange lack of success of RL methods in complex state-action spaces may be partially due to this problem. 7.1.3 Nonevaluative Feedback. One possible way out of this second-order frame problem is to design a system that operates strictly with nonevaluative feedback, where any external structure (here, the environment) provides only value-free signals. A possible starting point for such a system would be some kind of very generally prespecified homeostasis, and the only goal of the agent would be not to maximize rewards but to minimize disturbances of this homeostasis. This way, external interference by the designer of the agent would be absolutely minimal, because even the initial state could be generated by evolutionary methods, and the agent would in all instances "decide by itself" (i.e., by internal evaluation within its own reference frame) whether something is rewarding or punishing. In principle, it would in this case not even be necessary to make this evaluation explicit in terms of a dedicated reward or punishment signal. Without such signals, learning would have to rely exclusively on correlative input properties, and traditional TD architectures could not be used. Only with secondarily derived
inner reward-related signals, which seem to be represented by the dopamine responses of neurons in the basal ganglia, does TD learning become feasible again. 7.2 Bootstrapping Reward Systems? Compelling evidence exists that the dopaminergic system in the brain is related to reward processing. Thus, the above arguments, by which we criticized essentially all evaluative feedback mechanisms, must be mitigated to some degree. Animals do seem to use evaluative feedback mechanisms successfully for learning. Here, however, the situation is to some degree different. Such reward systems have presumably been bootstrapped by evolution, which has built at least the most basic reward and punishment structures into the "critic," such that even newborn animals can rely on them. Evolutionary mechanisms of variation and selection are indeed well suited to develop value systems. Furthermore, it is conceivable that more complex internal evaluative structures can be developed on top of these early reward mechanisms. This could be done either by correlation-based learning or by learning secondary (derived) value systems from primary ones. Several interesting questions arise here. For example, how finely can such reward systems be structured in animals? Is it conceivable that several "critics" operate in parallel in an animal in order to disentangle conflicting situations? Is it possible to bootstrap reward-based systems by means of value-free correlation-based learning? Thus, in general, from this discussion the question arises: How does the learner (the critic) learn to criticize? 7.3 The Credit Assignment Problem. In a simple summary, one could say that the above discussion has dealt with the "spatial" structure of the world: how and where to place rewards before learning starts. The temporal credit assignment problem, on the other hand, refers to the fact that rewards, especially in fine-grained state-action spaces, can be terribly delayed in time.
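This dilution of a delayed reward can be made concrete with a toy chain of states: under tabular TD(0) with discount factor gamma, a reward n steps away is worth only gamma**n at the start state, and each forward sweep propagates the reward information back by roughly one state. Chain length and parameters below are illustrative.

```python
# Toy illustration of temporal credit assignment: a chain of states with a
# single reward on the final transition. Chain length, discount factor, and
# learning rate are illustrative.
n_states, gamma, alpha = 50, 0.95, 0.5
V = [0.0] * n_states                 # state values; last state is terminal

for sweep in range(1000):
    for s in range(n_states - 1):
        r = 1.0 if s == n_states - 2 else 0.0   # reward only at the very end
        # TD(0) backup: V(s) <- V(s) + alpha * (r + gamma * V(s+1) - V(s))
        V[s] += alpha * (r + gamma * V[s + 1] - V[s])

# After convergence, V[s] = gamma ** (n_states - 2 - s): the reward's
# influence on the start state has been diluted to gamma**48, about 0.085.
```

In sweep-order terms, the start state receives no value at all until the information has traveled back along the whole chain, which is one face of the slow convergence discussed here.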
For example, a robot will normally perform many moves through its state-action space for which immediate punishments or rewards are (almost) zero and for which the more relevant events lie rather distant in the future. As a consequence, such reward signals will only very weakly affect all temporally distant states that preceded them. It is almost as if the influence of a reward gets more and more diluted over time, and this can lead to bad convergence properties of the RL mechanism. Many steps must be performed by any iterative RL algorithm to propagate the influence of delayed reinforcement to all states and actions that have an effect on that reinforcement (Anderson & Crawford-Hines, 1994). In addition, we observe a somewhat paradoxical situation for credit assignment that arises during learning: if an agent successfully avoids a punishment, then it will receive no punishment signal. A "no-signal" situation, however, normally also occurs for every "boring" state-action pair where the machine is far away from a source of reward or punishment. Thus, "no-signal" (or, more accurately,
"faint-signal") states prevail throughout most of an agent's lifetime, and it is almost impossible to tell whether such a state has come about through successful punishment avoidance or, more likely, just by chance. 7.4 Maximizing Rewards or Minimizing Disturbances? Speculations About Learning. Return maximization is the central paradigm of reinforcement learning, and it gives rise to the problems of credit structuring and credit assignment. Most actor-critic architectures have also been implemented adhering to this paradigm. The architecture in Figure 9 suggests an alternative in which actor and critic are merged. It starts with a negative feedback loop (see Figure 7A), which creates a stable starting condition: whenever a deviation from the desired state (set point) occurs, the feedback loop will try to correct it. This property is then also used to guide the learning by means of a minimization instead of a maximization principle: the learning goal is to minimize deviations from the desired state, that is, to minimize disturbances of the homeostasis of the feedback loop. These two paradigms are in most cases not equivalent. Most often, one finds that maximal return is associated with a single point (or a few points) on the decision surface. Minimal disturbance, however, will cover a dense manifold of points, all of which represent solutions of the learning problem. Thus, correlation-based temporal sequence learning (see Figure 9) offers the advantage that credit structuring takes place without effort: every signal that enters the reflex loop will drive the learning, and if there is no signal, the situation is stable (desirable), at least for the moment. Rewards do not exist in this scheme. Credit assignment during learning also ceases to be a problem.
Since we do not attempt to seek the maximal return and are instead satisfied with any disturbance-free solution, we can live without credits almost everywhere on the decision surface and let the agent choose its own path until a (rare) disturbance happens. Early observations pointed to the fact that pure homeostatic mechanisms will not develop "interesting" or even intelligent behavior (Ashby, 1952). This certainly holds for pure feedback loop control as in Figure 7A. However, as shown above, the stereotyped reflex behavior of such a homeostat can be augmented by anticipatory correlation-based learning adding a second loop (see Figure 10). As long as there are anticipatory signals, we can continue to pile loops on top of loops, creating a complex subsumption architecture (Porr & Wörgötter, 2003c). However, only a rather limited number of causally related anticipatory sensor events come from the environment. A chain of three such signals could be: sound may precede the smell of the prey, which in turn precedes taste. As a consequence, the depth of a subsumption architecture that relies exclusively on sensor events will remain rather limited. In principle, however, nothing prevents us from using intrinsic signals ("thoughts") for the same purpose. It does not matter to the learner where an anticipatory signal comes from, so this type of correlation-based learning should also work with internal
300
F. Wörgötter and B. Porr
signals. Naively, if I think this and a certain event almost always follows, then this "thought" should be strengthened. This is certainly highly speculative, but we would like to add that through such a process of anticipatory correlation-based learning, "values" can actually be developed: anticipatory learning carries the implicit semantics that "earlier is better." This could be a natural starting point for developing more complex value (hence "reward") structures without having to assume external interference.

8 Conclusion

This article provides an overview of the field of temporal sequence learning. Two problems were mainly discussed: how to match neuronal models to physiology and how to design biologically plausible control methods, possibly using such algorithms. We believe that there are three main questions that need to be addressed in the future:

1. How can we bridge the temporal gap between STDP (or LTP, LTD) and behavioral learning?

2. When do animals (humans) follow the return-maximization principle? Do they sometimes (often?) adopt different strategies, like disturbance minimization? And how do they decide when to use the different strategies?

3. How can we implement (evolve, develop) nonevaluative feedback, or strictly internal evaluative feedback, without adopting the perspective of an external observer, to arrive at strictly agent-driven prediction and control?

Without answering the first question, it will be impossible to explain animal learning by means of biophysical network models of brain function. Without addressing the second set of questions, control agents may remain too limited and inflexible in their behavioral and decision-making potential. Without addressing the last question, we will only ever design control agents that follow our own intentions (or at least our own frame of reference).
Our reference frame depends on our own internal states; thus, it can never accurately match that of any other agent, which receives different inputs and performs different internal processing. As a consequence, such an agent is unavoidably doomed to fail when left to its own (really our!) devices in its own (really its!) world. It will fail to be autonomous.

Appendix

This appendix provides the background on the use of eligibility traces in the TD formalism of reinforcement learning. In this way, it becomes clear that the concept of eligibility traces has penetrated different algorithms and that the backward TD(λ) algorithm contains a term that can be used to define a
correlative process. In this way, backward TD(λ) can, if desired, be related to Hebbian learning.

A.1 The TD Formalism for the Prediction Problem in Reinforcement Learning. We consider a sequence s_t, r_{t+1}, s_{t+1}, r_{t+2}, ..., r_T, s_T. Note that rewards occur downstream (in the future) of a visited state.^13 Thus, r_{t+1} is the next future reward that can be reached starting from state s_t. For simplicity, actions are not denoted here. The complete return R_t, to be expected in the future from state s_t, is thus given by

R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots + \gamma^{T-t-1} r_T,   (A.1)
where \gamma \le 1 is the discount factor. Reinforcement learning assumes that the value V(s) of a state equals the expected return starting from that state, where \pi denotes the (here unspecified) action policy to be followed:

V(s) = E_\pi \{ R_t \mid s_t = s \}.   (A.2)
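For illustration, the complete return of equation A.1 can be evaluated by accumulating the rewards backward through the sequence. The following sketch is not from the article; the function name and the example rewards are our own:

```python
def discounted_return(rewards, gamma=0.9):
    """Complete return R_t = r_{t+1} + gamma*r_{t+2} + ... (equation A.1),
    given the list of future rewards [r_{t+1}, ..., r_T]."""
    R = 0.0
    # Accumulate from the last reward backward: R <- r + gamma * R.
    for r in reversed(rewards):
        R = r + gamma * R
    return R

# Example with three future rewards and gamma = 0.5:
# R_t = 1 + 0.5*0 + 0.25*4 = 2.0
print(discounted_return([1.0, 0.0, 4.0], gamma=0.5))  # → 2.0
```

Averaging such sampled returns over many trials starting in the same state then yields a Monte Carlo estimate of V(s) in the sense of equation A.2.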
Thus, the value of state s_t can be iteratively updated with

V(s_t) \leftarrow V(s_t) + \alpha [R_t - V(s_t)].   (A.3)
The left side of this equation is the new value of s_t, computed by holding the complete return R_t against the old value of s_t on the right side. We use \alpha as a step-size parameter; it is not of great importance here and can be held constant. Note that if V(s_t) correctly predicts the expected complete return R_t, the update is zero, and we have found the final value. This method is called the constant-\alpha Monte Carlo update. It requires waiting until a sequence has reached its terminal state before the update can commence. For long sequences, this may be problematic. Thus, one should try to use an incremental procedure instead. We define a different update rule with

V(s_t) \leftarrow V(s_t) + \alpha [r_{t+1} + \gamma V(s_{t+1}) - V(s_t)].   (A.4)
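The two update rules A.3 and A.4 differ only in the target toward which V(s_t) is moved: the complete return versus the bootstrapped one-step target. A minimal sketch (the value table, state names, and numbers are illustrative, not from the article):

```python
def mc_update(V, s, R, alpha=0.1):
    """Constant-alpha Monte Carlo update (A.3):
    move V(s) toward the complete return R."""
    V[s] += alpha * (R - V[s])

def td0_update(V, s, r_next, s_next, alpha=0.1, gamma=0.9):
    """TD(0) update (A.4): move V(s) toward the
    bootstrapped target r_{t+1} + gamma * V(s_{t+1})."""
    V[s] += alpha * (r_next + gamma * V[s_next] - V[s])

V = {"A": 0.0, "B": 1.0}
td0_update(V, "A", r_next=0.5, s_next="B", alpha=0.5, gamma=1.0)
# target = 0.5 + 1.0, so V["A"] = 0 + 0.5 * (1.5 - 0) = 0.75
print(V["A"])  # → 0.75
```

Note that `td0_update` can be applied after every single transition, whereas `mc_update` needs the full return and hence the end of the sequence.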
The elegant trick is to assume that, if the process converges, the value V(s_{t+1}) of the next state should be an accurate estimate of the expected return downstream of this state (i.e., downstream of s_{t+1}). Thus, we would hope that

R_t = r_{t+1} + \gamma V(s_{t+1})   (A.5)
13 We use a slightly simplified version of the notation conventions of Sutton and Barto (1998); this book should be consulted for details.
holds. Indeed, proofs exist (Sutton, 1988) that, under certain boundary conditions, this procedure, known as TD(0), converges to the optimal value function for all states. In principle, the same procedure can be applied all the way downstream, writing

R_t^{(n)} = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^n V(s_{t+n}).   (A.6)
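The n-step return of equation A.6 truncates the reward sum after n steps and replaces the remainder by the current value estimate. A small sketch (all numbers are illustrative):

```python
def n_step_return(rewards, V_bootstrap, n, gamma=0.9):
    """R_t^{(n)} of equation A.6: the first n discounted rewards
    plus gamma^n times the value estimate V(s_{t+n})."""
    # Truncated discounted reward sum: r_{t+1} + ... + gamma^{n-1} r_{t+n}.
    R = sum(gamma**k * rewards[k] for k in range(n))
    # Bootstrap the remaining (unobserved) tail with the value estimate.
    return R + gamma**n * V_bootstrap

# n = 2, gamma = 0.5: 1 + 0.5*2 + 0.25*8 = 4.0
print(n_step_return([1.0, 2.0], V_bootstrap=8.0, n=2, gamma=0.5))  # → 4.0
```

For n = 1 this reduces to the TD(0) target of equation A.4; as n approaches the sequence length, it approaches the complete return of equation A.1.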
Thus, we could update the value of state s_t by moving downstream to some future state s_{t+n-1}, accumulating all rewards along the way, including the last future reward r_{t+n}, and then approximating the missing bit up to the terminal state by the estimated value V(s_{t+n}) of state s_{t+n}. Furthermore, we can even take different such update rules and average their results in the following way:

R_t^{\lambda} = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} R_t^{(n)},   (A.7)

V(s_t) \leftarrow V(s_t) + \alpha [R_t^{\lambda} - V(s_t)],   (A.8)
where 0 \le \lambda \le 1. This is the most general formalism for a TD rule, known as the forward TD(λ) algorithm (Sutton, 1988), where we assume an infinitely long sequence. Convergence proofs are given by Dayan (1992), Peng (1993), and Dayan and Sejnowski (1994). The disadvantage of this formalism is still that, for all λ > 0, we have to wait until the terminal state has been reached before the update of the value of state s_t can commence. This problem can be overcome by introducing so-called eligibility traces, reintroduced in the context of RL by Watkins (1989) and Jaakkola, Jordan, and Singh (1995). Let us assume that we came from state A and are currently visiting state B of an MDP. B's value can be updated by the TD(0) rule after we have moved on by only a single step to, say, state C. We define the incremental update as before as

\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t).   (A.9)
Normally we would assign a new value to state B by performing V(s_B) \leftarrow V(s_B) + \alpha \delta_B, not considering any other previously visited states. In using eligibility traces, we do something different and assign new values to all previously visited states, making sure that changes at states long in the past are much smaller than those at states visited just recently. To this end, we define the eligibility trace of a state as

x_t(s) = \begin{cases} \gamma \lambda \, x_{t-1}(s) & \text{if } s \neq s_t \\ \gamma \lambda \, x_{t-1}(s) + 1 & \text{if } s = s_t. \end{cases}   (A.10)
Thus, the eligibility trace of the currently visited state is incremented by one, while the eligibility traces of all states decay by a factor of \gamma \lambda. Instead of updating just the most recently left state s_t, we now loop through all states visited earlier in this trial that still have an eligibility trace larger than zero and update them according to

V(s) \leftarrow V(s) + \alpha \delta_t x_t(s).   (A.11)
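Equations A.9 to A.11 together define one online sweep of backward TD(λ): compute the TD error, decay all traces, bump the trace of the visited state, and then update every eligible state. A minimal sketch (state names, rewards, and parameter values are illustrative, not from the article):

```python
def backward_td_lambda(episode, V, alpha=0.1, gamma=0.9, lam=0.8):
    """Online backward TD(lambda): one pass over a list of
    (s, r_next, s_next) transitions, updating all eligible
    states after each step (equations A.9-A.11)."""
    x = {s: 0.0 for s in V}                      # eligibility traces
    for s, r_next, s_next in episode:
        delta = r_next + gamma * V[s_next] - V[s]  # TD error (A.9)
        for u in x:                              # decay all traces (A.10) ...
            x[u] *= gamma * lam
        x[s] += 1.0                              # ... and bump the visited state
        for u in x:                              # update every eligible state (A.11)
            V[u] += alpha * delta * x[u]
    return V

# A -> B -> C with a single reward on the last transition:
V = {"A": 0.0, "B": 0.0, "C": 0.0}
backward_td_lambda([("A", 0.0, "B"), ("B", 1.0, "C")],
                   V, alpha=0.5, gamma=1.0, lam=0.5)
print(V)  # → {'A': 0.25, 'B': 0.5, 'C': 0.0}
```

As the text describes, state A, visited one step earlier, receives a correction scaled down by its decayed trace (here by \gamma\lambda = 0.5) relative to the most recently left state B.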
In our example, we will thus also update the value of state A by V(s_A) \leftarrow V(s_A) + \alpha \delta_B x_B(A). This means we take the TD error \delta_B from the state transition B \to C (see equation A.9), weight it with the current numerical value x_B(A) of the eligibility trace of state A, and use it to correct the value of state A "a little bit." This procedure always requires only a single newly computed TD error per step, using the computationally very cheap TD(0) rule, and all updates can be performed online while moving through the state space, without having to wait for the terminal state. The whole procedure is known as the backward TD(λ) algorithm, and it can be shown to be mathematically equivalent to the forward TD(λ) algorithm described above (Sutton & Barto, 1998). All of these algorithms were designed for time- and space-discrete problems (MDPs). A continuous approach has been described by Doya (2000).

Acknowledgments

This work was supported by the SHEFC INCITE and the EU ECOVISON grants. We thank P. Dayan, A. Saudargiene, and L. Smith for helpful discussions.

References

Abarbanel, H. D. I., Gibb, L., Huerta, R., & Rabinovich, M. I. (2003). Biophysical model of synaptic plasticity dynamics. Biol. Cybern., 89(3), 214–226. Abarbanel, H. D. I., Huerta, R., & Rabinovich, M. I. (2002). Dynamical model of long-term synaptic plasticity. Proc. Natl. Acad. Sci. (USA), 99(15), 10132–10137. Abbott, L. F., & Blum, K. I. (1996). Functional significance of long-term potentiation for sequence learning and prediction. Cereb. Cortex, 6, 406–416. Abbott, L. F., & Song, S. (1999). Temporal asymmetric Hebbian learning, spike timing and neuronal response variability. In M. S. Kearns, S. Solla, & D. A. Cohn (Eds.), Advances in neural information processing systems (pp. 69–75). Cambridge, MA: MIT Press. Abeles, M. (1991). Corticonics: Neural circuits of the cerebral cortex. Cambridge: Cambridge University Press. Akopian, G., Musleh, W., Smith, R., & Walsh, J. P. (2000). 
Functional state of corticostriatal synapses determines their expression of short- and long-term plasticity. Synapse, 38, 271–280.
Altar, C. A., Boyar, W. C., & Kim, H. S. (1990). Discriminatory roles for D1 and D2 dopamine receptor subtypes in the in vivo control of neostriatal cyclic GMP. Eur. J. Pharmacol., 181, 17–21. Anderson, C., & Crawford-Hines, S. (1994). Multigrid q-learning (Tech. Rep. No. CS-94-121). Fort Collins: Colorado State University. Artola, A., & Singer, W. (1993). Long-term depression of excitatory synaptic transmission and its relationship to long-term potentiation. Trends Neurosci., 16, 480–487. Ashby, W. R. (1952). Design for a brain. London: Chapman & Hall. Balkenius, C., & Moren, J. (1998). Computational models of classical conditioning: A comparative study (LUCS 62, ISSN 1101-8453). Lund, Sweden: Lund University Cognitive Studies. Barto, A. (1995). Adaptive critics and the basal ganglia. In J. C. Houk, J. L. Davis, & D. G. Beiser (Eds.), Models of information processing in the basal ganglia (pp. 215–232). Cambridge, MA: MIT Press. Barto, A. G., Sutton, R. S., & Anderson, C. W. (1983). Neuronlike elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, 13, 835–846. Bell, C. C., Han, V. Z., Sugawara, Y., & Grant, K. (1997). Synaptic plasticity in a cerebellum-like structure depends on temporal order. Nature, 387, 278–281. Bellman, R. E. (1957). Dynamic programming. Princeton, NJ: Princeton University Press. Bennett, M. R. (2000). The concept of long term potentiation of transmission at synapses. Prog. Neurobiol., 60, 109–137. Berger, B., Trottier, S., Verney, C., Gaspar, P., & Alvarez, C. (1988). Regional and laminar distribution of the dopamine and serotonin innervation in the macaque cerebral cortex: A radioautographic study. J. Comp. Neurol., 273, 99–119. Berns, G. S., McClure, S. M., Pagnoni, G., & Montague, P. R. (2001). Predictability modulates human brain responses to reward. J. Neurosci., 21, 2793–2798. Berns, G. S., & Sejnowski, T. J. (1998). 
A computational model of how the basal ganglia produce sequences. J. Cogn. Neurosci., 10 (1), 108–121. Berridge, K. C., & Robinson, T. E. (1998). What is the role of dopamine in reward: Hedonic impact, reward learning or incentive salience? Brain Res. Rev., 28, 309–369. Berridge, M. J. (1998). Neuronal calcium signaling. Neuron, 21, 13–26. Bi, G. Q. (2002). Spatiotemporal specificity of synaptic plasticity: Cellular rules and mechanisms. Biol. Cybern., 87, 319–332. Bi, G. Q., & Poo, M. M. (1998). Synaptic modifications in cultured hippocampal neurons: Dependence on spike timing, synaptic strength, and postsynaptic cell type. J. Neurosci., 18, 10464–10472. Bi, G.-Q., & Poo, M. (2001). Synaptic modification by correlated activity: Hebb’s postulate revisited. Annu. Rev. Neurosci., 24, 139–166. Bibb, J. A., Snyder, G. L., Nishi, A., Yan, Z., Meijer, L., Fienberg, A. A., Tsai, L. H., Kwon, Y. T., Girault, J. A., Czernik, A. J., Huganir, R. L., Hemmings, H. C., Nairn, A. C., and Greengard, P. (1999). Phosphorylation of
DARPP-32 by Cdk5 modulates dopamine signalling in neurons. Nature, 402, 669–671. Bienenstock, E., Cooper, L., & Munro, P. (1982). Theory for the development of neuron selectivity: Orientation specificity and binocular interaction in visual cortex. J. Neurosci., 2(1), 32–48. Bliss, T. V., & Gardner-Medwin, A. R. (1973). Long-lasting potentiation of synaptic transmission in the dentate area of the unanaesthetized rabbit following stimulation of the perforant path. J. Physiol. (Lond.), 232, 357–374. Bliss, T. V., & Lomo, T. (1970). Plasticity in a monosynaptic cortical pathway. J. Physiol. (Lond.), 207, 61P. Bliss, T. V., & Lomo, T. (1973). Long-lasting potentiation of synaptic transmission in the dentate area of the anaesthetized rabbit following stimulation of the perforant path. J. Physiol. (Lond.), 232, 331–356. Breiter, H. C., Aharon, I., Kahneman, D., Dale, A., & Shizgal, P. (2001). Functional imaging of neural responses to expectancy and experience of monetary gains and losses. Neuron, 30, 619–639. Brown, J., Bullock, D., & Grossberg, S. (1999). How the basal ganglia use parallel excitatory and inhibitory learning pathways to selectively respond to unexpected rewarding cues. J. Neurosci., 19(23), 10502–10511. Brown, J. R., & Arbuthnott, G. W. (1983). The electrophysiology of dopamine D2 receptors: A study of the actions of dopamine on corticostriatal transmission. Neurosci., 10, 349–355. Buchs, N. J., & Senn, W. (2002). Spike-based synaptic plasticity and the emergence of direction selective simple cells: Simulation results. J. Comput. Neurosci., 13, 167–186. Bunney, B. S., Chiodo, L. A., & Grace, A. A. (1991). Midbrain dopamine system electrophysiological functioning: A review and new hypothesis. Synapse, 9, 79–94. Buzsaki, G., Penttonen, M., Nadasdy, Z., & Bragin, A. (1996). Pattern and inhibition-dependent invasion of pyramidal cell dendrites by fast spikes in the hippocampus in vivo. Proc. Natl. Acad. Sci. USA, 93, 9921–9925. 
Calabresi, P., Centonze, D., Gubellini, P., Marfia, G. A., & Bernardi, G. (1999). Glutamate-triggered events inducing corticostriatal long-term depression. J. Neurosci., 19, 6102–6110. Calabresi, P., Fedele, E., Pisani, A., Fontana, G., Mercuri, N. B., Bernardi, G., & Raiteri, M. (1995). Transmitter release associated with long-term synaptic depression in rat corticostriatal slices. European J. Neurosci., 7, 1889–1894. Calabresi, P., Gubellini, P., Centonze, D., Picconi, B., Bernardi, G., Chergui, K., Svenningsson, P., Fienberg, A. A., & Greengard, P. (2000). Dopamine and cAMP-regulated phosphoprotein 32 kDa controls both striatal long-term depression and long-term potentiation opposing forms of synaptic plasticity. J. Neurosci., 20(22), 8443–8451. Calabresi, P., Maj, R., Mercuri, N. B., & Bernardi, G. (1992a). Coactivation of D1 and D2 dopamine receptors is required for long-term synaptic depression in the striatum. Neurosci. Letters, 142, 95–99.
Calabresi, P., Maj, R., Pisani, A., Mercuri, N. B., & Bernardi, G. (1992b). Longterm synaptic depression in the striatum: Physiological and pharmacological characterization. J. Neurosci., 12, 4224–4233. Calabresi, P., Mercuri, N., Stanzione, P., Stefani, A., & Bernardi, G. (1987). Intracellular studies on the dopamine-induced firing inhibition of neostriatal neurons in vitro: Evidence for D1 receptor involvement. Neurosci., 20, 757– 771. Calabresi, P., Pisani, A., Mercuri, N. B., & Bernardi, G. (1992c). Long-term potentiation in striatum is unmasked by removing the voltage-dependent magnesium block of NMDA receptor channels. European J. Neurosci., 4, 929–935. Calabresi, P., Pisani, A., Mercuri, N. B., & Bernardi, G. (1996). The corticostriatal projection: From synaptic plasticity to dysfunctions of the basal ganglia. Trends Neurosci., 19, 19–24. Cannon, C. M., & Palmiter, R. D. (2003). Reward without dopamine. J. Neurosci, 23(34), 10827–10831. Carpenter, G. A., & Grossberg, S. (1987). ART-2: Self-organization of stable category recognition codes for analog input pattern. Applied Optics, 26, 4919– 4930. Castellani, G. C., Quinlan, E. M., Cooper, L. N., & Shouval, H. Z. (2001). A biophysical model of bidirectional synaptic plasticity: Dependence on AMPA and NMDA receptors. Proc. Natl. Acad. Sci. (USA), 98(22), 12772–12777. Cateau, H., Kitano, K., & Fukai, T. (2002). An accurate and widely applicable method to determine the distribution of synaptic strengths formed by the spike-timing-dependent learning. Neurocomputing, 44(46), 343–351. Centonze, D., Gubellini, P., Picconi, B., Calabresi, P., Giacomini, P., & Bernardi, G. (1999). Unilateral dopamine denervation blocks corticostriatal LTP. J. Neurophysiol., 82, 3575–3579. Centonze, D., Picconi, B., Gubellini, P., Bernardi, G., & Calabresi, P. (2001). Dopaminergic control of synaptic plasticity in the dorsal striatum. European J. Neurosci., 13, 1071–1077. Charpier, S., & Deniau, J. M. (1997). 
In vivo activity-dependent plasticity at cortico-striatal connections: Evidence for physiological long-term potentiation. Proc. Natl. Acad. Sci. USA, 94, 7036–7040. Charpier, S., Mahon, S., & Deniau, J. M. (1999). In vivo induction of striatal long-term potentiation by low-frequency stimulation of the cerebral cortex. Neurosci., 91, 1209–1222. Choi, S., & Lovinger, D. M. (1997). Decreased probability of neuro-transmitter release underlies long-term depression and postnatal development of corticostriatal synapses. Proc. Natl. Acad. Sci. USA, 94, 2665–2670. Contreras-Vidal, J. L., & Schultz, W. (1999). A predictive reinforcement model of dopamine neurons for learning approach behavior. J. Comput. Neurosci., 6, 191–214. Daw, N. D. (2003). Reinforcement learning models of the dopamine system and their behavioral implications. Unpublished doctoral dissertation, Carnegie Mellon University, Pittsburgh, PA. Dayan, P. (1992). The convergence of TD(λ). Mach. Learn., 8(3/4), 341–362. Dayan, P. (2002). Matters temporal. Trends Cogn. Sci., 6(3), 105–106.
Dayan, P., & Balleine, B. W. (2002). Reward, motivation, and reinforcement learning. Neuron, 36(2), 285–298. Dayan, P., Kakade, S., & Montague, P. R. (2000). Learning and selective attention. Nature Neurosci., 3, 1218–1223. Dayan, P., & Sejnowski, T. (1994). TD(λ) converges with probability 1. Mach. Learn., 14, 295–301. Debanne, D., Gähwiler, B. T., & Thompson, S. H. (1994). Asynchronous pre- and postsynaptic activity induces associative long-term depression in area CA1 of the rat hippocampus in vitro. Proc. Natl. Acad. Sci. (USA), 91, 1148–1152. Debanne, D., Gähwiler, B., & Thompson, S. (1998). Long-term synaptic plasticity between pairs of individual CA3 pyramidal cells in rat hippocampal slice cultures. J. Physiol. (Lond.), 507, 237–247. Deisseroth, K., Heist, E. K., & Tsien, R. W. (1998). Translocation of calmodulin to the nucleus supports CREB phosphorylation in hippocampal neurons. Nature, 392, 198–202. De Koninck, P., & Schulman, H. (1998). Sensitivity of CaM kinase II to the frequency of Ca2+ oscillations. Science, 279, 227–230. Delgado, M. R., Nystrom, L. E., Fissell, C., Noll, D. C., & Fiez, J. A. (2000). Tracking the hemodynamic responses to reward and punishment in the striatum. J. Neurophysiol., 84, 3072–3077. Dennett, D. C. (1984). Cognitive wheels: The frame problem of AI. In C. Hookway (Ed.), Minds, machines and evolution (pp. 129–151). Cambridge: Cambridge University Press. Dolmetsch, R. E., Pajvani, U., Fife, K., Spotts, J. M., & Greenberg, M. E. (2001). Signaling to the nucleus by an L-type calcium channel-calmodulin complex through the MAP kinase pathway. Science, 294, 333–339. Doya, K. (2000). Reinforcement learning in continuous time and space. Neural Comp., 12(1), 219–245. Egger, V., Feldmeyer, D., & Sakmann, B. (1999). Coincidence detection and changes of synaptic efficacy in spiny stellate neurons in rat barrel cortex. Nature Neurosci., 2, 1098–1105. Elliott, R., Friston, K. J., & Dolan, R. J. (2000). 
Dissociable neural responses in human reward systems. J. Neurosci., 20, 6159–6165. Eurich, C. W., Pawelzik, K., Ernst, U., Cowan, J. D., & Milton, J. G. (1999). Dynamics of self-organized delay adaptation. Phys. Rev. Lett., 82, 1594–1597. Feldman, D. (2000). Timing-based LTP and LTD at vertical inputs to layer II/III pyramidal cells in rat barrel cortex. Neuron, 27, 45–56. Froemke, R. C., & Dan, Y. (2002). Spike-timing-dependent synaptic modification induced by natural spike trains. Nature, 416, 433–438. Fusi, S. (2002). Hebbian spike-driven synaptic plasticity for learning patterns of mean firing rates. Biol. Cybern., 87, 459–470. Futatsugi, A., Kato, K., Ogura, H., Li, S. T., Nagata, E., Kuwajima, G., Tanaka, K., Itohara, S., & Mikoshiba, K. (1999). Facilitation of NMDAR-independent LTP and spatial learning in mutant mice lacking ryanodine receptor type 3. Neuron, 24, 701–713. Gerfen, C. R. (1984). The neostriatal mosaic: Compartmentalization of corticostriatal input and striatonigral output systems. Nature, 311, 461–464.
Gerfen, C. R. (1985). The neostriatal mosaic. I. Compartmental organization of projections from the striatum to the substantia nigra in the rat. J. Comp. Neurol., 236, 454–476. Gerfen, C. R. (1992). The neostriatal mosaic: Multiple levels of compartmental organization in the basal ganglia. Annu. Rev. Neurosci., 15, 285–320. Gerfen, C. R., Herkenham, M., & Thibault, J. (1987). The neostriatal mosaic: II. Patch- and matrix- directed mesostriatal dopaminergic and nondopaminergic systems. J. Neurosci., 7, 3915–3934. Gerstner, W., Kempter, R., van Hemmen, J. L., & Wagner, H. (1996). A neuronal learning rule for sub-millisecond temporal coding. Nature, 383, 76–78. Gerstner, W., & Kistler, W. M. (2002). Mathematical formulations of Hebbian learning. Biol. Cybern., 87, 404–415. Goldberg, J., Holthoff, K., & Yuste, R. (2002). A problem with Hebb and local spikes. Trends Neurosci., 25(9), 433–435. Golding, N., Kath, W. L., & Spruston, N. (2001). Dichotomy of action-potential backpropagation in CA1 pyramidal neuron dendrites. J. Neurophysiol., 86, 2998–3010. Golding, N. L., Staff, P. N., & Spruston, N. (2002). Dendritic spikes as a mechanism for cooperative long-term potentiation. Nature, 418, 326–331. Gormezano, I., Kehoe, E. J., & Marshall, B. S. (1983). Twenty years of classical conditioning research with the rabbit. In J. M. Sprague, & A. N. Epstein, (Eds.), Progress of psychobiology and physiological psychology (pp. 198–274). Orlando, FL: Academic Press. Graef, I. A., Mermelstein, P. G., Stankunas, K., Neilson, J. R., Deisseroth, K., Tsien, R. W., & Crabtree, G. R. (1999). L-type calcium channels and GSK-3 regulate the activity of NF-ATc4 in hippocampal neurons. Nature, 401, 703– 708. Greengard, P., Allen, P. B., & Nairn, A. C. (1999). Beyond the dopamine receptor: The DARPP-32/protein phosphatase-1 cascade. Neuron, 23, 435–447. Groenewegen, H. J., Berendse, H. W., Wolters, J. G., & Lohman, A. H. M. (1990). 
The anatomical relationship of the prefrontal cortex with the striatopallidal system, the thalamus and the amygdala: Evidence for a parallel organization. Prog. Brain Res., 85, 95–118. Grossberg, S. (1995). A spectral network model of pitch perception. J. Acoust. Soc. Am., 98(2), 862–879. Grossberg, S., & Merrill, J. (1996). The hippocampus and cerebellum in adaptively timed learning, recognition and movement. J. Cogn. Neurosci., 8, 257– 277. Grossberg, S., & Schmajuk, N. (1989). Neural dynamics of adaptive timing and temporal discrimination during associative learning. Neural Networks, 2, 79– 102. Groves, P. M., Garcia-Munoz, M., Linder, J. C., Manley, M. S., Martone, M. E., & Young, S. J. (1995). Elements of the intrinsic organization and information processing in the neostriatum. In J. C. Houk, J. L. Davis, & D. G., Beiser, (Eds.), Models of information processing in the basal ganglia (pp. 51–96). Cambridge, MA: MIT Press.
Gruber, A. J., Solla, S. A., Surmeier, D. J., & Houk, J. C. (2003). Modulation of striatal single units by expected reward: A spiny neuron model displaying dopamine-induced bistability. J. Neurophysiol., 90, 1095–1114. Gustafsson, B., Wigström, H., Abraham, W. C., & Huang, Y.-Y. (1987). Long-term potentiation in the hippocampus using depolarizing current pulses as the conditioning stimulus to single volley synaptic potentials. J. Neurosci., 7, 774–780. Gütig, R., Aharonov, R., Rotter, S., & Sompolinsky, H. (2003). Learning input correlations through nonlinear temporally asymmetric Hebbian plasticity. J. Neurosci., 23(9), 3697–3714. Haber, S. N., Fudge, J. L., & McFarland, N. R. (2000). Striatonigrostriatal pathways in primates form an ascending spiral from the shell to the dorsolateral striatum. J. Neurosci., 20, 2369–2382. Han, V. Z., Grant, K., & Bell, C. C. (2000). Reversible associative depression and nonassociative potentiation at a parallel fiber synapse. Neuron, 27, 611–622. Hassani, O. K., Cromwell, H. C., & Schultz, W. (2001). Influence of expectation of different rewards on behavior-related neuronal activity in the striatum. J. Neurophysiol., 85, 2477–2489. Hauber, W., Bohn, I., & Giertler, C. (2000). NMDA, but not dopamine D2, receptors in the rat nucleus accumbens are involved in guidance of instrumental behavior by stimuli predicting reward magnitude. J. Neurosci., 20(16), 6282–6288. Häusser, M., Spruston, N., & Stuart, G. J. (2000). Diversity and dynamics of dendritic signaling. Science, 290, 739–744. Hebb, D. O. (1949). The organization of behavior: A neuropsychological study. New York: Wiley-Interscience. Hemmings, H. C., Jr., Greengard, P., Tung, H. Y., & Cohen, P. (1984). DARPP-32, a dopamine-regulated neuronal phosphoprotein, is a potent inhibitor of protein phosphatase-1. Nature, 310, 503–505. Hernandez-Lopez, S., Bargas, J., Surmeier, D. J., Reyes, A., & Galarraga, E. (1997). 
D1 receptor activation enhances evoked discharge in neostriatal medium spiny neurons by modulating an L-type Ca2 + conductance. J. Neurosci., 17, 3334–3342. Hoffman, D. A., Magee, J. C., Colbert, C. M., & Johnston, D. (1997). K+ channel regulation of signal propagation in dendrites of hippocampal pyramidal neurons. Nature, 387, 869–875. Hollerman, J. R., & Schultz, W. (1998). Dopamine neurons report an error in the temporal prediction of reward during learning. Nature Neurosci., 1(4), 304–309. Hollerman, J. R., Tremblay, L., & Schultz, W. (1998). Influence of reward expectation on behavior-related neuronal activity in primate striatum. J. Neurophysiol., 80, 947–963. Houk, J. C., Adams, J. L., & Barto, A. G. (1995). A model of how the basal ganglia generate and use neural signals that predict reinforcement. In J. C. Houk, J. L. Davis, & D. G., Beiser (Eds.), Models of information processing in the basal ganglia (pp. 249–270). Cambridge, MA: MIT Press.
Hull, C. L. (1939). The problem of stimulus equivalence in behavior theory. Psychological Review, 46, 9–30. Hull, C. L. (1943). Principles of behavior. New York: Appleton-Century-Crofts. Izhikevich, E. M., & Desai, N. S. (2003). Relating STDP to BCM. Neural Comp., 15, 1511–1523. Jaakkola, T., Jordan, M. I., & Singh, S. P. (1995). On the convergence of stochastic iterative dynamic programming algorithms. Neural Comp., 6(6), 1185–1201. Joel, D., Niv, Y., & Ruppin, E. (2002). Actor-critic models of the basal ganglia: New anatomical and computational perspectives. Neural Networks, 15, 535– 547. Joel, D., & Weiner, I. (2000). The connections of the dopaminergic system with the striatum in rats and primates: An analysis with respect to the functional and compartmental organization of the striatum. Neurosci., 96, 451–474. Johnston, D., Magee, J. C., Colbert, C. M., & Cristie, B. R. (1996). Active properties of neuronal dendrites. Annu. Rev. Neurosci., 19, 165–186. Joyce, J. N., & Marshall, J. F. (1987). Quantitative autoradiography of dopamine D2 sites in rat caudate-putamen: Localization to intrinsic neurons and not to neocortical afferents. Neurosci., 20, 773–795. Kaelbling, L. P., Littman, M. L., & Moore, A. W. (1996). Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4, 237–285. Karmarkar, U. R., & Buonomano, D. V. (2002). A model of spike-timing dependent plasticity: One or two coincidence detectors? J. Neurophysiol., 88, 507–513. Karmarkar, U. R., Najarian, M. T., & Buonomano, D. V. (2002). Mechanisms and significance of spike-timing dependent plasticity. Biol. Cybern., 87, 373–382. Kawagoe, R., Takikawa, Y., & Hikosaka, O. (1998). Expectation of reward modulates cognitive signals in the basal ganglia. Nat. Neurosci., 1, 411– 416. Kempter, R., Gerstner, W., & van Hemmen, J. L. (1999). Hebbian learning and spiking neurons. Phys. Rev. E., 59, 4498–4515. Kempter, R., Gerstner, W., & van Hemmen J. L. (2001a). 
Intrinsic stabilization of output rates by spike-based Hebbian learning. Neural Comp., 13, 2709–2741. Kempter, R., Leibold, C., Wagner, H., & van Hemmen, J. L. (2001b). Formation of temporal-feature maps by axonal propagation of synaptic learning. Proc. Natl. Acad. Sci. (USA), 98(7), 4166–4171. King, M. M., Huang, C. Y., Chock, P. B., Nairn, A. C., Hemmings, H. C., J., Chan, K. F., & Greengard, P. (1984). Mammalian brain phosphoproteins as substrates for calcineurin. J. Biol. Chem., 259, 8080–8083. Kistler, W. M. (2002). Spike-timing dependent synaptic plasticity: A phenomenological framework. Biol. Cybern., 87, 416–427. Kistler, W. M., & van Hemmen, J. L. (2000). Modeling synaptic plasticity in conjunction with the timing of pre- and post-synaptic action potentials. Neural Comput., 12, 385–405. Kitano, K., Okamoto, H., & Fukai, T. (2003). Time representing cortical activities: Two models inspired by prefrontal persistent activity. Biol. Cybern, 88(5), 387– 394.
Klopf, A. H. (1972). Brain function and adaptive systems—a heterostatic theory (Tech. Rep. Air Force Cambridge Research Laboratories Special Report No. 133). Alexandria, VA: Defense Technical Information Center, Cameron Station. Klopf, A. H. (1982). The hedonistic neuron: A theory of memory, learning, and intelligence. Washington, DC: Hemisphere. Klopf, A. H. (1986). A drive-reinforcement model of single neuron function. In J. S. Denker (Ed.), Neural networks for computing: AIP Conf. Proc. New York: American Institute of Physics. Klopf, A. H. (1988). A neuronal model of classical conditioning. Psychobiol., 16(2), 85–123. Knutson, B., Fong, G. W., Adams, C. M., Varner, J. L., & Hommer, D. (2001). Dissociation of reward anticipation and outcome with event-related fMRI. Neuroreport, 12, 3683–3687. Koester, H. J., & Sakmann, B. (1998). Calcium dynamics in single spines during coincident pre- and postsynaptic activity depend on relative timing of back-propagating action potentials and subthreshold excitatory postsynaptic potentials. Proc. Natl. Acad. Sci. USA, 95, 9596–9601. Kosko, B. (1986). Differential Hebbian learning. In J. S. Denker (Ed.), Neural networks for computing: AIP Conference Proc. New York: American Institute of Physics. Kurata, K., & Wise, S. P. (1988). Premotor and supplementary motor cortex in rhesus monkeys: Neuronal activity during externally- and internally-instructed motor tasks. Exp. Brain Res., 72, 237–248. Lasser-Ross, N., & Ross, W. N. (1992). Imaging voltage and synaptically activated sodium transients in cerebellar Purkinje cells. Proc. R. Soc. B. Biol. Sci., 247, 35–39. Leibold, C., Kempter, R., & van Hemmen, J. L. (2001). Temporal map formation in the barn owl's brain. Phys. Rev. Lett., 87(24), 248101-1–248101-4. Leibold, C., & van Hemmen, J. L. (2001). Temporal receptive fields, spikes, and Hebbian delay selection. Neural Networks, 14(6-7), 805–813. Leibold, C., & van Hemmen, J. L. (2002). Mapping time. Biol. Cybern., 87, 428–439. 
Levy, N., Horn, D., Meilijson, I., & Ruppin, E. (2001). Distributed synchrony in a cell assembly of spiking neurons. Neural Networks, 14, 815–824. Levy, W. B., & Steward, O. (1983). Temporal contiguity requirements for longterm associative potentiation/depression in the hippocampus. Neurosci., 8, 791–797. Linden, D. J. (1999). The return of the spike: Postsynaptic action potentials and the induction of LTP and LTD. Neuron, 22, 661–666. Lisman, J. (1989). A mechanism for the Hebb and the anti-Hebb processes underlying learning and memory. Proc. Natl. Acad. Sci. USA, 86, 9574–9578. Ljungberg, T., Apicella, P., & Schultz, W. (1992). Responses of monkey dopamine neurons during learning of behavioral reactions. J. Neurophysiol., 67, 145–163. Lovinger, D. M., Tyler, E. C., & Merritt, A. (1993). Short- and long-term synaptic depression in rat neostratium. J. Neurophysiol., 70, 1937–1949. Luscher, ¨ C., & Frerking, M. (2001). Restless AMPA receptors: Implications for synaptic transmission and plasticity. Trends Neurosci., 24(11), 665–670.
312
F. Worg ¨ otter ¨ and B. Porr
Lynd-Balta, E., & Haber, S. N. (1994). Primate striatonigral projections: A comparison of the sensorimotor-related striatum and the ventral striatum. J. Comp. Neurol., 345, 562–578. Mackintosh, N. J. (1974). The psychology of animal learning. Orlando, FL: Academic Press. Mackintosh, N. J. (1983). Conditioning and associative learning. New York: Oxford University Press. Magee, J. C., & Johnston, D. (1997). A synaptically controlled, associative signal for Hebbian plasticity in hippocampal neurons. Science, 275, 209–213. Malenka, R. C., Kauer, J. A., Perkel, D. J., Mauk, M. D., Kelly, P. T., Nicoll, R. A., & Waxham, M. N. (1989). An essential role for postsynaptic calmodulin and protein kinase activity in long-term potentiation. Nature, 340, 554–557. Malenka, R. C., & Nicoll, R. A. (1999). Long-term potentiation—a decade of progress? Science, 285, 1870–1874. Malinow, R., Schulman, H., & Tsien, R. W. (1989). Inhibition of postsynaptic PKC or CaMKII blocks induction but not expression of LTP. Science, 245, 862–866. Markram, H., Lubke, ¨ J., Frotscher, M., & Sakmann, B. (1997). Regulation of synaptic efficacy by coincidence of postsynaptic APs and EPSPs. Science, 275, 213–215. Martin, S. J., Grimwood, P. D., & Morris, R. G. M. (2000). Synaptic plasticity and memory: An evaluation of the hypothesis. Annu. Rev. Neurosci., 23, 649–711. Martinez, J. L., & Derrick, B. E. (1996). Long-term potentiation and learning. Annu. Rev. Psychol., 47, 173–203. Mayer, M. L., Westbrook, G. L., & Guthrie, P. B. (1984). Voltage-dependent block by Mg2 + of NMDA responses in spinal cord neurones. Nature, 309, 261–263. McFarland, D. J. (1989). Problems of animal behavior. Harlow: Longman Scientific & Technical. Mehta, M. R. (2004). Cooperative LTP can map memory sequences on dendritic branches. TINS, 27, 69–72. Mehta, M. R., Lee, A. K., & Wilson, M. A. (2002). Role of experience and oscillations in transforming a rate code into a temporal code. Nature, 417(6890), 741–746. 
Melamed, O., Gerstner, W., Maass, W., Tsodyks, M., & Markram, H. (2004). Coding and learning of behavioral sequences. Trends Neurosci, 27(1), 11–14. Memo, M., Lovenberg, W., & Hanbauer, I. (1982). Agonist-induced subsensitivity of adenylate cyclase coupled with a dopamine receptor in slices from rat corpus striatum. Proc. Natl. Acad. Sci. USA, 79, 4456–4460. Migliore, M., Hoffmann, D. A., Magee, J. C., & Johnston, D. (1999). Role of an A-type K+ conductance in the back-propagation of action potentials in the dendrites of hippocampal pyramidal neurons. J. Comput. Neurosci., 7, 5–15. Miller, J. D., Sanghera, M. K., & German, D. C. (1981). Mesencephalic dopaminergic unit activity in the behaviorally conditioned rat. Life Sci., 29, 1255–1263. Mirenowicz, J., & Schultz, W. (1994). Importance of unpredictability for reward responses in primate dopamine neurons. J. Neurophysiol., 72(2), 1024–1027. Montague, P. R., Dayan, P., Person, C., & Sejnowski, T. J. (1995). Bee foraging in uncertain environments using predictive Hebbian learning. Nature, 377, 725–728.
Temporal Sequence, Prediction, and Control
313
Montague, P. R., Dayan, P., & Sejnowski, T. J. (1996). A framework for mesencephalic dopamine systems based on predictive Hebbian learning. J. Neurosci., 16(5), 1936–1947. Morris, B. J., Simpson, C. S., Mundell, S., Maceachern, K., Johnston, H. M., & Nolan, A. M. (1997). Dynamic changes in NADPH-diaphorase staining reflect activity of nitric oxide synthase: Evidence for a dopaminergic regulation of striatal nitric oxide release. Neuropharmacology, 36, 1589–1599. Mulkey, R. M., Endo, S., Shenolikar, S., & Malenka, R. C. (1994). Involvement of a calcineurin/inhibitor-1 phosphatase cascade in hippocampal long-term depression. Nature, 369, 486–488. Nicola, S. M., Surmeier, J., & Malenka, R. C. (2000). Dopaminergic modulation of neuronal excitability in the striatum and nucleus accumbens. Ann. Rev. Neurosci., 23, 185–215. Nishi, A., Snyder, G. L., & Greengard, P. (1997). Bidirectional regulation of DARPP-32 phosphorylation by dopamine. J. Neurosci., 17, 8147–8155. Nishiyama, M., Hong, K., Mikoshiba, K., Poo, M., & Kato, K. (2000). Calcium release from internal stores regulates polarity and input specificity of synaptic modification. Nature, 408, 584–588. Nobre, A. C., Coull, J. T., Frith, C. D., & Mesulam, M. M. (1999). Orbitofrontal cortex is activated during breaches of expectation in tasks of visual attention. Nature Neurosci., 2, 11–12. Nowak, L., Bregestovski, P., Ascher, P., Herbet, A., & Prochiantz, A. (1984). Magnesium gates glutamate-activated channels in mouse central neurones. Nature, 307, 462–465. O’Doherty, J., Dayan, P., Friston, K., Critchley, H., & Dolan, R. J. (2003). Temporal difference models and reward-related learning in the human brain. Neuron, 28, 329–337. O’Doherty, J., Deichman, R., Critchley, H. D., & Dolan, R. J. (2002). Neural responses during anticipation of primary taste reward. Neuron, 33, 815–826. O’Doherty, J., Rolls, E. T., Francis, S., Bowtell, R., & McGlone, F. (2001). Representation of pleasant and aversive taste in the human brain. 
J. Neurophysiol., 85, 1315–1321. Overton, P. G., & Clark, D. (1997). Burst firing in midbrain dopaminergic neurons. Brain Res. Rev., 25, 312–334. Pacheco-Cano, M. T., Bargas, J., Hernandez-Lopez, S., Tapia, D., & Galarraga, E. (1996). Inhibitory action of dopamine involves a subthreshold Cs+ -sensitive conductance in neostriatal neurons. Exp. Brain Res., 110, 205–211. Pagoni, G., Zink, C. F., Montague, P. R., & Berns, G. S. (2002). Activity in human ventral striatum locked to errors of reward prediction. Nature Neurosci., 5, 97–98. Panchev, C., Wermter, S., & Chen, H. (2002). Spike-timing dependent competitive learning of integrate-and-fire neurons with active dendrites. In Lecture Notes in Computer Science. Proc. Int. Conf. Artificial Neural Networks (pp. 896– 901). Berlin: Springer. Parent, A. (1990). Extrinsic connections of the basal ganglia. Trends Neurosci., 13, 254–258.
314
F. Worg ¨ otter ¨ and B. Porr
Partridge, J. G., Tang, K. C., & Lovinger, D. M. (2000). Regional and postnatal heterogeneity of activity-dependent long-term changes in synaptic efficacy in the dorsal striatum. J. Neurophysiol., 84, 1422–1429. Pavlov, P. I. (1927). Conditioned reflexes. New York: Oxford University Press. Peng, J. (1993). Efficient dynamic programming-based learning for control. Unpublished doctoral dissertation, Northeastern University. Pennartz, C. M., Ameerun, R. F., Groenewegen, H. J., & Lopes da Silva, F. H. (1993). Synaptic plasticity in an in vitro slice preparation of the rat nucleus accumbens. European J. Neurosci., 5, 107–117. Pike, F. G., Meredith, R. M., Olding, A. A., & Paulsen, O. (1999). Postsynaptic bursting is essential for ”Hebbian” induction of associative long-term potentiation at excitatory synapses in rat hippocampus. J. Physiol. (Lond.), 518, 571–576. Pollack, J. B., & Blair, A. D. (1997). Why did TD-Gammon work? In M. Mozer, M. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, (pp. 10–16). Cambrdige, MA: MIT Press. Porr, B., von Ferber, C., & Worg ¨ otter, ¨ F. (2003). ISO-learning approximates a solution to the inverse-controller problem in an unsupervised behavioral paradigm. Neural Comp., 15, 865–884. Porr, B., & Worg ¨ otter, ¨ F. (2002). Isotropic sequence order learning using a novel linear algorithm in a closed loop behavioural system. Biosystems, 67(1–3), 195–202. Porr, B., & Worg ¨ otter, ¨ F. (2003a). Isotropic sequence order learning. Neural Comp., 15, 831–864. Porr, B., & Worg ¨ otter, ¨ F. (2003b). Isotropic sequence order learning in a closed loop behavioural system. Proc. Roy. Soc. B. Porr, B., & Worg ¨ otter, ¨ F. (2003c). Isotrpic sequence order learning in a closed loop behavioural system. Roy. Soc. Phil. Trans. Mathematical, Physical, & Engineering Sciences, 361, 2225–2244. Prokasy, W. F., Hall, J. F., & Fawcett, J. T. (1962). 
Adaptation, sensitization, forward and backward conditioning, and pseudo-conditioning of the GSR. Psychol. Rep., 10, 103–106. Pucak, M. L., & Grace, A. A. (1994). Regulation of substantia nigra dopamine neurons. Crit. Rev. Neurobiol., 9, 67–89. Rao, R. P. N., & Sejnowski, T. J. (2001). Spike-timing-dependent Hebbian plasticity as temporal difference learning. Neural Comp., 13, 2221–2237. Redgrave, P., Prescott, T. J., & Gurney, K. (1999). Is the short-latency dopamine response too short to signal reward? Trends Neurosci., 22, 146–151. Regehr, W. G., Konnerth, A., & Armstrong, C. M. (1992). Sodium action potentials in the dendrites of cerebellar Purkinje cells. Proc. Natl. Acad. Sci. USA, 89, 5492–5496. Rescorla, R. A., & Wagner, A. R. (1972). A theory of Pavlovian conditioning: Variations in the effectiveness of reinforcement and nonreinforcement. In A. H. Black & W. F. Prokasy (Eds.), Classical conditioning II: Current research and theory. New York: Appleton Century Crofts. Reynolds, J. N. J., & Wickens, J. R. (2002). Dopamine-dependent plasticity of corticostriatal synapses. Neural Networks, 15, 507–521.
Temporal Sequence, Prediction, and Control
315
Reynolds, S. I. (2002). Reinforcement learning with exploration. Unpublished doctoral dissertation, University of Birmingham. Roberts, P. D. (1999). Computational consequences of temporally asymmetric learning rules: I. Differential Hebbian learning. J. Comput. Neurosci., 7(3), 235–246. Roberts, P. D., & Bell, C. C. (2002). Spike timing dependent synaptic plasticity in biological systems. Biol. Cybern., 87, 392–403. Roche, K. W., O’Brian, R. J., Mammen, A. L., Bernhardt, J., & Huganir, R. L. (1996). Characterization of multiple phosphorylation sites on the AMPA receptor GluR1 subunit. Neuron, 16, 1179–1188. Rose, C. R., & Konnerth, A. (2001). Stores not just for storage, intracellular calcium release and synaptic plasticity. Neuron, 31, 519–522. Rubin, J., Lee, D. D., & Sompolinsky, H. (2001). Equilibrium properties of temporally asymmetric Hebbian plasticity. Phys. Rev. Lett., 86(2), 364–367. Rummery, G. A. (1995). Problem solving with reinforcement learning. Unpublished doctoral dissertation, Cambridge University. Russell, V. A., Allin, R., Lamm, M. C., & Taljaard, J. J. (1992). Regional distribution of monoamines and dopamine D1- and D2-receptors in the striatum of the rat. Neurochemical Res., 17, 387–395. Santos, J. M., & Touzet, C. (1999a). Dynamic update of the reinforcement function during learning. Connection Science [Special issue], 11 (3–4). Santos, J. M. & Touzet, C. (1999b). Exploration tuned reinforcement function. Neurocomputing [Special issue], 28(1–3), 93–105. Sato, N., & Yamaguchi, Y. (2003). Memory encoding by theta phase precession in the hippocampal network. Neural Computation, 15, 2379–2397. Saudargiene, A., Porr, B., & Worg ¨ otter, ¨ F. (2004). How the shape of pre- and postsynaptic signals can influence STDP: A biophysical model. Neural Comp., 16, 595–626. Schiller, J., Schiller, Y., Stuart, G., & Sakmann, B. (1997). Calcium action potentials restricted to distal apical dendrites of rat neocortical pyramidal neurons. J. 
Physiol., 505, 605–616. Schoenbaum, G., Chiba, A. A., & Gallagher, M. (1998). Orbitofrontal cortex and basolateral amygdala encode expected outcomes during learning. Nature Neurosci., 1, 155–159. Schultz, W. (1986). Responses of midbrain dopamine neurons to behavioral trigger stimuli in the monkey. J. Neurophysiol., 56, 1439–1462. Schultz, W. (1998). Predictive reward signal of dopamine neurons. J. Neurophysiol., 80, 1–27. Schultz, W. (2002). Getting formal with dopamine and reward. Neuron, 36, 241–263. Schultz, W., Apiccela, P., Scarnati, E., & Ljungberg, T. (1992). Neuronal activity in monkey ventral striatum related to the expectation of reward. J. Neurosci., 12, 4595–4610. Schultz, W., Dayan, P., & Montague, P. R. (1997). A neural substrate of prediction and reward. Science, 275, 1593–1599. Schultz, W., & Dickinson, A. (2000). Neuronal coding of prediction errors. Annu. Rev. Neurosci., 23, 473–500.
316
F. Worg ¨ otter ¨ and B. Porr
Senn, W. & Buchs, N. J. (2003). Spike-based synaptic plasticity and the emergence of direction selective simple cells: Mathematical analysis. J. Comput. Neurosci., 14, 119–138. Senn, W., Markram, H., & Tsodyks, M. (2000). An algorithm for modifying neurotransmitter release probability based on pre-and postsynaptic spike timing. Neural Comp., 13, 35–67. Seung, H. S. (1998). Learning continous attractors in recurrent networks. In M. Kearns, M. Jordan, & S. Solla, (Eds.), Advances in neural information processing systems, 10 (pp. 654–660). Cambridge, MA: MIT Press. Shouval, H. Z., Bear, M. F., & Cooper, L. N. (2002). A unified model of NMDA receptor-dependent bidirectional synaptic plasticity. Proc. Natl. Acad. Sci. (USA), 99(16), 10831–10836. Sibley, D. R., & Monsma, F. J. J. (1992). Molecular biology of dopamine receptors. Trends Pharmakol. Sci., 13, 61–69. Singh, S. P., & Sutton, R. S. (1996). Reinforcement learning with replacing eligibility traces. Mach. Learn., 22, 123–158. Sjostr ¨ om, ¨ P. J., Turrigiano, G. G., & Nelson, S. B. (2001). Rate, timing, and cooperativity jointly determine cortical synaptic plasticity. Neuron, 32, 1149–1164. Song, I., & Huganir, R. L. (2002). Regulation of AMPA receptors during synaptic plasticity. Trends Neurosci., 25(11), 578–588. Song, S., & Abbott, L. F. (2001). Cortical development and remapping through spike timing-dependent plasticity. Neuron, 32, 1–20. Song, S., Miller, K. D., & Abbott, L. F. (2000). Competitive Hebbian learning through spike-timing-dependent synaptic plasticity. Nature Neurosci., 3, 919– 926. Spencer, J. P., & Murphy, K. P. (2000). Bi-directional changes in synaptic plasticity induced at corticostriatal synapses in vitro. Exp. Brain Res., 135, 497–503. Sterratt, D. C., & van Ooyen, A. (2002). Does morphology influence temporal plasticity? In J. R. Dorronsoro, (Ed.), ICANN 2002 (Vol. LNCS 2415, pp. 186– 191). Berlin: Springer-Verlag. Stoof, J. C., & Kebabian, J. W. (1981). 
Opposing roles for D-1 and D-2 dopamine receptors in efflux of cyclic AMP from rat neostriatum. Nature, 294, 366–368. Stuart, G. J., & Sakmann, B. (1994). Active propagation of somatic action potentials into neocortical pyramidal cell dendrites. Nature, 367, 69–72. Stuart, G., Spruston, N., Sakmann, B., & H¨ausser, M. (1997). Action potential initiation and backpropagation in neurons of the mammalian central nervous system. Trends Neurosci., 20, 125–131. Suri, R. E., Bargas, J., & Arbib, M. A. (2001). Modeling functions of striatal dopamine modulation in learning and planning. Neurosci., 103(1), 65–85. Suri, R. E., & Schultz, W. (1998). Learning of sequential movements by neural network model with dopamine-like reinforcement signal. Exp. Brain Res., 121, 350–354. Suri, R. E., & Schultz, W. (1999). A neural network model with dopamine-like reinforcement signal that learns a spatial delayed response task. Neurosci., 91(3), 871–890. Suri, R. E., & Schultz, W. (2001). Temporal difference model reproduces anticipatory neural activity. Neural Comp., 13(4), 841–862.
Temporal Sequence, Prediction, and Control
317
Surmeier, D. J., Bargas, J., Hemmings, H. C., J., Nairn, A. C., & Greengard, P. (1995). Modulation of calcium currents by a D1 dopaminergic protein kinase/phosphatase cascade in rat neostriatal neurons. Neuron, 14, 385–397. Surmeier, D. J., & Kitai, S. T. (1993). D1 and D2 dopamine receptor modulation of sodium and potassium currents in rat neostriatal neurons. Prog. Brain Res., 99, 309–324. Sutton, R. S. (1984). Temporal credit assignment in reinforcement learning. Unpublished doctoral dissertation, University of Massachusetts. Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Mach. Learn., 3, 9–44. Sutton, R. S. (1996). Generalization in reinforcement learning: Successful examples using sparse coarse coding. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 1038– 1044). Cambridge, MA: MIT Press. Sutton, R. S. (1999). Open theoretical questions in reinforcement learning. In EUROColt, (pp. 11–17). Sutton, R., & Barto, A. (1981). Towards a modern theory of adaptive networks: Expectation and prediction. Psychol. Review, 88, 135–170. Sutton, R. S., & Barto, A. G. (1990). Time-derivative models of Pavlovian reinforcement. In M. Gabriel, & J. Moore (Eds.), Learning and computational neuroscience: Foundation of adaptive networks. Cambridge, MA: MIT Press. Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. Cambridge, MA: MIT Press. Svenningsson, P., Lindskog, M., Rognoni, F., Fredholm, B. B., Greengard, P., & Fisone, G. (1998). Activation of adenosine A2A and dopamine D1 receptors stimulates cyclic AMP-dependent phosphorylation of DARPP-32 in distinct populations of striatal projection neurons. Neurosci., 84, 223–228. Svoboda, K., & Mainen, Z. F. (1999). Synaptic Ca2+ : intracellular stores spill their guts. Neuron, 22, 427–430. Tang, K., Low, M. J., Grandy, D. K., & Lovinger, D. M. (2001). 
Dopaminedependent synaptic plasticity in striatum during in vivo development. Proc. Natl. Acad. Sci. USA, 98, 1255–1260. Tesauro, G. (1992). Practical issues in temporal difference learning. Mach. Learn., 8, 257–277. Tingley, W. G., Ehlers, M. D., Kameyama, K., Doherty, C., Ptak, J. B., Riley, C. T., & Huganir, R. L. (1997). Characterization of the protein kinase A and protein kinase C phosphorylation of the N-methyl-D-aspartate receptor NR1 subunit using phosphorylation site-specific antibodies. J. Biol. Chem., 272, 5157–5166. Toan, D. L., & Schultz, W. (1985). Responses of rat pallidum cells to cortex stimulation and effects of altered dopaminergic activity. Neurosci., 15, 683– 694. Touzet, C. (1999). Neural networks and Q-learning for robotics. Available online at: http://www.sciences-cognitives.org/scico/annuaire/Touzet Claude/ Touzet IJCNN Tut.pdf. Tremblay, L., Hollerman, J. R., & Schultz, W. (1998). Modifications of reward expectation-related neuronal activity during learning in primate striatum. J. Neurophysiol., 80, 964–977.
318
F. Worg ¨ otter ¨ and B. Porr
Tremblay, L., & Schultz, W. (1999). Relative reward preference in primate orbitofrontal cortex. Nature, 398, 704–708. Umemiya, M., & Raymond, L. A. (1997). Dopaminergic modulation of excitatory postsynaptic currents in rat neostriatal neurons. J. Neurophysiol., 78, 1248– 1255. van Hemmen, J. L. (2001). Theory of synaptic plasticity. In F. Moss, & S. Gielen, (Eds.), Handbook of biological physics, neuro-informatics, neural modelling (Vol. 4, pp. 771–823. Amsterdam: Elsevier. van Rossum, M. C. W., Bi, G. Q., & Turrigiano, G. G. (2000). Stable Hebbian learning from spike timing-dependent plasticity. J. Neurosci., 20(23), 8812– 8821. Walsh, J. P. (1993). Depression of excitatory synaptic input in rat striatal neurons. Brain Res., 608, 123–128. Walsh, J. P., & Dunia, R. (1993). Synaptic activation of N-methyl-D-aspartate receptors induces short-term potentiation at excitatory synapses in the striatum of the rat. Neurosci., 57, 241–248. Watkins, C. J. C. H. (1989). Learning from delayed rewards. Unpublished doctoral dissertation, Cambridge University. Watkins, C. J. C. H., & Dayan, P. (1992). Technical note: Q-learning. Mach. Learn., 8, 279–292. West, A. E., Chen, W. G., Dalva, M. B., Dolmetsch, R. E., Kornhauser, J. M., Shaywitz, A. J., Takasu, M. A., Tao, X., & Greenberg, M. E. (2001). Calcium regulation of neuronal gene expression. Proc. Natl. Acad. Sci. USA, 98, 11024– 11031. Wickens, J. R., Begg, A. J., & Arbuthnott, G. W. (1996). Dopamine reverses the depression of rat corticostriatal synapses which normally follows highfrequency stimulation of cortex in vitro. Neurosci., 70, 1–5. Widrow, B., & Hoff, M. E. (1960). Adaptive switching circuits. In IRE WESCON Convention Record (Vol. 4, pp. 96–104). New York. Wiering, M., & Schmidhuber, J. (1998). Fast online Q(λ). Mach. Learn., 33, 105–115. Williams, S. M., & Goldman-Rakic, P. S. (1993). Characterization of the dopaminergic innervation of the primate frontal cortex using a dopamine-specific antibody. Cereb. 
Cortex, 3, 199–222. Witten, I. H. (1977). An adaptive optimal controller for discrete-time Markov environments. Information and Control, 34, 86–295. Xie, X., & Seung, S. (2000). Spike-based learning rules and stabilization of persistent neural activity. In S. A. Solla, T. K. Leen, & K. R. Muller, ¨ (Eds.), Advances in neural information processing systems, 12 (pp. 199–208). Cambridge, MA: MIT Press. Yamamoto, Y., Nakanishi, H., Takai, N., Shimazoe, T., Watanabe, S., & Kita, H. (1999). Expression of N-methyl-D-aspartate receptor-dependent longterm potentiation in the neostriatal neurons in an in vitro slice after ethanol withdrawal of the rat. Neurosci., 91, 59–68. Yang, S. N., Tang, Y. G., & Zucker, R. S. (1999). Selective induction of LTP and LTD by postsynaptic Ca2+ elevation. J. Neurophysiol., 81, 781–787. Yao, H., & Dan, Y. (2001). Stimulus timing-dependent plasticity in cortical processing of orientation. Neuron, 32, 315–323.
Temporal Sequence, Prediction, and Control
319
Yim, C. Y., & Mogenson, G. J. (1982). Response of nucleus accumbens neurons to amygdala stimulation and its modification by dopamine. Brain Res., 239, 401–415. Zhang, L. I., Tao, H. W., Holt, C. E., Harris, W. A., & Poo, M.-M. (1998). A critical window for cooperation and competition among developing retinotectal synapses. Nature, 395, 37–44. Zink, C. F., Pagnoni, G., Martin, M. E., Dhamala, M., & Berns, G. S. (2003). Human striatal response to salient nonrewarding stimuli. J. Neurosci., 23(22), 8092–8097. Zucker, R. S. (1999). Calcium- and activity-dependent synaptic plasticity. Curr. Opin. Neurobiol., 9, 305–313. Received September 4, 2003; accepted July 2, 2004.
NOTE
Communicated by James Stone
A Note on Stone’s Conjecture of Blind Signal Separation Shengli Xie
[email protected] Zhaoshui He
[email protected] Yuli Fu
[email protected] School of Electronics and Information Engineering, South China University of Technology, Guangzhou 510640, China
Stone’s method is one of the novel approaches to the blind source separation (BSS) problem and is based on Stone’s conjecture. However, this conjecture has not been proved. We present a simple simulation to demonstrate that Stone’s conjecture is incorrect. We then modify Stone’s conjecture and prove this modified conjecture as a theorem, which can be used a basis for BSS algorithms. 1 Problem Formulation Blind source separation (BSS) separates the source signals from the observed signals without any prior information of the source signals and the transfer channel. “Blind” means that both the source and the channel are unknown. The mathematical model of this problem is written as X(t) = AT S(t)
(1.1)
Y(t) = W X(t),
(1.2)
T
where equation 1.1 is the model of the mixture, while equation 1.2 is the model of the separation. The blind (unknown) source signal vector is denoted as S(t) = (s1 (t), s2 (t), · · · , sn (t))T , and the observed signal vector is denoted as X(t) = (x1 (t), x2 (t), · · · , xm (t))T . The vector Y(t) = (y1 (t), y2 (t), · · · , yn (t))T indicating the separated signals. A is the unknown n × m mixture matrix (indicating the unknown channel). W is the separating matrix. The aim of BSS is to get the expression Y(t) = PDS(t)
(1.3)
by adjusting the separation matrix W. In equation 1.3, P is a permutation matrix, while D is a diagonal matrix. Usually the source signals are only assumed to be statistically independent. Neural Computation 17, 321–330 (2005)
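As a concrete toy instance of equations 1.1 and 1.2 (a sketch with made-up sources; in a genuine BSS problem A is unknown and W must be learned from X alone), note that choosing W with W^T = (A^T)^{-1} recovers Y(t) = S(t), that is, equation 1.3 with P = D = I:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1000
t = np.arange(N)

# Three statistically independent (illustrative) sources, stacked as rows of S.
S = np.vstack([
    np.sin(2 * np.pi * t / 50),            # smooth periodic source
    np.sign(np.sin(2 * np.pi * t / 33)),   # square-wave source
    rng.standard_normal(N),                # gaussian noise source
])

A = rng.standard_normal((3, 3))            # unknown mixing matrix (n x m)
X = A.T @ S                                # observed mixtures, equation 1.1

# The ideal separator inverts the mixture: W^T = (A^T)^{-1}.
W = np.linalg.inv(A.T).T
Y = W.T @ X                                # equation 1.2; here Y(t) = S(t)
```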
© 2005 Massachusetts Institute of Technology
322
S. Xie, Z. He, and Y. Fu
Many methods have been used to deal with BSS: for example, the neural network method (Amari, Cichocki, & Yang, 1996; Bell & Sejnowski, 1995), the minimal mutual information method (Pham, 2002; Xie & Zhang, 2002), the higher-order statistics method (Belouchrani, Karim, Cardoso, & Moulines, 1997; Yeredor, 2000), and the geometric method (Taro, Hirokewa, & Itoh, 2000; Xie & Zhang, in press). Recently, based on Stone's conjecture, a novel method was proposed to deal with BSS problems for the instantaneous mixture model (Stone, 2001), the convolution mixture model (Stone, 2002), and the nonlinear mixture model (Martinez & Bray, 2003). Some fast algorithms were proposed in those articles. Thus, Stone's conjecture-based method has gained attention.

Generally, new methods and algorithms should rest on a reliable theoretical basis. If Stone's conjecture could be proved, the algorithms proposed in Stone (2001, 2002) and Martinez and Bray (2003) would indeed have such a basis. However, in many cases Stone's conjecture is incorrect. The purpose of this note is to modify Stone's conjecture and prove the modified conjecture theoretically. First, we present a simple simulation to demonstrate the problems with Stone's conjecture. Then we modify Stone's conjecture and prove the modified conjecture as a theorem, so that the theorem can serve as a basis for BSS algorithms.

2 The Modification of Stone's Conjecture

Stone (2001) proposed temporal predictability and Stone's conjecture, and some BSS algorithms were presented based on the conjecture. We now cite the definition of temporal predictability and Stone's conjecture from the original paper (Stone, 2001).

2.1 Definition of Temporal Predictability and Stone's Conjecture.

2.1.1 Temporal Predictability. For a given signal series z = {z_τ}, its predictability is defined as

    r_z = log( Σ_{τ=1}^{n} (z_τ − z̄_τ)² / Σ_{τ=1}^{n} (z_τ − z̃_τ)² ),

where

    z̄_τ = λ_L z̄_{τ−1} + (1 − λ_L) z_{τ−1}, 0 ≤ λ_L ≤ 1,
    z̃_τ = λ_S z̃_{τ−1} + (1 − λ_S) z_{τ−1}, 0 ≤ λ_S ≤ 1,

with λ_L = 2^{−1/h_L}, λ_S = 2^{−1/h_S}, and h_L, h_S are half-life parameters to be selected.
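The two moving-average predictions z̄ and z̃ and the predictability r_z can be implemented directly from this definition (a minimal sketch; the function names and the half-life values h_L = 100, h_S = 1 are our illustrative choices, not Stone's settings):

```python
import numpy as np

def ewma(z, half_life):
    """One-step prediction of z from its past, with forgetting factor
    lambda = 2 ** (-1 / half_life): pred[t] = lam*pred[t-1] + (1-lam)*z[t-1]."""
    lam = 2.0 ** (-1.0 / half_life)
    pred = np.empty(len(z))
    pred[0] = z[0]
    for tau in range(1, len(z)):
        pred[tau] = lam * pred[tau - 1] + (1.0 - lam) * z[tau - 1]
    return pred

def temporal_predictability(z, h_long=100.0, h_short=1.0):
    """r_z: log ratio of long-term to short-term prediction error."""
    zbar = ewma(z, h_long)     # sluggish long-term prediction
    ztilde = ewma(z, h_short)  # agile short-term prediction
    return np.log(np.sum((z - zbar) ** 2) / np.sum((z - ztilde) ** 2))
```

A slowly varying signal is far better predicted over the short term than over the long term, so it scores high; white noise scores near zero.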
A Note on Stone’s Conjecture of Blind Signal Separation
323
Figure 1: Waveforms of the three independent source signals s1, s2, and s3.
2.1.2 Stone’s Conjecture. The conjecture is, ”The temporal predictability of any signal mixture is less than (or equal to) that of any of its component source signals” (Stone, 2001). 2.2 The Faultiness of Stone’s Conjecture. Stone (2001) gave a counterexample to his conjecture and redefined the predictability. However, the source signals in this counterexample are not independent. Also, he did not prove the original conjecture in his early and later papers. By a simple example with independent source signals, we will point out the faultiness of the Stone’s conjecture. The three independent source signals (see Figure 1) are taken from Stone’s experiments (Stone, 2001). The supergaussian signal s1 is with the maximal temporal predictability rs1 = 5.2332, the sorted gaussian noise s2 is with the second maximal temporal predictability rs2 = 3.8966, and the subgaussian sine wave s3 is with the minimal temporal predictability rsnew = 1.8381. A mixed signal is taken as snew = s1 + s2 . Its temporal predictability can be computed by the definition rsnew = 5.1045. Obviously, we have rsnew > rs2 . Therefore, Stone’s conjecture is not always true for the mixture of the subgaussian signal, gaussian signal, and supergaussian signal. In fact, some counterexamples can also be given by taking some other type of signals.
2.3 Modification of Stone’s Conjecture. For convenience, we define an equivalent index to replace temporal predictability. Definition 1 (predictive exponent). predicted exponent is defined as n
Vz =
(zτ − zτ )2
τ =1 n τ =1
For a given signal series z = {zτ }, its
zτ − z˜τ
2
(= exp(rz )).
(2.1)
Obviously, the predictive exponent is monotonic with respect to the temporal predictability, so the predictive exponent is equivalent to temporal predictability in Stone (2001). Thus, the statements in Stone's conjecture can be rewritten in terms of the predictive exponent. Although Stone's conjecture is not true in general, it can be modified in the following way:

Modified Conjecture. The temporal predictability (or, equivalently, the predictive exponent) of any linear mixture of the source signals lies between the greatest and the smallest temporal predictability (or, equivalently, predictive exponent) of the source signals.

The modified conjecture can be written as a theorem and proved exactly. Note that Stone's method leads to generalized eigenvalue-based algorithms, in which the indices (temporal predictability or, equivalently, predictive exponent) must be sorted by value. Since these indices are time varying, they are not easily used in the algorithms, so we introduce a generalized index:

Definition 2 (covariance rate). Consider a signal z(t) ∈ L(s_1(t), s_2(t), ..., s_k(t)), where L(s_1(t), s_2(t), ..., s_k(t)) denotes the linear space spanned by the signals s_1(t), s_2(t), ..., s_k(t). The index

    R_z = cov(f_z^a(t), f_z^a(t)) / cov(g_z^b(t), g_z^b(t))   (2.2)

is called the covariance rate of the signal z(t), where f_z^a = z − z̄, g_z^b = z − z̃, and z̄, z̃ are defined as in the definition of temporal predictability.
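A sample-based estimate of the covariance rate follows directly (a sketch; np.cov, with its default normalization, stands in for cov(·,·), and the half-life values are again our illustrative choices):

```python
import numpy as np

def ewma(z, half_life):
    lam = 2.0 ** (-1.0 / half_life)
    pred = np.empty(len(z))
    pred[0] = z[0]
    for tau in range(1, len(z)):
        pred[tau] = lam * pred[tau - 1] + (1.0 - lam) * z[tau - 1]
    return pred

def covariance_rate(z, h_long=100.0, h_short=1.0):
    f = z - ewma(z, h_long)   # f_z^a: long-term prediction error
    g = z - ewma(z, h_short)  # g_z^b: short-term prediction error
    return np.cov(f).item() / np.cov(g).item()
```

When the prediction errors are zero-mean, this coincides with the predictive exponent V_z; unlike r_z it is a plain ratio of quadratic forms, which is the shape the generalized eigenvalue formulation needs.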
A Note on Stone’s Conjecture of Blind Signal Separation
325
The covariance rate implies the predictive exponent; it is more general than the predictive exponent and the temporal predictability, and it is more easily used in both computation and theoretical analysis. From the independence of the source signals, we have cov(s_i(t), s_j(t)) = 0 for i ≠ j. The covariance rate R_z of a signal z(t) is independent of time t (time invariant). Therefore, for any z_1(t), z_2(t), ..., z_m(t) ∈ L(s_1(t), s_2(t), ..., s_k(t)), we can assume that

    R_{z_{l1}} ≥ R_{z_{l2}} ≥ · · · ≥ R_{z_{lm}} > 0,   (2.3)

where {l1, l2, ..., lm} is a permutation of {1, 2, ..., m}. For any signal z(t) ∈ L(s_1(t), s_2(t), ..., s_k(t)), let z(t) = W^T S(t), W ∈ R^k. Taking two constants λ_L, λ_S with 0 ≤ λ_S < λ_L ≤ 1, two estimates z̄(t) and z̃(t) of the signal z(t) are defined by

    z̄(t) = λ_L z̄(t − 1) + (1 − λ_L) z(t − 1), t = 2, 3, ..., N,  z̄(1) = z(1),   (2.4)
    z̃(t) = λ_S z̃(t − 1) + (1 − λ_S) z(t − 1), t = 2, 3, ..., N,  z̃(1) = z(1).   (2.5)

Also, we set

    f_z^a(t) = z(t) − z̄(t),  g_z^b(t) = z(t) − z̃(t),  t = 1, 2, ..., N,   (2.6)

and

    F_S^a(t) = (f_{s1}^a(t), f_{s2}^a(t), ..., f_{sk}^a(t))^T,
    G_S^b(t) = (g_{s1}^b(t), g_{s2}^b(t), ..., g_{sk}^b(t))^T.   (2.7)
Some properties of the generalized index, the covariance rate, can now be obtained.

Property 1. If the source signals s_1(t), s_2(t), ..., s_k(t) are independent, we have

    cov(F_S^a(t), F_S^a(t)) = diag( cov(f_{s1}^a(t), f_{s1}^a(t)), cov(f_{s2}^a(t), f_{s2}^a(t)), ..., cov(f_{sk}^a(t), f_{sk}^a(t)) ),
    cov(G_S^b(t), G_S^b(t)) = diag( cov(g_{s1}^b(t), g_{s1}^b(t)), cov(g_{s2}^b(t), g_{s2}^b(t)), ..., cov(g_{sk}^b(t), g_{sk}^b(t)) ).

Property 2. For z(t) ∈ L(s_1(t), s_2(t), ..., s_k(t)), set z(t) = W^T S(t), W ∈ R^k. We have

    cov(f_z^a(t), f_z^a(t)) = W^T cov(F_S^a(t), F_S^a(t)) W,
    cov(g_z^b(t), g_z^b(t)) = W^T cov(G_S^b(t), G_S^b(t)) W.

Property 3. For z_1(t), z_2(t) ∈ L(s_1(t), s_2(t), ..., s_k(t)), let z_1(t) = U^T S(t), z_2(t) = V^T S(t), U, V ∈ R^k. We then have

    cov(f_{z1}^a(t), f_{z2}^a(t)) = U^T cov(F_S^a(t), F_S^a(t)) V,
    cov(g_{z1}^b(t), g_{z2}^b(t)) = U^T cov(G_S^b(t), G_S^b(t)) V.

Property 4. The covariance rate is homogeneous of order zero.
Property 1 tells us that the covariance matrices of independent signals are diagonal. Properties 2 and 3 give the bilinear property of the covariance. And property 4 indicates that for any signal z(t) in the linear signal space and any constant c ≠ 0, the covariance rate of cz(t) is equal to that of z(t). Property 4 will be helpful in computing the covariance rate. Property 3 is easy to understand. The proofs of properties 1, 2, and 4 are in the appendix. Now we can give the theorem that represents the modified conjecture:

Theorem 1. In the linear signal space L(s_1(t), s_2(t), ..., s_k(t)), where the signals s_1(t), s_2(t), ..., s_k(t) are statistically independent and ordered so that R_{s1} ≥ R_{s2} ≥ ··· ≥ R_{sk} > 0, the covariance rate R_x of any signal x(t) ∈ L(s_1(t), s_2(t), ..., s_k(t)) lies between the maximal and minimal covariance rates of the source signals:

max_{x(t) ∈ L(s_1(t), ..., s_k(t))} R_x = R_{s1},   min_{x(t) ∈ L(s_1(t), ..., s_k(t))} R_x = R_{sk}, and
A Note on Stone’s Conjecture of Blind Signal Separation
hence, R_{sk} ≤ R_x ≤ R_{s1}. The proof of theorem 1 is given in the appendix. Therefore, theorem 1 can replace Stone's conjecture as a theoretical basis of BSS algorithms.

3 Conclusion

This note elaborates on modifying Stone's conjecture and establishing a theoretical basis of BSS. We present a simple example to point out that Stone's conjecture is incorrect. Then we give a modified conjecture and prove it as a theorem. The theorem we present can be used as a theoretical basis for BSS algorithms. In future work, we will discuss BSS algorithms based on theorem 1.

Appendix

Proof of Property 1. Since the signals s_1(t), s_2(t), ..., s_k(t) are independent, by probability theory the signals f_{s1}^a(t), f_{s2}^a(t), ..., f_{sk}^a(t) are also independent. Thus, cov(f_{si}^a(t), f_{sj}^a(t)) = 0 when i ≠ j, and the first equality is true. We can prove the second equality in the same way.

Proof of Property 2. Since z̄(1) = z(1) = W^T S(1) = W^T S̄(1), we have

z̄(2) = λ_L z̄(1) + (1 − λ_L) z(1) = λ_L W^T S̄(1) + (1 − λ_L) W^T S(1) = W^T (λ_L S̄(1) + (1 − λ_L) S(1)) = W^T S̄(2).   (A.1)
Similarly, we have

z̄(t) = W^T S̄(t),  t = 1, 2, ..., N.   (A.2)
Then

f_z^a(t) = z(t) − z̄(t) = W^T S(t) − W^T S̄(t)
        = W^T [ (s_1(t), s_2(t), ..., s_k(t))^T − (s̄_1(t), s̄_2(t), ..., s̄_k(t))^T ]
        = W^T (s_1(t) − s̄_1(t), s_2(t) − s̄_2(t), ..., s_k(t) − s̄_k(t))^T
        = W^T (f_{s1}^a(t), f_{s2}^a(t), ..., f_{sk}^a(t))^T = W^T F_s^a(t).   (A.3)
By an analogous argument, we have

g_z^b(t) = W^T G_s^b(t).   (A.4)
From the bilinear property of covariance, it follows that

cov(f_z^a(t), f_z^a(t)) = W^T cov(F_s^a(t), F_s^a(t)) W,
cov(g_z^b(t), g_z^b(t)) = W^T cov(G_s^b(t), G_s^b(t)) W.   (A.5)
This completes the proof of the property.

Proof of Property 4. For the signal z(t) = W^T S(t) and a nonzero constant c, by the definition of the covariance rate, we have

R_cz = cov(f_{cz}^a(t), f_{cz}^a(t)) / cov(g_{cz}^b(t), g_{cz}^b(t))
     = [ (cW)^T cov(F_s^a(t), F_s^a(t)) (cW) ] / [ (cW)^T cov(G_s^b(t), G_s^b(t)) (cW) ]
     = R_z = c^0 R_z.   (A.6)
This proves the property.

Proof of Theorem 1. Without loss of generality, let R_{s1} ≥ R_{s2} ≥ ··· ≥ R_{sk} > 0. Then

cov(g_{si}^b(t), g_{si}^b(t)) / cov(g_{s1}^b(t), g_{s1}^b(t)) ≥ cov(f_{si}^a(t), f_{si}^a(t)) / cov(f_{s1}^a(t), f_{s1}^a(t)),   i = 2, 3, ..., k.   (A.7)

Define the diagonal matrix D_1 = diag(0, q_{22}, ..., q_{kk}), where

q_{ii} = cov(g_{si}^b(t), g_{si}^b(t)) / cov(g_{s1}^b(t), g_{s1}^b(t)) − cov(f_{si}^a(t), f_{si}^a(t)) / cov(f_{s1}^a(t), f_{s1}^a(t)),   i = 2, ..., k.

By equation A.7, we find that D_1 is nonnegative definite. Since x(t) is a linear combination of s_1(t), s_2(t), ..., s_k(t), we can write x(t) = A_1^T S(t), where A_1 is a k-dimensional column vector. So we have A_1^T D_1 A_1 ≥ 0.
It is equivalent to

[ A_1^T diag(v_{11}, v_{22}, ..., v_{kk}) A_1 ] / cov(g_{s1}^b(t), g_{s1}^b(t)) ≥ [ A_1^T diag(ṽ_{11}, ṽ_{22}, ..., ṽ_{kk}) A_1 ] / cov(f_{s1}^a(t), f_{s1}^a(t)),   (A.8)

where v_{ii} = cov(g_{si}^b(t), g_{si}^b(t)) and ṽ_{ii} = cov(f_{si}^a(t), f_{si}^a(t)), i = 1, 2, ..., k. From equation A.8 and property 2, we have

cov(f_{s1}^a(t), f_{s1}^a(t)) / cov(g_{s1}^b(t), g_{s1}^b(t)) ≥ [ A_1^T cov(F_s^a(t), F_s^a(t)) A_1 ] / [ A_1^T cov(G_s^b(t), G_s^b(t)) A_1 ] = cov(f_x^a(t), f_x^a(t)) / cov(g_x^b(t), g_x^b(t)) = R_x.   (A.9)
Hence, we have R_x ≤ R_{s1}. By the same argument, we have R_x ≥ R_{sk}. The proof of theorem 1 is complete.

Acknowledgments

The work is supported by the Guangdong Province Science Foundation for Program of Research Team (grant 04205783), the National Natural Science Foundation of China (grant 60274006), the Natural Science Key Fund of Guangdong Province, China (grant 020826), the National Natural Science Foundation of China for Excellent Youth (grant 60325310), and the Trans-Century Training Program Foundation for the Talents by the State Education Commission of China. We are also grateful to the reviewers for their helpful comments and suggestions.

References

Amari, S., Cichocki, A., & Yang, H. H. (1996). A new algorithm for blind source separation. In D. Touretzky, M. Mozer, & M. Hasselmo (Eds.), Advances in neural information processing, 8 (pp. 757–763). Cambridge, MA: MIT Press.
Bell, A. J., & Sejnowski, T. J. (1995). An information-maximization approach to blind source separation and blind deconvolution. Neural Computation, 7, 1129–1159.
Belouchrani, A., Karim, A. M., Cardoso, J. F., & Moulines, E. (1997). A blind source separation technique using second-order statistics. IEEE Trans. Signal Processing, 45(2), 434–443.
Martinez, D., & Bray, A. (2003). Nonlinear blind source separation using kernels. IEEE Trans. Neural Networks, 14(1), 228–235.
Pham, D. T. (2002). Mutual information approach to blind separation of stationary sources. IEEE Trans. Information Theory, 48, 1935–1946.
Stone, J. V. (2001). Blind source separation using temporal predictability. Neural Computation, 13, 1559–1574.
Stone, J. V. (2002). Blind deconvolution using temporal predictability. Neurocomputing, 49, 79–86.
Taro, Y., Hirokawa, K., & Itoh, K. (2000). Independent component analysis by transforming a scatter diagram of mixtures of signals. Optics Communications, 173, 107–114.
Xie, S., & Zhang, J. (2002). Blind separation algorithm of minimal mutual information based on rotating transform. Acta Electronica Sinica, 30(5), 628–631.
Xie, S., & Zhang, J. (in press). A geometric algorithm of blind source separation based on QR decomposition. Control Theory and Applications.
Yeredor, A. (2000). Blind source separation via the second characteristic function. Signal Processing, 80, 897–902.

Received March 26, 2004; accepted July 2, 2004.
NOTE
Communicated by Barak Pearlmutter
A Further Result on the ICA One-Bit-Matching Conjecture Jinwen Ma
[email protected] Department of Computer Science and Engineering, Chinese University of Hong Kong, Shatin, N.T., Hong Kong, and School of Mathematical Sciences and LMAM, Peking University, Beijing, 100871, China
Zhiyong Liu
[email protected] Lei Xu
[email protected] Department of Computer Science and Engineering, Chinese University of Hong Kong, Shatin, N.T., Hong Kong
The one-bit-matching conjecture for independent component analysis (ICA) has been widely believed in the ICA community. Theoretically, it has been proved that under the assumption of zero skewness for the model probability density functions, the global maximum of a cost function derived from the typical objective function on the ICA problem with the one-bit-matching condition corresponds to a feasible solution of the ICA problem. In this note, we further prove that all the local maxima of the cost function correspond to feasible solutions of the ICA problem in the two-source case under the same assumption. That is, as long as the one-bit-matching condition is satisfied, the two-source ICA problem can be successfully solved using any local descent algorithm of the typical objective function with the assumption of zero skewness for all the model probability density functions.

The so-called one-bit-matching conjecture, which states that "all the sources can be separated as long as there is a one-to-one same-sign-correspondence between the kurtosis signs of all source probability density functions and the kurtosis signs of all model probability density functions," was summarized in Xu, Cheung, and Amari (1998) and formally proved in Liu, Chiu, and Xu (2004) recently under the assumption of zero skewness for the model probability density functions. The one-bit-matching condition guarantees a successful solution of the ICA problem by globally maximizing the following cost function:

J(R) = Σ_{i=1}^n Σ_{j=1}^n r_{ij}^4 ν_j^s k_i^m,   (1)
Neural Computation 17, 331–334 (2005)
© 2005 Massachusetts Institute of Technology
where the orthonormal matrix R = (r_ij)_{n×n} = WA is to be estimated instead of W since A, i.e., the mixing matrix, is a constant,¹ ν_j^s is the kurtosis of the source s_j, and k_i^m is a constant with the same sign as the kurtosis ν_i^m of the model probability density p_i(y_i). As proved by Liu et al. (2004), under the assumption of zero skewness for the model probability density functions, this cost function is equivalent to the typical objective function of the ICA problem used by several researchers, such as Bell and Sejnowski (1995), Amari, Cichocki, and Yang (1996), and Cardoso (1999). However, in practice, typical gradient algorithms (e.g., Amari et al., 1996; Welling & Weber, 2001) cannot guarantee global optimization. Here we further analyze the local maxima of the cost function J(R) in the case of two sources. For convenience, J(R) can be rewritten as Σ_{i=1}^n Σ_{j=1}^n r_{ij}^4 k_{ij}, where k_{ij} = ν_j^s k_i^m is the element of the matrix K = (k_{ij})_{n×n}.

Theorem. Assume that the one-bit-matching condition is satisfied for n = 2; then the local maxima of J(R) are only the permutation matrices up to sign indeterminacy.

Proof. Any second-order orthonormal matrix R belongs to one of two disjoint sets, expressed by

R_1 = ( cos θ  −sin θ ; sin θ  cos θ )  and  R_2 = ( cos θ  sin θ ; sin θ  −cos θ ),

respectively, with one variable θ ∈ [0, 2π). Actually, each R_1 denotes a rotation transformation in R² (i.e., the two-dimensional real Euclidean space), while each R_2 denotes a reflection transformation in R². Since J(R_1) = J(R_2), for simplicity we consider just R_1. By differentiating J(R_1) with respect to θ, we have

dJ(R_1)/dθ = 4 cos θ sin θ [ −(k_11 + k_22) cos²θ + (k_12 + k_21) sin²θ ],   (2)
d²J(R_1)/dθ² = 4(k_11 + k_22)(3 sin²θ cos²θ − cos⁴θ) + 4(k_12 + k_21)(3 sin²θ cos²θ − sin⁴θ).   (3)

¹ Note that here, we additionally assume that A is square and invertible.
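As a quick numerical sanity check of equations 2 and 3 (this sketch is not part of the note; the entries of K below are illustrative Case I values), the analytic derivative can be compared against finite differences, and the local maxima at multiples of π/2 verified directly:

```python
import math

def J(theta, K):
    # J(R1) for the rotation R1(theta); K = [[k11, k12], [k21, k22]]
    c, s = math.cos(theta), math.sin(theta)
    R = [[c, -s], [s, c]]
    return sum(R[i][j] ** 4 * K[i][j] for i in range(2) for j in range(2))

def dJ(theta, K):
    # equation 2
    c, s = math.cos(theta), math.sin(theta)
    return 4 * c * s * (-(K[0][0] + K[1][1]) * c * c + (K[0][1] + K[1][0]) * s * s)

K = [[2.0, 0.5], [0.7, 1.5]]   # Case I: all k_ij > 0 (assumed values)
for th in (0.3, 1.1, 2.0, 4.7):
    fd = (J(th + 1e-6, K) - J(th - 1e-6, K)) / 2e-6   # central difference
    assert abs(fd - dJ(th, K)) < 1e-4
# theta = 0 and pi/2 are local maxima: J drops on both sides.
for th in (0.0, math.pi / 2):
    assert J(th, K) > J(th + 0.05, K) and J(th, K) > J(th - 0.05, K)
```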
Assume that the one-bit-matching condition is satisfied for n = 2 with ν_1^s > ν_2^s and k_1^m > k_2^m; we then have the following two possibilities for the matrix K.

Case I. All k_ij > 0; that is, both sources are subgaussian or supergaussian. The necessary condition for local maxima is obtained by setting equation 2 to zero, which gives

sin θ cos θ = 0   (4)

or

tan²θ = (k_11 + k_22) / (k_12 + k_21).   (5)
The roots of equation 5 correspond to local minima. Solving equation 4, we have θ̂ = 0, π/2, π, 3π/2, which are local maxima since equation 3 is negative at θ = θ̂. It can be readily verified that each local maximum θ̂ leads R_1 to a permutation matrix up to sign indeterminacy in this case.

Case II. k_11, k_22 > 0 while k_12, k_21 < 0; that is, the mixture consists of one subgaussian and one supergaussian source. In this case, no real roots exist for equation 5. Among the roots θ̂ = 0, π/2, π, 3π/2 of equation 4, only 0 and π are local maxima, since equation 3 is negative there. The remaining two roots correspond to local minima. Again, it can be readily verified that each local maximum leads R_1 to a permutation matrix up to sign indeterminacy.

Combining the results of cases I and II completes the proof.

According to this theorem, when the skewness of each model probability density function is set to zero, the two-source ICA problem can be successfully solved using any local descent algorithm of the typical objective function as long as the one-bit-matching condition is satisfied.

Acknowledgments

This work was supported by a grant from the Research Grant Council of the Hong Kong SAR (Project CUHK 4184/03E) and by the Natural Science Foundation of China for Project 60071004.

References

Amari, S. I., Cichocki, A., & Yang, H. (1996). A new learning algorithm for blind separation of sources. In D. Touretzky, M. Mozer, & M. Hasselmo (Eds.), Advances in Neural Information Processing, 8 (pp. 757–763). Cambridge, MA: MIT Press.
Bell, A. J., & Sejnowski, T. J. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7, 1129–1159.
Cardoso, J. F. (1999). Infomax and maximum likelihood for source separation. IEEE Signal Processing Letters, 4, 112–114.
Liu, Z. Y., Chiu, K. C., & Xu, L. (2004). One-bit-matching conjecture for independent component analysis. Neural Computation, 16, 383–399.
Welling, M., & Weber, M. (2001). A constrained EM algorithm for independent component analysis. Neural Computation, 13, 677–689.
Xu, L., Cheung, C. C., & Amari, S. I. (1998). Further results on nonlinearity and separation capability of a linear mixture ICA method and learned LPM. In C. Fyfe (Ed.), Proceedings of the I&ANN'98 (pp. 39–45). Tenerife, Spain: NAISO Academic Press.

Received March 24, 2004; accepted June 22, 2004.
LETTER
Communicated by Andrew Barto
Robust Reinforcement Learning Jun Morimoto
[email protected] Computational Brain Project, ICORP, JST, Soraku-gun, Kyoto 619-0288, Japan; ATR Computational Neuroscience Laboratories, Soraku-gun, Kyoto 619-0288, Japan
Kenji Doya
[email protected] ATR Computational Neuroscience Laboratories, Soraku-gun, Kyoto 619-0288, Japan; Initial Research Project, OIST, Gushikawa, Okinawa, 904-2234, Japan; and CREST, JST, Soraku-gun, Kyoto 619-0288, Japan
This letter proposes a new reinforcement learning (RL) paradigm that explicitly takes into account input disturbance as well as modeling errors. The use of environmental models in RL is quite popular for both offline learning using simulations and for online action planning. However, the difference between the model and the real environment can lead to unpredictable, and often unwanted, results. Based on the theory of H∞ control, we consider a differential game in which a "disturbing" agent tries to make the worst possible disturbance while a "control" agent tries to make the best control input. The problem is formulated as finding a min-max solution of a value function that takes into account the amount of the reward and the norm of the disturbance. We derive online learning algorithms for estimating the value function and for calculating the worst disturbance and the best control in reference to the value function. We tested the paradigm, which we call robust reinforcement learning (RRL), on the control task of an inverted pendulum. In the linear domain, the policy and the value function learned by online algorithms coincided with those derived analytically by the linear H∞ control theory. For a fully nonlinear swing-up task, RRL achieved robust performance with changes in the pendulum weight and friction, while a standard reinforcement learning algorithm could not deal with these changes. We also applied RRL to the cart-pole swing-up task, and a robust swing-up policy was acquired.

Neural Computation 17, 335–359 (2005)   © 2005 Massachusetts Institute of Technology

1 Introduction

In this letter, we propose a new reinforcement learning paradigm that we call robust reinforcement learning (RRL). Plain, model-free reinforcement learning (RL) requires a significant number of trials to learn policies for given tasks, and therefore it is difficult to use for online learning of real-world problems. Thus, the use of environmental models has been quite common for both online action planning (Doya, 2000) and off-line learning by simulation (Morimoto & Doya, 2000). However, no model is perfect, and modeling errors can cause unpredictable results, sometimes worse than with no model at all. In fact, robustness against model uncertainty has been the main subject of research in the control community for the past 20 years, and the result is formalized as the H∞ control theory (Zhou, Doyle, & Glover, 1996). In general, a modeling error causes a deviation of the real system state from the state predicted by the model. This can be reinterpreted as a disturbance to the model. However, the problem is that the disturbances due to modeling error can be strongly correlated, and thus standard independence assumptions may not be valid. The basic strategy to achieve robustness is to keep the sensitivity γ of the feedback control loop against a disturbance input small enough so that any disturbance due to modeling error can be suppressed if the gain of the mapping from the state error to the disturbance is bounded by 1/γ. In the H∞ paradigm, those disturbance-to-error and error-to-disturbance gains are measured by the max norms of the functional mappings in order to ensure stability for any mode of disturbance. In section 2, we briefly introduce the H∞ paradigm and show that the design of a robust controller can be achieved by finding a min-max solution of a value function, which is given by the Hamilton-Jacobi-Isaacs (HJI) equation. In section 3, we formulate robust reinforcement learning. We first define an augmented value function that takes into account the amplitude of the disturbance and derive a corresponding HJI equation. We then derive an online algorithm for estimating the value function, the worst-case disturbances, and the optimal controller. In section 4, we test the validity of the algorithms first in a linear inverted pendulum task.
It is verified that the value function as well as the disturbance and control policies derived by the online algorithms coincide with the analytical solution given by H∞ theory. We then compare the performance of the robust reinforcement learning algorithm with standard model-based RL in the nonlinear task of pendulum swing-up (Doya, 2000). It is shown that a robust RL controller can accommodate changes in physical parameters of the pendulum that a standard RL controller cannot cope with (Morimoto & Doya, 2001b). We also applied robust RL to the cart pole swing-up task (Doya, 2000) and tested the robustness of the learned controller. 2 H∞ Control Standard H∞ control (Zhou et al., 1996) deals with a system shown in Figure 1, where G is the plant, K is the controller, u is the control input, y is the measurement available to the controller, w is an unknown disturbance, and z is the error output that is desired to be kept small. In general, the controller K is designed to stabilize the closed-loop system based on a model of the plant
Figure 1: (A) Generalized plant G and controller K. (B) Small gain theorem.
G. However, when there is a discrepancy between the model and the actual plant dynamics, the feedback loop could be unstable. The effect of modeling error can be equivalently represented as a disturbance w generated by an unknown mapping of the plant output z, as shown in Figure 1B. The goal of the H∞ control problem is to design a controller K that brings the error z to zero while minimizing the H∞ norm of the closed-loop transfer function T_zw from the disturbance w to the output z:

‖T_zw‖_∞ = sup_w ‖z‖_2 / ‖w‖_2,   (2.1)
where ‖·‖_2 denotes the L_2 norm. The small gain theorem ensures that if ‖T_zw‖_∞ ≤ γ, then the system shown in Figure 1B will be stable for any stable mapping Δ : z → w with ‖Δ‖_∞ < 1/γ (Zhou et al., 1996).

2.1 Min-Max Solution to the H∞ Problem. We consider the plant G with dynamics given by

ẋ = f(x, u, w),   (2.2)
where x ∈ X ⊂ R^n is the state, u ∈ U ⊂ R^m is the control input, and w ∈ W ⊂ R^l is the disturbance input. From equation 2.1 and the small gain theorem, the H∞ control problem can be considered as the problem of finding a controller that satisfies the constraint

‖T_zw‖_∞² = sup_w ‖z‖_2² / ‖w‖_2² ≤ γ²,   (2.3)

where z is the error output. In other words, because the norms ‖z‖_2 and ‖w‖_2 are defined as

‖z‖_2² = ∫_0^∞ z^T(t) z(t) dt,   (2.4)
‖w‖_2² = ∫_0^∞ w^T(t) w(t) dt,   (2.5)
the H∞ control problem is equivalent to finding a control input u that satisfies the constraint

V = ∫_0^∞ ( z^T(t) z(t) − γ² w^T(t) w(t) ) dt ≤ 0   (2.6)
against all possible disturbances w with x(0) = 0. We can consider this problem as a differential game (Weiland, 1989) in which the best control output u that minimizes V is sought while the worst disturbance w that maximizes V is chosen. Thus, an optimal value function V* is defined as

V* = min_u max_w ∫_0^∞ ( z^T(t) z(t) − γ² w^T(t) w(t) ) dt.   (2.7)
The condition for the optimal value function is given by

0 = min_u max_w [ z^T z − γ² w^T w + (∂V*/∂x) f(x, u, w) ],   (2.8)
which is known as the Hamilton-Jacobi-Isaacs (HJI) equation. From equation 2.8, we can derive the optimal control output u and the worst disturbance w by solving

∂(z^T z)/∂u + (∂V*/∂x) (∂f(x, u, w)/∂u) = 0,   (2.9)
−2γ² w + (∂V*/∂x) (∂f(x, u, w)/∂w) = 0.   (2.10)
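As a toy illustration of the H∞ norm in equation 2.1 and the small gain theorem (this example is not part of the letter; the first-order scalar plant is an assumption), the norm can be evaluated as the peak frequency-response magnitude:

```python
# Scalar transfer function T_zw(s) = 1/(s + a), an assumed stable first-order lag.
a = 2.0

def mag(omega):
    # |T_zw(j*omega)| = 1 / sqrt(omega^2 + a^2)
    return 1.0 / (omega ** 2 + a ** 2) ** 0.5

# H-infinity norm = sup over frequency of the magnitude (peak at omega = 0 here).
hinf = max(mag(0.01 * w) for w in range(20000))
assert abs(hinf - 1.0 / a) < 1e-6
# Small gain theorem: with static feedback w = d*z on xdot = -a*x + w, z = x,
# the closed-loop pole is s = -(a - d), stable for any gain d < a = 1/hinf.
```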
3 Robust Reinforcement Learning Here we consider a continuous-time formulation of reinforcement learning (Doya, 2000) with the system dynamics x˙ = f (x, u, w),
(3.1)
and the reward r(x, u), where x ∈ X ⊂ Rn is the state, u ∈ U ⊂ Rm is the control output, and w ∈ W ⊂ Rl is the disturbance input. The basic goal is to find a policy u = g(x) that maximizes the cumulative future reward,
∫_t^∞ e^{−(s−t)/τ} r(x(s), u(s)) ds,   (3.2)
for any given state x(t), where τ is a time constant of evaluation. However, a particular policy that was optimized for a certain environment may perform
badly when the environmental setting changes. In order to ensure robust performance under a changing environment or unknown disturbance, we introduce the notion of worst disturbance in H∞ control to the reinforcement learning paradigm. In this framework, we consider an augmented reward, q(t) = r(x(t), u(t)) + ω(w(t)),
(3.3)
where ω(w(t)) is an additional reward for withstanding a disturbing input; for example, we can use a quadratic cost for the disturbance input to be consistent with equation 2.7,

ω(w) = γ² w^T w,   (3.4)
where γ is the robustness parameter. The augmented value function is then defined as
V(x(t)) = ∫_t^∞ e^{−(s−t)/τ} q(x(s), u(s), w(s)) ds.   (3.5)
The optimal value function is given by the solution of a variant of the HJI equation,

(1/τ) V*(x) = max_u min_w [ r(x, u) + ω(w) + (∂V*/∂x) f(x, u, w) ].   (3.6)
In the RRL paradigm, the value function is updated by using the temporal difference (TD) error (Doya, 2000),

δ(t) = q(t) − (1/τ) V(t) + V̇(t),   (3.7)
while the best action and the worst disturbance are generated by maximizing and minimizing, respectively, the right-hand side of the HJI equation,

r(x, u) + ω(w) + (∂V*/∂x) f(x, u, w).   (3.8)
We use a function approximator to implement the value function V(x(t); v), where v = (v_1, ..., v_n) is a parameter vector. As in standard continuous-time RL, we use the learning rule for the value function approximator

v̇_i = η δ(t) e_i(t),   (3.9)
Figure 2: Actor-disturber-critic architecture.
where η denotes the learning rate and e_i denotes the eligibility trace for a parameter v_i (Doya, 2000). The eligibility trace is updated by

ė_i(t) = −(1/κ) e_i(t) + ∂V(t)/∂v_i.   (3.10)
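A discrete-time sketch of the critic update, combining equations 3.7, 3.9, and 3.10 (the Euler discretization, the linear feature vector, the toy rollout, and all constants are assumptions for illustration, not taken from the letter):

```python
dt, tau, kappa, eta = 0.01, 1.0, 0.1, 0.5

def phi(x):                    # assumed feature vector for V(x; v) = sum_i v_i*phi_i(x)
    return [x, x * x]

def V(x, v):
    return sum(vi * fi for vi, fi in zip(v, phi(x)))

def critic_step(x, x_next, q, v, e):
    # eq. 3.7: delta = q - V/tau + Vdot, with Vdot approximated by a finite difference
    delta = q - V(x, v) / tau + (V(x_next, v) - V(x, v)) / dt
    # eq. 3.10: e_i_dot = -e_i/kappa + dV/dv_i, where dV/dv_i = phi_i(x)
    e = [ei + dt * (-ei / kappa + fi) for ei, fi in zip(e, phi(x))]
    # eq. 3.9: v_i_dot = eta * delta * e_i
    v = [vi + dt * eta * delta * ei for vi, ei in zip(v, e)]
    return v, e

v, e, x = [0.0, 0.0], [0.0, 0.0], 1.0
for _ in range(200):           # toy rollout: state decays, reward q = -x^2
    x_next = x + dt * (-x)
    v, e = critic_step(x, x_next, -x * x, v, e)
    x = x_next
```

With a consistently negative reward along the rollout, the learned weights drift negative, as expected for a cost-dominated value estimate.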
By comparing equations 2.8 and 3.6, we can see that the output error z^T z in the H∞ framework is generalized as an arbitrary reward function r(x, u) in the RRL. Furthermore, the sign of the value function is flipped, and a discount factor is introduced in the RRL framework.

3.1 Model-Free Policy: Actor-Disturber-Critic. To implement robust RL in a model-free fashion, we propose the actor-disturber-critic architecture, shown in Figure 2. We define the policies of the actor and the disturber as

u(t) = s_u( A_u(x(t); v_u) + n_u(t) )   (3.11)

and

w(t) = s_w( A_w(x(t); v_w) + n_w(t) ),   (3.12)
respectively, where Au (x(t); vu ) ∈ Rm and Aw (x(t); vw ) ∈ Rl are function approximators with parameter vectors vu and vw , and nu (t) ∈ Rm and nw (t) ∈ Rl are noise terms for exploration. su () and sw () are monotonically increasing output functions. For example, we can use sigmoidal functions for su () and sw () to saturate control output and disturbance input. In the actor-disturber-critic architecture, policies of the actor and disturber are
represented by the parameter vectors v_u and v_w. The parameters of the actor and the disturber are updated by

v̇_i^u = η^u δ(t) n_i^u(t) ∂A_u(x(t); v_u)/∂v_i^u,   (3.13)
v̇_i^w = −η^w δ(t) n_i^w(t) ∂A_w(x(t); v_w)/∂v_i^w,   (3.14)
where η^u and η^w denote the learning rates.

3.2 Model-Based Policy: Using Value Gradient. When the augmented reward function q(x, u, w) in equation 3.3 is convex with respect to the action u and disturbance w, the HJI equation, 3.6, has a unique solution. Here, we assume that the augmented reward q(x, u, w) can be separated into three parts: the reward for the state L(x), the cost for the action S(u), and the cost for the disturbance Ω(w). We specifically consider the case

q(x, u, w) = L(x) − Σ_{i=1}^m S_i(u_i) + Σ_{j=1}^l Ω_j(w_j),   (3.15)
where S_i() is a convex cost function for the action variable u_i and Ω_j() is a convex cost function for the disturbance variable w_j. In this case, the condition for the optimal action and the worst disturbance is given from the HJI equation, 3.6, as

−S_i'(u_i) + (∂V(x)/∂x) (∂f(x, u, w)/∂u_i) = 0   (i = 1, ..., m),   (3.16)
Ω_j'(w_j) + (∂V(x)/∂x) (∂f(x, u, w)/∂w_j) = 0   (j = 1, ..., l),   (3.17)
where ∂f(x, u, w)/∂u_i and ∂f(x, u, w)/∂w_j are the ith and jth column vectors of the n × m input gain matrix ∂f(x, u, w)/∂u and the n × l disturbance gain matrix ∂f(x, u, w)/∂w, respectively. We now assume that the input gains ∂f(x, u, w)/∂u and ∂f(x, u, w)/∂w are not dependent on u and w; that is, the system is input-affine. Then the above equations have the unique solutions

u_i = S_i'^{-1}( (∂V(x)/∂x) (∂f(x, u, w)/∂u_i) ),   (3.18)
w_j = Ω_j'^{-1}( −(∂V(x)/∂x) (∂f(x, u, w)/∂w_j) ),   (3.19)

where S_i'() and Ω_j'() are monotonic functions. Accordingly, the greedy policies for action and disturbance are represented in vector notation as

u = S'^{-1}( (∂f(x, u, w)/∂u)^T (∂V(x)/∂x)^T ),   (3.20)
w = Ω'^{-1}( −(∂f(x, u, w)/∂w)^T (∂V(x)/∂x)^T ),   (3.21)

where (∂V(x)/∂x)^T represents the steepest ascent direction of the value function, which is then transformed by the "transpose" models (∂f(x, u, w)/∂u)^T and (∂f(x, u, w)/∂w)^T into a direction in the control output and disturbance input space, respectively (Doya, 2000).

3.3 Linear-Quadratic Case. We consider a special case in which a linear dynamics model and quadratic reward models are given or learned as

ẋ = Ax + B1 u + B2 w,   (3.22)
q(x, u, w) = −x^T Q x − u^T R u + γ² w^T w.   (3.23)
In this case, the value function is given by the quadratic form V(x) = −x^T P x, where P is the solution of a Riccati equation:

A^T P + P A + P ( (1/γ²) B2 B2^T − B1 R^{-1} B1^T ) P + Q = (1/τ) P.   (3.24)

From ∂f/∂u = B1, ∂f/∂w = B2, S'(u) = 2Ru, Ω'(w) = 2γ²w, and ∂V/∂x = −2x^T P, the best action u in equation 3.20 and the worst disturbance w in equation 3.21 can be derived as

u = R^{-1} B1^T P x,   (3.25)
w = −(1/γ²) B2^T P x.   (3.26)
4 Simulation In this section, we show the performance of robust RL. In section 4.1, we consider the control problem of a linear inverted pendulum (see Figure 10 and appendix B) and test whether the parameters of the linear controller acquired by our robust RL algorithm converge to the analytical H∞ solution. In section 4.2, we apply robust RL to a nonlinear pendulum swing-up task (see appendix A) and compare the robustness of the controllers learned by robust RL and standard RL. We also test how the performance of the robust
RL controllers changes with different settings of the robustness parameter γ. In section 4.3, we apply robust RL to the cart-pole swing-up task (Doya, 2000; see Figure 11 and appendix C). We show that robust RL is applicable to a challenging nonlinear control task with a higher-dimensional state space. (Refer to the appendixes for the outline of the three benchmark tasks.) Below are the function approximation methods we used for the value functions, the dynamics models, and the policies in the three tasks.

Value function. We use normalized gaussian networks as the function approximator to represent the value function,

V(x; v) = Σ_{k=1}^K v_k b_k(x),   (4.1)
where b_k() is the kth normalized gaussian basis function and v = (v_1, ..., v_K)^T is the weight vector (see appendix D). In the case of linear dynamics and the quadratic reward, instead of using the normalized gaussian networks, we approximate the value function by the quadratic form

V(x) = −x^T P x,   (4.2)
where P is a symmetric matrix, the parameter of the function approximator. By using this quadratic form, we can exactly represent the actual value function. The update rules of these function approximators are given by equations 3.9 and 3.10.

Dynamics model. We approximate the dynamics model without disturbance input by using the function approximator f̂(x, u; v_M),

ẋ − (∂f(x, u, w)/∂w) w ≈ f̂(x, u; v_M),   (4.3)

where v_M is a parameter vector. Because we assume that the system is input-affine as described in section 3.2, we can remove the disturbance term from the dynamics f(x, u, w) by subtracting (∂f(x, u, w)/∂w) w from ẋ. Then we can approximate the input gain using the approximated model f̂(x, u; v_M) as

∂f̂(x, u; v_M)/∂u ≈ ∂f(x, u, w)/∂u.   (4.4)
For linear dynamics, we use the linear approximation

ẋ − B2 w ≈ Â x + B̂1 u,   (4.5)
where Â and B̂1 are parameter matrices. The update rule of the function approximator is given in appendix D.
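Appendix D is not included in this excerpt, so the following sketch assumes a standard normalized gaussian basis with fixed centers and a common width; it only illustrates the form of equation 4.1:

```python
import math

centers = [-1.0, -0.5, 0.0, 0.5, 1.0]   # assumed basis centers
sigma = 0.4                              # assumed common width

def basis(x):
    act = [math.exp(-(x - c) ** 2 / (2 * sigma ** 2)) for c in centers]
    s = sum(act)
    return [a / s for a in act]          # normalization: sum_k b_k(x) = 1

def V(x, v):
    # eq. 4.1: V(x; v) = sum_k v_k * b_k(x)
    return sum(vk * bk for vk, bk in zip(v, basis(x)))

v = [0.0, 0.2, 1.0, 0.2, 0.0]            # assumed weight vector
assert abs(sum(basis(0.3)) - 1.0) < 1e-12
```

Because the basis is normalized, V(x; v) is a convex combination of the weights at every x, which keeps the approximator well behaved between the centers.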
Actor and disturber. The actor and the disturber are represented by equations 3.11 and 3.12. The policies corresponding to the reward function (see equation A.3 in appendix A) are given by

u = (1/2) R^{-1} ( A_u(x; v_u) + n_u ),   (4.6)
w = (1/(2γ²)) ( A_w(x; v_w) + n_w ).   (4.7)
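A minimal sketch of one actor-disturber step in the linear case, combining equations 4.6 and 4.7 with the update rules 3.13 and 3.14; the scalar state, the fixed exploration noise values, the TD error, and all constants are assumptions for illustration:

```python
R, gamma = 1.0, 1.5
eta_u, eta_w, dt = 0.1, 0.1, 0.01

def policies(x, vu, vw, nu, nw):
    Au, Aw = vu * x, vw * x                       # linear actor and disturber
    u = 0.5 * (1.0 / R) * (Au + nu)               # eq. 4.6
    w = (1.0 / (2.0 * gamma ** 2)) * (Aw + nw)    # eq. 4.7
    return u, w

def update(delta, x, nu, nw, vu, vw):
    # eqs. 3.13 and 3.14: correlate the TD error with the exploration noise;
    # for Au(x; vu) = vu*x, the gradient dAu/dvu is x (likewise for Aw).
    vu += dt * eta_u * delta * nu * x    # actor climbs the TD error
    vw -= dt * eta_w * delta * nw * x    # disturber descends it
    return vu, vw

vu, vw = -10.0, 0.0
u, w = policies(0.2, vu, vw, nu=0.3, nw=-0.1)
vu2, vw2 = update(delta=1.0, x=0.2, nu=0.3, nw=-0.1, vu=vu, vw=vw)
# A positive TD error reinforces the actor along its noise direction and
# pushes the disturber against its noise direction.
assert vu2 > vu and vw2 > vw
```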
We use normalized gaussian networks as the function approximator (see appendix D). The update rule of the function approximator is given in equations 3.13 and 3.14. In the linear quadratic case, the actor and the disturber in equations 3.11 and 3.12 are represented as linear controllers, A_u(x; v_u) = v_u^T x and A_w(x; v_w) = v_w^T x, respectively, where v_u ∈ R^m and v_w ∈ R^l. The output functions s_u() and s_w() are identities.

Value-gradient-based policies. Value-gradient-based policies for action and disturbance are given in equations 3.20 and 3.21. For the pendulum swing-up task, we use policies that correspond to the reward function (see equation A.3),
(4.8) (4.9)
For the cart pole swing-up task, we use value-gradient-based policies which correspond to the reward function (see equation C.7), 1 ∂ f (x, u, w) T ∂V(x) T max u=u s , (4.10) c ∂u ∂x 1 ∂ f (x, u, w) T ∂V(x) T max (j = 1, 2). (4.11) wj = wj s − dj ∂wj ∂x Exploration strategy. In order to promote exploration, we incorporated a noise term in both policies, in equations 3.12 and 3.13 and equations 3.21 and 3.22. We used low-pass filtered noise τn n˙ u (t) = −nu (t) + σu Nu (t), and τn n˙ w (t) = −nw (t)+σw Nw (t), where Nu (t) and Nw (t) denote normal gaussian noise (Doya, 2000). Simulation setup. The physical systems were simulated by the fourthorder Runge-Kutta method, and the learning dynamics was simulated by the Euler method, both with the time step of 0.01 sec. Because calculating learning dynamics requires more computational resources and less accuracy than the physical dynamics of the pendulum and the cart pole, we used the faster numerical integration method (the Euler method) for the learning dynamics.
Robust Reinforcement Learning
345
4.1 Linear Inverted Pendulum. We first considered a linear problem in order to test whether the value function and the policy learned by robust RL coincide with the analytic solution of the H∞ control problem. Thus, we considered only the pendulum dynamics near the unstable equilibrium point x = (0, 0)ᵀ. The parameters for the value function in equation 4.2 were given as

P = [p11, p12; p12, p22].   (4.12)
Each trial was started from an initial state x(0) = (θ(0), θ̇(0)), where θ(0) and θ̇(0) were selected from a uniform distribution that ranged over −π/6 < θ(0) < π/6 and −π/6 < θ̇(0) < π/6, respectively. We terminated a trial after 2 seconds, or when the state of the pendulum reached |θ| > π/6. Parameters for the learning process were given as η = 1.0, ηu = 0.1, ηw = 0.1, τ = 1.0, κ = 0.1, R = 1.0, γ = 1.5, τn = 0.02, σu = 3.0, and σw = 2.0.

4.1.1 Actor-Disturber-Critic. Here, we used robust RL implemented by the actor-disturber-critic shown in equations 4.6 and 4.7. We initialized the parameters of the actor as vu = {−10.0, −1.0} and those of the disturber as vw = {0.0, 0.0}, which made the system stable. The parameters of the value function P in equation 4.12 were initialized with {p11, p12, p22} = {−1.0, −10.0, −1.0}. The initial parameters of the value function are consistent with the initial parameters of the actor (i.e., vu = R⁻¹B1ᵀP). Figures 3A and 3B show the learning performance of the actor-disturber-critic architecture. The parameters of the actor vu and the disturber vw converged to values close to those of the policies in equations 3.25 and 3.26 derived by solving the Riccati equation, 3.25.

4.1.2 Value-Gradient-Based Robust Policy. Value-gradient-based policies for action and disturbance are given based on equations 3.25 and 3.26. The parameter matrix P was learned instead of using the solution of the Riccati equation, 3.25. Again, the parameters of the value function in equation 4.12 were initialized with {p11, p12, p22} = {−1.0, −10.0, −1.0}. The parameter matrices Â and B̂1 for the dynamics model were initialized as Â = 0 and B̂1 = 0. The results in Figure 3C show that the parameters of the value function P nearly converged to the solution of the Riccati equation, 3.25. As shown in Figure 3D, the model parameter B̂1 = (b1, b2)ᵀ used in the controller, equation 3.26, also converged to the exact model in equation A.2.
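The analytic target of this comparison can be reproduced numerically by driving the game-theoretic Riccati equation of linear H∞ control to a fixed point. The sketch below is ours: it uses the linearized dynamics of appendix B and follows the standard cost-minimization sign convention, so its P is the sign-flipped counterpart of the value-function parameters quoted above.

```python
import numpy as np

# Linearized pendulum (appendix B): m = l = 1.0, mu = 0.01, g = 9.8
g, l, m, mu = 9.8, 1.0, 1.0, 0.01
A  = np.array([[0.0, 1.0], [g / l, -mu / (m * l**2)]])
Bu = np.array([[0.0], [1.0 / (m * l**2)]])   # control input matrix (B1)
Bw = np.array([[0.0], [1.0]])                # disturbance input matrix (B2)
Q, R, gamma = np.diag([0.5, 0.0]), 1.0, 1.5

# Game Riccati equation: A'P + PA + Q - P (Bu R^-1 Bu' - gamma^-2 Bw Bw') P = 0
S = Bu @ Bu.T / R - Bw @ Bw.T / gamma**2

# Integrate the differential Riccati equation from P = 0 until it stops changing.
P, dt = np.zeros((2, 2)), 5e-3
for _ in range(40_000):
    P = P + dt * (A.T @ P + P @ A + Q - P @ S @ P)

residual = A.T @ P + P @ A + Q - P @ S @ P
K_u = -(Bu.T @ P) / R          # optimal control gain, u = K_u x
K_w = (Bw.T @ P) / gamma**2    # worst-case disturbance gain, w = K_w x
```

Shrinking γ enlarges the disturbance term inside S, which is the mechanism behind the more conservative controllers reported for smaller robustness parameters.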
4.2 Nonlinear Pendulum Swing-Up. Now we consider the original nonlinear dynamics, equation A.2. We used the reward function defined
346
J. Morimoto and K. Doya
Figure 3: Learning performance of (A) estimated value function for the actor-disturber-critic, (B) the actor and the disturber, (C) estimated value function for the value-gradient-based robust policy, and (D) estimated dynamics model. The dash-dot lines show the exact values of the parameters.
in equation A.3 with the control cost R = 0.05. The value and policy functions were implemented by normalized gaussian networks with 21 × 21 bases for each dimension of the state space {θ, θ̇}. A 4 × 4 × 2 basis network for the three-dimensional space {θ, θ̇, u} was used for modeling the system dynamics.

4.2.1 Learning Performance. Each trial was started from an initial state x(0) = (θ(0), 0.0), where θ(0) was selected from a uniform distribution that ranged over −π < θ(0) < π.
We set the learning parameters as η = 10.0, ηu = 5.0, ηw = 5.0, τ = 1.5, κ = 0.1, and τn = 0.2. The sizes of the perturbations σu and σw were tapered off as the performance improved (Gullapalli, 1990). We took the modulation scheme σu = 2.0 min[1, max[0, (V1 − V(t))/(V1 − V0)]] and σw = 1.5 min[1, max[0, (V1 − V(t))/(V1 − V0)]], where V0 = −2.0 and V1 = 0.0 are the minimal and maximal levels of the expected reward. We used a range of the robustness parameter γ, starting from 1.0 and gradually reducing it by 0.05, and found the smallest value with which the learning method acquired the swing-up policy. As we make γ smaller, the policy becomes more robust and more conservative.

For the actor-disturber-critic, we initialized the parameters of the function approximators as v = 0, vu = 0, and vw = 0. The robustness parameter was set to γ = 0.45. For the value-gradient-based robust policy, we initialized the parameters of the value function as v = 0 and of the dynamics model as vM = 0. The robustness parameter was set to γ = 0.25. As suggested in Doya (2000), using the learned dynamics model and the value gradient has an advantage for learning the pendulum swing-up task. We found that the value-gradient method also has an advantage for learning a more robust policy compared to the actor-disturber-critic. Therefore, we can use a smaller robustness parameter (γ = 0.25) than that of the actor-disturber-critic method (γ = 0.45).

As a measure of the swing-up performance, we defined the time for which the pendulum stayed up (|θ| < π/4) as tup. A trial lasted for 20 seconds unless the pendulum was over-rotated (|θ| > 5π). Upon such a failure, the trial was terminated with a reward r(t) = −2.0 for 0.5 second. We compared the learning performance of four learning schemes: actor-disturber-critic, actor-critic, value-gradient-based robust policy, and value-gradient-based standard policy. Figure 4 shows the results of learning performance.
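The perturbation schedule above is a clipped linear ramp in the current performance estimate; a one-function sketch (our naming):

```python
def noise_scale(V, sigma0, V0, V1):
    """sigma = sigma0 * min[1, max[0, (V1 - V)/(V1 - V0)]]:
    full-size exploration noise at the minimal expected reward V0,
    tapering to zero as performance V approaches the maximal level V1."""
    return sigma0 * min(1.0, max(0.0, (V1 - V) / (V1 - V0)))

# Pendulum task: sigma_u ramps from 2.0 down to 0 as V rises from -2.0 to 0.0
sigma_u = noise_scale(V=-1.0, sigma0=2.0, V0=-2.0, V1=0.0)  # halfway: 1.0
```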
The actor-critic and the value-gradient-based standard policy learn the swing-up task faster than the actor-disturber-critic and the value-gradient-based robust policy, respectively. The results show that learning robust policies is slower than learning standard policies because the disturbance is explicitly considered during learning.

4.2.2 Value Functions and Policies. Figure 5 shows the value functions acquired by the value-gradient-based robust RL with robustness parameter γ = 0.25 and by the value-gradient-based standard RL (γ = ∞). The value function acquired by the value-gradient-based robust RL has a sharper ridge around the upright position, x = (θ, θ̇) = (0.0, 0.0), and a smoother slope around the bottom position, x = (θ, θ̇) = (π, 0.0) (see Figure 5A). The sharp ridge keeps the pendulum at the upright position, and the smoother slope, which induces more swings around the bottom position, is suitable for coping with modeling error. Figure 6 shows the acquired actor and disturber. The
Figure 4: Comparison of the time course of learning with different control schemes: (A) actor-disturber-critic, (B) actor-critic, (C) value-gradient-based robust policy, and (D) value-gradient-based standard policy. tup : time in which the pendulum stayed up.
disturber learned the policy that mainly disturbs the actor's policy around the upright position, x = (θ, θ̇) = (0.0, 0.0).

4.2.3 Robustness to Parameter Changes. In Figure 7, we compare the robustness of the value-gradient-based robust policy and the value-gradient-based standard policy to changes of the physical parameters. Both policies learned to swing up and hold a pendulum with weight m = 1.0 kg and coefficient of friction µ = 0.01 (see Figure 7A). The value-gradient-based robust policy took more swings, indicating its conservative control law.
Figure 5: Shape of the value function acquired by the value-gradient-based RL after 1000 learning trials. (A) Robust RL (γ = 0.25). (B) Standard RL (γ = ∞).
Figure 6: Shape of the control and disturbance functions after 1000 learning trials. The robustness parameter was γ = 0.45. (A) Actor. (B) Disturber.
Figure 7: Swing-up trajectories with a pendulum with different weight and friction. The dash-dot lines show the upright position. The initial position was θ(0) = (5/6)π. (A) m = 1.0, µ = 0.01. (B) m = 3.0, µ = 0.5.
Table 1: Comparison with Different Robustness Parameters (Average Performance of 10 Learned Controllers).

γ            0.25   0.5   1.0    ∞
Max m [kg]    5.9   3.9   3.2   3.1
Next, we applied both the robust and standard policies to a pendulum that has different physical parameters (m = 3.0 kg, µ = 0.5) from the originally learned environment (m = 1.0 kg, µ = 0.01). As shown in Figure 7B, the robust policy could successfully swing the pendulum up, while the standard policy could not. This result shows the robustness of the policy acquired by robust RL.

We also tested how the robustness parameter γ affects the robust performance of the learned controller. We compared the value-gradient-based robust policy with three robustness parameters (γ = 0.25, 0.5, 1.0) and the value-gradient-based standard policy (γ = ∞). We trained these controllers with the pendulum mass m = 1.0 kg. Then we compared the maximum weight that each controller could successfully swing up, where the initial joint angle was θ(0) = (5/6)π and the coefficient of friction was the same as in the originally learned environment, µ = 0.01. A trial was regarded as successful when tup > 5 seconds. We generated 10 controllers for each robustness parameter γ and then compared the average performance. The results of the comparison in Table 1 show more robust performance with smaller γ. These results are consistent with H∞ control theory (Zhou et al., 1996).

4.3 Cart Pole Swing-Up. We compared the robust performance of the robust RL controller with the standard RL controller in the cart pole swing-up task. The value and policy functions were implemented by normalized gaussian networks with 7 × 7 × 21 × 21 bases for each dimension of the state space {x, v, θ, θ̇}. A 2 × 2 × 4 × 2 × 4 basis network for the five-dimensional space {x, v, θ, θ̇, u} was used for modeling the system dynamics. When the cart bumped into the end of the track or when the pole over-rotated (|θ| > 5π), a terminal reward r(t) = −1.0 was given for 0.5 second. Otherwise, a trial lasted for 20 seconds.
We set the learning parameters as η = 20.0, τ = 1.0, κ = 0.3, ηM = 10.0, τn = 0.2, σu = 2.0 min[1, max[0, (V1 − V(t))/(V1 − V0)]], and σw = 2.0 min[1, max[0, (V1 − V(t))/(V1 − V0)]], where V0 = −1.0 and V1 = 0.0. Each trial was started from an initial state x(0) = (0.0, 0.0, θ(0), 0.0), where θ(0) was selected from a uniform distribution that ranged over −π < θ(0) < π.
Figure 8: Success rate of the cart pole swing-up task (average success rate of 10 learned controllers).
Here, we compared the average swing-up performance after 1000 learning trials. We changed the pole mass from mp = 0.1 kg to mp = 0.35 kg in steps of 0.05 kg. Each trial was started from an initial joint angle θ(0) selected from a uniform distribution that ranged over −π < θ(0) < π. Figure 8 shows the success rate of the cart pole swing-up task. The success rate was derived from the number of successful swing-ups in 100 trials. A trial was regarded as successful when tup > 5 seconds, where we defined the time for which the pole stayed up (|θ| < 12 deg) as tup. We generated 10 robust controllers (policies) and 10 standard controllers (policies) and then compared the average success rates. We used the robustness parameters defined in equation C.9 in appendix C. The success rate using the robust policy with the pole mass mp = 0.35 kg was about 80%, while that using the standard policy was about 30%. This result shows the robustness of the acquired robust policy. Figure 9A shows that the robust policy could swing up the pole with a cart pole that has different physical parameters (pole mass mp = 0.35 kg, pole friction µp = 0.005, cart mass mc = 1.5 kg) from the original environment (mp = 0.1 kg, µp = 2.0 × 10⁻⁶, mc = 1.0 kg). Figure 9B shows that the standard policy failed to swing up the pole with the same cart pole used in Figure 9A. These results suggest that robust RL can possibly be applied to more difficult tasks with higher-dimensional state spaces.

5 Discussion

H∞ control theory gives an analytical solution only for linear systems. For nonlinear systems, there is no analytical way of solving the HJI equation.
352
J. Morimoto and K. Doya
Figure 9: Cart pole swing-up trajectories with pole mass mp = 0.35 kg, pole friction µp = 0.005, and cart mass mc = 1.5 kg. The initial joint angle was θ(0) = π. (A) Value-gradient-based robust policy. (B) Value-gradient-based standard policy.
In order to derive a nonlinear H∞ controller, the value function is usually derived by using dynamic programming (Imafuku, 1999; Coraluppi & Marcus, 1999). A similar approach to the methods using dynamic programming was proposed by Nilim and Ghaoui (2004) to cope with uncertain state transition matrices. However, these methods need off-line calculation and an environmental model. Robust RL can derive a nonlinear H∞ controller by online calculation and without any prior knowledge of an environmental model.

Robust RL provides a new way of using the min-max solution in RL problems. Min-max RL was applied to games like backgammon (Tesauro, 1992) and Othello (Yoshioka & Ishii, 1998). Minimax Q-learning was proposed by Littman (1994, 2001) and applied to zero-sum games. However, in these studies, each player takes the same role. Min-max RL was also applied to a problem in which an airplane tries to avoid a missile and the missile tries to catch the airplane (Harmon, Baird, & Klopf, 1995). However, this study focused on only linear control problems. We applied min-max RL, in which each of the two players (the control agent and the disturbing agent) has a different ability, to the nonlinear problems of pendulum swing-up and cart pole swing-up.

Risk-sensitive control studies are also related to our RRL framework and have an interesting application to the field of finance. Neuneier and Mihatsch (1998) proposed risk-sensitive reinforcement learning and applied their method to acquire an optimal investment policy for an artificial stock price. Robustness during the learning process in the RL framework was studied in Singh, Barto, Grupen, and Connolly (1994) by using a constrained policy space and in Kretchmar et al. (2001) by combining standard RL with a linear robust controller. Although the target of our study is different from these studies, this kind of robustness is also important when we apply RL to real-world problems.
6 Conclusion In this study, we proposed a new RL paradigm called robust reinforcement learning (RRL). We showed that RRL can learn the analytic solution of the linear H∞ controller for the inverted pendulum dynamics around the upright position and also showed that RRL can deal with modeling error that standard RL cannot in the nonlinear inverted pendulum swing-up simulation example. We also applied RRL to the cart pole swing-up task, and a robust swing-up policy was acquired by RRL. We will apply RRL to more complex tasks like learning stand-up behaviors (Morimoto & Doya, 2000, 2001a) and biped locomotion (Atkeson & Morimoto, 2003; Morimoto & Atkeson, 2003) in future work.
Figure 10: Single-pendulum swing-up task. θ: joint angle, T: input torque, l: length of the pendulum.
Appendix A: Pendulum Swing-Up Task

The task of swinging a pendulum up (Doya, 2000) is a simple but strongly nonlinear control task. The dynamics of the pendulum is given by

ml²θ̈ = −µθ̇ + mgl sin θ + T,   (A.1)
where θ is the angle from the upright position, T is the input torque, µ = 0.01 is the coefficient of friction, m = 1.0 kg is the weight of the pendulum, l = 1.0 m is the length of the pendulum, and g = 9.8 m/s² is the gravity acceleration (see Figure 10).

We define the state vector as x = (θ, θ̇)ᵀ and the action as u = T. We consider a disturbance input w to the acceleration θ̈. Thus the dynamics is given by an input-affine system:

ẋ = [θ̇; (g/l) sin θ − (µ/(ml²)) θ̇] + [0; 1/(ml²)] u + [0; 1] w.   (A.2)
The augmented reward function is given by q(x, u, w) = (cos θ − 1) − Ru² + γ²w².
(A.3)
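Equations A.1 through A.3 are straightforward to simulate directly; a minimal sketch using the fourth-order Runge-Kutta integration described in section 4 (function names are ours):

```python
import numpy as np

# Pendulum parameters from appendix A
m, l, mu, g = 1.0, 1.0, 0.01, 9.8

def f(x, u, w):
    """Input-affine pendulum dynamics (equation A.2), x = (theta, theta_dot)."""
    theta, omega = x
    domega = (g / l) * np.sin(theta) - mu / (m * l**2) * omega + u / (m * l**2) + w
    return np.array([omega, domega])

def reward(x, u, w, R=0.05, gamma=0.25):
    """Augmented reward (equation A.3)."""
    return (np.cos(x[0]) - 1.0) - R * u**2 + gamma**2 * w**2

def rk4_step(x, u, w, dt=0.01):
    """One fourth-order Runge-Kutta step, as used for the physical system."""
    k1 = f(x, u, w)
    k2 = f(x + 0.5 * dt * k1, u, w)
    k3 = f(x + 0.5 * dt * k2, u, w)
    k4 = f(x + dt * k3, u, w)
    return x + (dt / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
```

At the hanging position θ = π with zero velocity and no inputs, the system is at rest and the reward term cos θ − 1 takes its minimum value of −2.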
Appendix B: Linear Inverted Pendulum

For comparison of robust RL with analytical linear H∞ control, we use a local linearization of the pendulum dynamics, equation A.2, near the unstable equilibrium point x = (0, 0)ᵀ. The coefficient matrices of the linear form,
Figure 11: Cart pole swing-up task. θ: joint angle, x: cart position, F: input force.
equation 3.23, are given by

A = [0, 1; g/l, −µ/(ml²)],   B1 = [0; 1/(ml²)],   B2 = [0; 1].   (B.1)
The quadratic approximation of the augmented reward function, equation A.3, is given by

q(t) = −xᵀQx − Ru² + γ²w²,
(B.2)
where Q = diag{0.5, 0}.

Appendix C: Cart Pole Swing-Up Task

The cart pole swing-up (see Figure 11; Doya, 2000) is a strongly nonlinear extension of the common cart pole balancing task (Barto, Sutton, & Anderson, 1983). The physical parameters of the cart pole were the same as in Barto et al. (1983) and Doya (2000). Marked differences from the pendulum swing-up task are that the dimension of the state space is higher and that the disturbance input has two degrees of freedom.

The state vector is x = {x, v, θ, θ̇}, where x and v are, respectively, the position and the velocity of the cart. The dynamics of the cart pole is modeled by the following nonlinear ordinary differential equations:
θ̈ = [g sin θ + cos θ (µc sign(ẋ) − F − mp l θ̇² sin θ)/(mc + mp) − µp θ̇/(mp l)] / [l (4/3 − mp cos²θ/(mc + mp))],   (C.1)

ẍ = [F + mp l (θ̇² sin θ − θ̈ cos θ) − µc sign(ẋ)] / (mc + mp).   (C.2)
The following specifications were used for the simulated pole and cart: the track ranged from −3.6 m to 3.6 m; mass of the cart: mc = 1.0 kg; mass of the pole: mp = 0.1 kg; half-length of the pole: l = 0.5 m; gravity: g = 9.8 m/s²; coefficient of friction of cart on track: µc = 5.0 × 10⁻⁴; coefficient of friction of pole on cart: µp = 2.0 × 10⁻⁶. F denotes the applied force, x the cart position, and θ the joint angle of the pole. The cart-pole dynamics can be written as an input-affine system,

ẋ = f(x) + g(x)u + B2 w,   (C.3)

where

f(x) = [v; (mp l (θ̇² sin θ − θ̈ cos θ) − µc sign(ẋ))/(mc + mp); θ̇; (g sin θ + cos θ (µc sign(ẋ) − mp l θ̇² sin θ)/(mc + mp) − µp θ̇/(mp l)) / (l (4/3 − mp cos²θ/(mc + mp)))],   (C.4)

g(x) = [0; 1/(mc + mp); 0; −(cos θ/(mc + mp)) / (l (4/3 − mp cos²θ/(mc + mp)))],   (C.5)
and we used the input matrix for the disturbance, B2:

B2 = [0, 0; 1, 0; 0, 0; 0, 1].   (C.6)
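Because equation C.1 does not involve ẍ, the two accelerations can be evaluated sequentially; a sketch with the parameter values listed above (our naming; numpy's sign(0) = 0 handles the friction term at rest):

```python
import numpy as np

# Cart pole parameters (appendix C)
mc, mp, l, g = 1.0, 0.1, 0.5, 9.8
mu_c, mu_p = 5.0e-4, 2.0e-6

def cartpole_accel(x, F):
    """theta_ddot and x_ddot from equations C.1 and C.2.
    State x = (pos, v, theta, theta_dot); F is the applied force."""
    pos, v, theta, theta_dot = x
    sin_t, cos_t = np.sin(theta), np.cos(theta)
    # Equation C.1: theta_ddot does not depend on x_ddot, so solve it first.
    num = (g * sin_t
           + cos_t * (mu_c * np.sign(v) - F - mp * l * theta_dot**2 * sin_t) / (mc + mp)
           - mu_p * theta_dot / (mp * l))
    den = l * (4.0 / 3.0 - mp * cos_t**2 / (mc + mp))
    theta_ddot = num / den
    # Equation C.2 then uses theta_ddot.
    x_ddot = (F + mp * l * (theta_dot**2 * sin_t - theta_ddot * cos_t)
              - mu_c * np.sign(v)) / (mc + mp)
    return theta_ddot, x_ddot
```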
The augmented reward was given by

q(x, u, w) = 0.5(cos θ − 1) − S(u) + Σ_{j=1}^{2} Ω_j(wj),   (C.7)
where

S(u) = c ∫₀ᵘ s⁻¹(a/uᵐᵃˣ) da,   (C.8)

Ω_j(wj) = dj ∫₀^{wj} s⁻¹(a/wjᵐᵃˣ) da   (j = 1, 2),   (C.9)
s(x) = (2/π) arctan((π/2) x),   (C.10)
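The saturating cost of equations C.8 through C.10 can be evaluated numerically; a sketch using the constants quoted for this task (function names are ours, and the quadrature is a simple trapezoidal rule):

```python
import numpy as np

def s(x):
    """Saturating sigmoid (equation C.10); its range is (-1, 1)."""
    return (2.0 / np.pi) * np.arctan((np.pi / 2.0) * x)

def s_inv(y):
    """Inverse of s, the integrand of the cost terms in C.8 and C.9."""
    return (2.0 / np.pi) * np.tan((np.pi / 2.0) * y)

def S_cost(u, c=1.0, u_max=20.0, n=1001):
    """Control cost S(u) = c * integral_0^u s^-1(a/u_max) da (equation C.8),
    approximated with the trapezoidal rule (valid for |u| < u_max)."""
    a = np.linspace(0.0, u, n)
    vals = s_inv(a / u_max)
    da = a[1] - a[0]
    return c * (vals.sum() - 0.5 * (vals[0] + vals[-1])) * da
```

Because s⁻¹ diverges as its argument approaches ±1, the cost grows steeply near the saturation limits, which is what makes the bounded policies of equations 4.10 and 4.11 optimal for this reward.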
where uᵐᵃˣ = 20.0 N, w1ᵐᵃˣ = 1.0 N, w2ᵐᵃˣ = 10.0 N·m, c = 1.0, d1 = 1.0, and d2 = 1.0.

Appendix D: Normalized Gaussian Network

A value function is represented by

V(x; v) = Σ_{k=1}^{K} vk bk(x),   (D.1)
where the normalized gaussian basis function is represented by

bk(x) = ak(x) / Σ_{l=1}^{K} al(x),   ak(x) = e^{−‖skᵀ(x − ck)‖²}.   (D.2)
The vectors ck and sk define the center and the size of the kth basis function, respectively. Note that if there is no neighboring basis function, the shape of the basis function extends like a sigmoid function by the effect of normalization. In the current simulations, the centers are fixed in a grid, which is analogous to the "boxes" approach (Barto et al., 1983) often used in discrete RL. Grid allocation of the basis functions enables efficient calculation of their activation as the outer product of the activation vectors for the individual input variables.

In the actor-disturber-critic method, the policies are implemented as

u(t) = su( Σ_k vk^u bk(x(t)) + n^u(t) ),   (D.3)

w(t) = sw( Σ_k vk^w bk(x(t)) + n^w(t) ),   (D.4)

where su() and sw() are monotonically increasing output functions, and n^u and n^w are the noise terms. In the value-gradient-based methods, the control and disturbance policies are given by

u(t) = su( (∂f(x, u, w)/∂u)ᵀ Σ_k vk (∂bk(x)/∂x)ᵀ + n^u(t) ),   (D.5)

w(t) = sw( −(∂f(x, u, w)/∂w)ᵀ Σ_k vk (∂bk(x)/∂x)ᵀ + n^w(t) ).   (D.6)
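A normalized gaussian basis and the value function of equation D.1 can be sketched as follows; each sk is treated as a diagonal scaling here for simplicity, and all names are ours:

```python
import numpy as np

def ngnet_basis(x, centers, sizes):
    """Normalized gaussian basis (equation D.2), with diagonal s_k."""
    a = np.array([np.exp(-np.sum((s * (x - c))**2))
                  for c, s in zip(centers, sizes)])
    return a / a.sum()   # normalization makes the activations sum to one

def value(x, v, centers, sizes):
    """V(x; v) = sum_k v_k b_k(x) (equation D.1)."""
    return v @ ngnet_basis(x, centers, sizes)

# Two 1-D basis functions on a grid; activations always sum to one.
centers = [np.array([0.0]), np.array([1.0])]
sizes = [np.array([1.0]), np.array([1.0])]
b = ngnet_basis(np.array([0.3]), centers, sizes)
```

On a grid, the joint activation factorizes into an outer product of per-dimension activation vectors, which is the efficiency trick mentioned above.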
To implement the input gain model, a network is trained to predict the time derivative of the state from x and u,

ẋ(t) − (∂f(x, u, w)/∂w) w(t) ≈ f̂(x, u) = Σ_k vk^M bk(x(t), u(t)).   (D.7)

The weights are updated by

v̇k^M(t) = η^M [ ẋ(t) − (∂f(x, u, w)/∂w) w(t) − f̂(x(t), u(t)) ] bk(x(t), u(t)),   (D.8)

where the learning rate was set to η^M = 10.0 for all simulations, and the input gain of the system dynamics is given by

∂f̂(x, u)/∂u = Σ_k vk^M (∂bk(x, u)/∂u) |_{u=0}.   (D.9)
Acknowledgments

We thank Christopher G. Atkeson, Mitsuo Kawato, and the anonymous reviewers for their helpful comments.

References

Atkeson, C. G., & Morimoto, J. (2003). Nonparametric representation of policies and value functions: A trajectory-based approach. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15 (pp. 1643–1650). Cambridge, MA: MIT Press.
Barto, A. G., Sutton, R. S., & Anderson, C. W. (1983). Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, 13, 834–846.
Coraluppi, S. P., & Marcus, S. I. (1999). Risk-sensitive and minmax control of discrete-time finite-state Markov decision processes. Automatica, 35, 301–309.
Doya, K. (2000). Reinforcement learning in continuous time and space. Neural Computation, 12(1), 219–245.
Gullapalli, V. (1990). A stochastic reinforcement learning algorithm for learning real-valued functions. Neural Networks, 3, 671–692.
Harmon, M. E., Baird III, L. C., & Klopf, A. H. (1995). Advantage updating applied to a differential game. In G. Tesauro, D. S. Touretzky, & T. K. Leen (Eds.), Advances in neural information processing systems, 7 (pp. 353–360). Cambridge, MA: MIT Press.
Imafuku, K. (1999). Singularities of nonlinear control systems designed by Hamilton-Jacobi equations. Unpublished doctoral dissertation, Nara Institute of Science and Technology.
Kretchmar, R. M., Young, P. M., Anderson, C. W., Hittle, D. C., Anderson, M. L., & Delnero, C. C. (2001). Robust reinforcement learning control with static and dynamic stability. International Journal of Robust and Nonlinear Control, 11, 1469–1500.
Littman, M. L. (1994). Markov games as a framework for multi-agent reinforcement learning. In Proceedings of the Eleventh International Conference on Machine Learning (pp. 157–163). San Francisco: Morgan Kaufmann.
Littman, M. L. (2001). Value-function reinforcement learning in Markov games. Journal of Cognitive Systems Research, 2, 55–66.
Morimoto, J., & Atkeson, C. G. (2003). Minimax differential dynamic programming: An application to robust biped walking. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15 (pp. 1563–1570). Cambridge, MA: MIT Press.
Morimoto, J., & Doya, K. (2000). Acquisition of stand-up behavior by a real robot using hierarchical reinforcement learning. In Proceedings of the Seventeenth International Conference on Machine Learning (pp. 623–630). San Francisco: Morgan Kaufmann.
Morimoto, J., & Doya, K. (2001a). Acquisition of stand-up behavior by a real robot using hierarchical reinforcement learning. Robotics and Autonomous Systems, 36, 37–51.
Morimoto, J., & Doya, K. (2001b). Robust reinforcement learning. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 1061–1067). Cambridge, MA: MIT Press.
Neuneier, R., & Mihatsch, O. (1998). Risk sensitive reinforcement learning. In M. S. Kearns, S. A. Solla, & D. A. Cohn (Eds.), Advances in neural information processing systems, 11 (pp. 1031–1037). Cambridge, MA: MIT Press.
Nilim, A., & Ghaoui, L. E. (2004). Robustness in Markov decision problems with uncertain transition matrices. In S. Thrun, L. Saul, & B. Schölkopf (Eds.), Advances in neural information processing systems, 16. Cambridge, MA: MIT Press.
Singh, S. P., Barto, A. G., Grupen, R., & Connolly, C.
(1994). Robust reinforcement learning in motion planning. In J. D. Cowan, G. Tesauro, & J. Alspector (Eds.), Advances in neural information processing systems, 6 (pp. 655–662). San Francisco: Morgan Kaufmann.
Tesauro, G. (1992). Practical issues in temporal difference learning. Machine Learning, 8, 257–277.
Weiland, S. (1989). Linear quadratic games, H∞, and the Riccati equation. In Proceedings of the Workshop on the Riccati Equation in Control, Systems, and Signals (pp. 156–159). Como, Italy.
Yoshioka, T., & Ishii, S. (1998). Strategy acquisition for the game "Othello" based on reinforcement learning. In S. Usui & T. Omori (Eds.), International Conference on Neural Information Processing (pp. 841–844). London: IOS Press.
Zhou, K., Doyle, J. C., & Glover, K. (1996). Robust optimal control. Englewood Cliffs, NJ: Prentice Hall.

Received May 13, 2004; accepted July 2, 2004.
LETTER
Communicated by Daniel Durstewitz
A Computational Model of the Functional Role of the Ventral-Striatal D2 Receptor in the Expression of Previously Acquired Behaviors Andrew James Smith
[email protected] Suzanna Becker
[email protected] Psychology Department, McMaster University, Hamilton, Ontario, Canada L8S 4K1
Shitij Kapur
[email protected] Center for Addiction and Mental Health, Toronto, Ontario, Canada, M5R 1T8
The functional role of dopamine has attracted a great deal of interest ever since it was empirically discovered that dopamine-blocking drugs could be used to treat psychosis. Specifically, the D2 receptor and its expression in the ventral striatum have emerged as pivotal in our understanding of the complex role of the neuromodulator in schizophrenia, reward, and motivation. Our departure from the ubiquitous temporal difference (TD) model of dopamine neuron firing allows us to account for a range of experimental evidence suggesting that ventral striatal dopamine D2 receptor manipulation selectively modulates motivated behavior for distal versus proximal outcomes. Whether an internal model or the TD approach (or a mixture) is better suited to a comprehensive exposition of tonic and phasic dopamine will have important implications for our understanding of reward, motivation, schizophrenia, and impulsivity. We also use the model to help unite some of the leading cognitive hypotheses of dopamine function under a computational umbrella. We have used the model ourselves to stimulate and focus new rounds of experimental research. 1 Introduction Dopamine is a neuromodulator of great interest because of its central role in reward and motivation, as well as in a number of human disorders, including schizophrenia, attention deficit hyperactivity disorder (ADHD), drug addiction, and Parkinson’s disease. The precise role of dopamine in each of these processes is still a matter of debate, and a number of partially overlapping hypotheses exist. With a view to formalizing and uniting some of these hypotheses, we present a novel computational model of dopamine function that offers a consistent account of some apparently diverse results Neural Computation 17, 361–395 (2005)
© 2005 Massachusetts Institute of Technology
362
A. Smith, S. Becker, and S. Kapur
from the animal behavior literature. In particular, we concentrate on the effect of dopamine manipulation on the expression of previously acquired behaviors.

There are several competing hypotheses of dopamine function, which we briefly review. Some are computational and others cognitive, with the computational approaches tending to be oriented around the phasic dopamine response and its role in learning, and the cognitive hypotheses based on behavioral data and possibly pertaining more to the tonic or constant background signal.

The anhedonia hypothesis (Wise, 1982; Wise, Spindler, DeWit, & Gerber, 1978) suggested that dopamine mediates the hedonia associated with rewarding environmental stimuli. Evidence was drawn from animal experiments in which neuroleptics (dopamine-blocking drugs) caused extinction-like effects. Extinction refers to the slow disappearance of a conditioned behavior if the resulting reward is not forthcoming. Recent data have eroded the anhedonia hypothesis and supported a more subtle role for dopamine in reward, motivation, and salience. For example, Berridge and Robinson (2003) suggest that reward is a multidimensional construct that can be decomposed into (among others) liking (hedonic) and wanting (motivational) components and that dopamine selectively mediates the latter. This position has gained widespread recognition as the incentive salience hypothesis. Evidence is derived from an extreme experiment in which rats that are deprived of almost all dopamine in the ventral and neostriatum simply stop eating, even though they apparently maintain the motoric capability to do so and even though their life depends on the food that is right under their noses! An analysis of affective responses when artificially fed reveals that the hedonic impact of the food ("liking") is apparently unaffected (Berridge & Robinson, 1998).
The incentive salience hypothesis is not computational in nature, although McClure, Daw, and Montague (2003) have developed a computational instantiation to account for some basic experimental data. Salamone, Cousins, and Snyder (1997) suggest that "dopamine in the nucleus accumbens [part of the ventral striatum] is important for responding to conditioned stimuli and . . . to stimuli that are spatially and temporally distant from the organism" (p. 353). General support for the claim that dopamine is implicated in responding for rewards (or avoiding punishments) that are in some way distal from the organism is provided by a wide range of experiments pertaining to conditioned avoidance (Anisman, Irwin, Zacharko, & Tombaugh, 1982; Beninger & Hahn, 1983; Blackburn & Phillips, 1989; Courvoisier, 1956a; Grilly, Johnson, Minardo, Jacoby, & LaRiccia, 1984; van der Heyden & Bradford, 1988; Maffii, 1959; Wadenberg, Soliman, Vanderspek, & Kapur, 2001; Stark, Bischof, & Scheich, 1999; Wilkinson et al., 1998), animal (Richards, Sabol, & de Wit, 1999; Wade, de Wit, & Richards, 2000; Cousins, Atherton, Turner, & Salamone, 1996; Salamone et al., 1991; Salamone, Cousins, & Bucher, 1994) and human (de Wit, Enggasser, & Richards, 2002) models of impulsivity, aphagia (Berridge &
Modeling the Functional Role of Dopamine in the Ventral Striatum
363
Robinson, 1998), and instrumental responding for food (Dickinson, Smith, & Mirenowicz, 2000; Evenden & Robbins, 1983; Fowler, LaCerra, & Ettenberg, 1986; Rolls et al., 1974; Wise & Schwartz, 1981; Wise et al., 1978), drugs (see Wise, 2002, for a review), conditioned reinforcers (Taylor & Robbins, 1984), and electrical brain stimulation (Ettenberg, 1989; Fibiger, Carter, & Phillips, 1976; Rolls et al., 1974; Salamone, Kurth, McCullough, Sokolowski, & Cousins, 1993). Those studies that directly target the ventral striatum suggest that this is the site where this particular effect of dopamine manipulation is occurring (Berridge & Robinson, 1998; Cardinal, Pennicott, Sugathapala, Robbins, & Everitt, 2001; Salamone et al., 1991, 1993, 1994; Salamone, Wisniecki, Carlson, & Correa, 2001).

In contrast to these motivational perspectives, Horvitz (2002) suggests a more attentional role for dopamine in the gating of sensory, reward, and motor processes. Also within this attentional category, Redgrave, Prescott, and Gurney (1999) propose that dopamine plays a role in switching between different behaviors. For example, Phillips, Stuber, Heien, Wightman, and Carelli (2003) report that rats could be made to stop what they were doing and press a lever for cocaine (a preconditioned behavior) simply by artificially stimulating dopamine release.

From a computational perspective, a highly influential model is the prediction error hypothesis of dopamine function (Hollerman & Schultz, 1998; Schultz, Dayan, & Montague, 1997; Montague, Dayan, & Sejnowski, 1996; Houk, Adams, & Barto, 1995). Electrophysiological recordings from the primate midbrain suggest that the dopaminergic signal is somewhat analogous to the prediction error signal used to drive learning in the temporal difference (TD) learning algorithm (for TD, see Sutton, 1988; Sutton & Barto, 1998).
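The analogy can be made concrete with the TD(0) update, whose prediction error δ = r + γV(s′) − V(s) is the quantity that phasic dopamine firing is proposed to resemble. A toy sketch (the two-state cue-reward chain and all names are ours, not from the studies cited):

```python
def td0_update(V, s, r, v_next, alpha=0.1, gamma=0.9):
    """One TD(0) step: delta = r + gamma * V(s') - V(s) is the reward
    prediction error; V(s) is nudged toward the bootstrapped target."""
    delta = r + gamma * v_next - V[s]
    V[s] += alpha * delta
    return delta

# A cue (state 0) is always followed by a rewarded terminal state (state 1).
V = [0.0, 0.0]
for _ in range(500):
    td0_update(V, 0, 0.0, V[1])                   # cue: no reward yet
    delta_at_reward = td0_update(V, 1, 1.0, 0.0)  # reward delivery (terminal)
```

As training proceeds, the reward becomes fully predicted: V converges to (0.9, 1.0) and the error at reward delivery shrinks toward zero, mirroring the reported disappearance of the phasic dopamine burst to predicted rewards.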
Apart from being able to account for a range of data pertaining to the phasic firing of dopamine neurons, one of the strengths of this hypothesis is that it posits both a cause (error in predicted reward) and effect (update of reward prediction) of dopamine neuron firing within a concrete computational framework. However, a number of criticisms have been discussed (Berridge & Robinson, 1998; Horvitz, 2000, 2002; Redgrave et al., 1999). One problem for the prediction error hypothesis is that dopamine is released not just following rewarding stimuli and stimuli that predict reward, but also following novel stimuli, aversive stimuli, and even stimuli that predict aversive events (for reviews, see Ikemoto & Panksepp, 1999; Horvitz, 2000, 2002; Salamone et al., 1997; Joseph, Datla, & Young, 2003). Another problem is that dopamine neurons produce not only intermittent burst or phasic firing, but also a constant background tonic firing. Recently Fiorillo, Tobler, and Schultz (2003) have also found an intermediate sustained firing between a stimulus and a subsequent reward. A more general role for dopamine is therefore suggested. However, perhaps the major consideration is that the prediction error hypothesis primarily posits a role for dopamine in learning (i.e., the first derivative of behavior), and yet there is compelling evidence that dopamine
A. Smith, S. Becker, and S. Kapur
is also required for the expression (“zeroth derivative”) of previously acquired behaviors (Cousins et al., 1996; Rolls et al., 1974; Maffii, 1959; Berridge & Robinson, 1998; Wade et al., 2000; Richards et al., 1999; Wadenberg et al., 2001). Moreover, following dopamine manipulation, behaviors appear to be affected differently depending on their relationship to the rewarding (or punishing) outcome. Although Montague, Dayan, Person, and Sejnowski (1995), Montague et al. (1996), and McClure et al. (2003) have extended the TD model to include a role for dopamine in biasing action selection, this approach has not yet been used to account for the selectivity of dopamine manipulation on distal versus proximal rewards. It is this selectivity that is the focus of this letter. A brief note on the motivation of this work is now offered. One of our long-term goals is a better computational understanding of schizophrenia, and in particular psychosis. Since the ventral striatum and the dopamine D2-receptor subtype have been strongly implicated in the disorder, the first step was to look at behavioral studies that pertain to either the ventral striatum or the specific D2-receptor manipulation, or, where possible, both. The aim of this work is not to model the striatal D2 receptor at an anatomical or physiological level, but rather to induce its functional significance at the behavioral level based on the studies referred to above. Since TD methods provide the currently preeminent computational account of phasic dopamine, our first inclination was to adopt and adapt this approach. However, we were unsuccessful in matching the data to TD and will therefore outline an alternative internal model account of motivated behavior. That is not to say that TD cannot address the data considered below, but rather that we were unable to use it to do so in a parsimonious fashion.
The internal model approach described below and TD use very different representational techniques with far-reaching implications, and it is important to know which forms the sounder basis for understanding schizophrenia. Behavioral data provide our current constraint, but future work must draw on additional constraints (including physiological and electrophysiological considerations) to resolve the issue to the satisfaction of behavioral, psychiatric, and computational communities.
2 Schizophrenia, Psychosis, and Conditioned Avoidance

Schizophrenia occurs with a global incidence of around 1% and is considered one of the most debilitating human disorders (Kandel, Schwartz, & Jessell, 1991). The symptoms are broadly divided into two categories: the positive and the negative. The negative symptoms, characterized by a lack of motivation, flat affect, anhedonia, eccentric behavior, social isolation, poverty of speech, and a poor attention span, are chronic and effectively untreatable by pharmacological intervention. In contrast, the positive symptoms or psychosis, which consist of delusions, disordered thoughts,
and hallucinations, often occur in acute phases and may be mitigated or prevented with antipsychotic drugs (APDs). Although psychosis is a major component of schizophrenia, it may also occur in nonschizophrenic individuals. For insight into the nature of delusions and hallucinations, see Maher and Ross (1984) and Beck and Rector (2003), respectively. All current APDs block dopamine; moreover, there is a striking correlation between the ability of these drugs to block the dopamine D2 receptor and the dose required to mitigate psychosis (Kandel et al., 1991). Interestingly, all drugs of abuse cause an increase in dopamine in the nucleus accumbens (for reviews, see DiChiara, 1999; Kauer, 2003; Ikemoto & Panksepp, 1999), and some of these (particularly cocaine and amphetamine) can also induce psychotic symptoms in users (Bell, 1973; Connell, 1958). The number of D2 receptors is increased in the striatum of (unmedicated) schizophrenia patients in postmortem examination, an effect that is particularly pronounced in patients with positive symptoms (Kandel et al., 1991, chap. 55). The nucleus accumbens in particular may be important as a convergence site for a number of brain regions that are implicated in schizophrenia, including PFC, amygdala, and hippocampus (Grace, 2000), not to mention dopamine projections of the VTA. (See also Grace, 1991, for discussion.) These, along with other data, have led to the hypothesis that psychosis is mediated, if not caused, by an excess of dopamine in the limbic system, of which the ventral striatum is a key component (Kandel et al., 1991, chap. 55). An important preclinical drug test for potential antipsychotic efficacy is a well-established animal experimental paradigm called conditioned avoidance (CA) (Kilts, 2001; Arnt, 1982; Janssen, Niemegeers, & Schellekens, 1965; Wadenberg & Hicks, 1999). 
The standard CA experiment finds that if a neutral stimulus, such as an auditory tone (the conditioned stimulus or CS), regularly precedes an electric shock (unconditioned stimulus, or US), an animal will learn to avoid the shock by taking appropriate evasive action in response to the tone (Kamin, 1954; Low & Low, 1962; Black, 1963). Appropriate evasive action often involves the animal’s running or jumping to another compartment in the cage. An avoidance response is recorded if the animal runs during the tone, and an escape response is recorded if the animal waits until the arrival of the shock. A failure is recorded if the animal fails to run even when shocked. It is well established that low (noncataleptic) doses of all APDs, administered after the avoidance behavior has been acquired, selectively disrupt that avoidance response yet leave the escape response intact (Ader & Clink, 1957; Arnt, 1982; Cook & Weidley, 1957; Cook & Catania, 1964; Courvoisier, 1956b; Davidson & Weidley, 1976; Ponsluns, 1962). As the drug wears off, the avoidance response is restored (Wadenberg et al., 2001; Smith, Li, Becker, & Kapur, 2004). It seems unlikely that these effects are due to the interaction of the drug with the learning processes of the animal because of the differences in the timescales involved. For example, a rat typically requires many trials over many days to acquire the avoidance response. If an APD is
then administered and the rat is tested 20 minutes later (allowing the drug to take effect), the rat will stop running in response to the tone almost immediately. Similarly, when the rat is tested the following day drug free, the avoidance response will reappear almost immediately. Also, in untreated rats, the extinction of the avoidance response when the shock is actually discontinued tends to be a relatively slow process (Kamin, 1954). Therefore, the immediate impact of the drug is again striking. However, APDs do also retard acquisition of the conditioned response, and dissociating the role of dopamine in performance and learning is not straightforward (Beninger, 1989). The degree of APD-induced avoidance disruption has been correlated with D2-receptor blockade, leading to the suggestion that blockade of this dopamine receptor is the neurochemical link between conditioned avoidance disruption in rats and antipsychotic action in people (Wadenberg et al., 2001). Importantly, conditioned avoidance can also be disrupted by direct intra-accumbens injections of D2-receptor antagonists (blocking D2 receptors) and by accumbens 6-OHDA lesions, destroying the dopamine-releasing neurons themselves (see Ikemoto & Panksepp, 1999, for a review). However, the common behavioral or psychological processes are currently unknown. For example, existing hypotheses of why APDs disrupt avoidance include the inhibition of an internal “fear” or “anxiety” (Cook & Weidley, 1957; Miller, Murphy, & Mirsky, 1957; Davis, Capehart, & Llewellin, 1961; Hunt, 1956), motor impairment (Ponsluns, 1962; Cook & Catania, 1964; Morpurgo, 1965; Beninger, Mason, Phillips, & Fibiger, 1980a, 1980b; Grilly et al., 1984; Aguilar, Mari-Sanmillan, Mortant-Deusa, & Minarro, 2000; Ogren & Archer, 1994), reduced responsiveness to external stimuli (Dews & Morse, 1961), decrease in sensory stimulation (Irwin, 1958; Key, 1961), and loss of attention or arousal (Low, Eliasson, & Kornetsky, 1966).
However, following the incentive salience hypothesis of dopamine function, we will propose a motivational explanation that can subsequently be used to account for additional data from other experimental paradigms.
3 The Model

Our model is based on the assumption that an animal builds an explicit internal model of its environment. Internal models have played a significant role in the methodologies of a number of fields, including AI. Indeed, they have been used not only within formal reinforcement learning (reviewed in Sutton & Barto, 1998), but also to model animal conditioning (Schmajuk, 1988) and the dopamine system (Schmajuk, Cox, & Gray, 2001; Suri, Bargas, & Arbib, 2001; Suri, 2001). Although the types of representation used by animals are many and varied (Balleine, Garner, Gonzalez, &
Dickinson, 1995; Berridge & Robinson, 1998; Cardinal, Parkinson, Hall, & Everitt, 2002; Dickinson, 1980), the evidence that animals internally represent action-outcome relationships (among others) is compelling (Dickinson, 1987; Dickinson, Nicholas, & Adams, 1983). For example, Dickinson (1980) reviews the sensory preconditioning paradigm (Nader & LeDoux, 1999; Rizley & Rescorla, 1972; Talk, Gandhi, & Matzel, 2002; Young, Ahier, Upton, Joseph, & Gray, 1998). During stage 1, a tone is paired with food. During stage 2, the food is paired with illness. During stage 3, appetitive response to just the tone on its own is tested. Rats trained on stages 1 and 2 show an attenuated response in stage 3 in comparison to rats that were trained on only stage 1. This demonstrates that rats are able to integrate the knowledge acquired in stages 1 and 2, and Dickinson interprets this (and other data) as evidence for the presence of declarative representations (i.e., an internal world model). Our approach is a type of model-based reinforcement learning in which an agent learns to maximize reward through trial-and-error interaction with its environment. This satisfactorily captures the problem faced by a real animal. First, we assume that our model can recognize the current environmental stimulus and the time since its onset, by activating one of a set of predefined and fixed internal states. This assumption is similar to the tapped-delay-line representation assumption of Schultz et al. (1997), Montague et al. (1996), and others. Second, these states are built into an explicit internal model of the environment that comprises a transition function and a reward function. Third, the internal model is used to evaluate alternative actions and generate motivation via the online calculation of expected future reward. 
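To make these three assumptions concrete, the following is a minimal Python sketch of the internal model's data structures (our illustration only: the class name, the dictionary layout, and the stimulus labels are assumptions; the uniform 1/n initialization of the transition strengths is taken from the description of the model):

```python
class InternalModel:
    """Hypothetical container for the internal model sketched above:
    a fixed set of states indexed by (stimulus, seconds-since-onset)
    pairs, an estimated reward function R-hat, and an estimated
    transition function T-hat."""

    def __init__(self, stimuli, max_offset, actions):
        # One state unit per stimulus-offset pair, e.g. ("CS1", 0), ("CS1", 1), ...
        self.states = [(stim, t) for stim in stimuli
                       for t in range(max_offset + 1)]
        self.actions = list(actions)
        n = len(self.states)
        # Estimated immediate reward of each state, initially zero.
        self.R_hat = {s: 0.0 for s in self.states}
        # Transition strengths T-hat(s, a, s'), initialized uniformly to 1/n
        # so that each state-action row is a probability distribution.
        self.T_hat = {(s, a): {s2: 1.0 / n for s2 in self.states}
                      for s in self.states for a in self.actions}
```

A CA-style model might then be built as `InternalModel(["CS1", "CS2", "US"], 2, ["Run", "Do Nothing"])`.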
We will then propose a role for dopamine in modulating the efficacy of the connections between the states of the internal model, effectively implementing an online version of the discount factor of reinforcement learning (Sutton & Barto, 1998). Figure 1 illustrates our tapped-delay-line assumption. The temporal resolution at which stimuli are represented is arbitrary, and we choose an interval of 1 s. This interval will affect the quantitative but not the qualitative nature of the results. The availability of states is assumed to be sufficient for each task that is considered. Each state, s_i, is associated with an intrinsic reward value, r(s_i), that is assumed to be supplied by the environment. For example, a state representing shock might be assumed to elicit a reward value of −1, while a state representing food might elicit a reward of 1. Neutral stimuli will always elicit a reward value of zero. We assume that a number of different actions are available to the agent, A = {a_1, a_2, ..., a_m}, where m is the total number of actions available. We also assume that the agent can take an action only when a new environmental stimulus is presented. For example, actions are not taken in s_{Any, t>0}. The internal model is described with reference to the simple neural circuit shown in Figure 2. Each time a new internal state, s_new, is activated, the
[Figure 1 appears here: a grid of state units s_{CS1,0}, s_{CS1,1}, s_{CS1,2}, ...; s_{CS2,0}, s_{CS2,1}, s_{CS2,2}, ...; s_{US,0}, s_{US,1}, s_{US,2}, ..., one row per stimulus (CS1, CS2, ..., US) and one column per time since stimulus onset (0, 1, 2, ... s).]
Figure 1: We assume that the model is able to recognize and represent the current stimulus and the time since its onset by activating one appropriate internal state unit from a predefined and fixed set, S = {s1 , . . . , sn }, where n is the total number of states. We will find it convenient to index each unit by either a stimulus-offset pair (as in the figure) or a single generic index that uniquely labels each state. For example, si is simply the ith state.
following procedure is performed:

1. Update the reward estimate of the new state:

   R̂(s_new) := R̂(s_new) + α [ r(s_new) − R̂(s_new) ].

2. Update the transition connections, for all states y ∈ S:

   T̂(s_old, a_old, y) := (1 − β) T̂(s_old, a_old, y) + β   if y = s_new,
   T̂(s_old, a_old, y) := (1 − β) T̂(s_old, a_old, y)        otherwise.      (3.1)

3. Action selection: If s_new represents the onset of an external stimulus, then select and take an action, a_new, as described below.

In the above, s_old is the previously active state and a_old is the previously selected action. α and β are learning rates, which are arbitrarily defined by α = β = exp(−trial/100), where trial is the trial number. These learning rates start at 1 and are slowly reduced to 0 across trials. Some environmental states are marked as terminal states (see the environment descriptions later), and when a terminal state transition occurs, the
Figure 2: The internal model can be interpreted using a simple neural metaphor. Each internal state unit is connected to each other state unit under each possible action by a unique transition connection, the strength of which is controlled by a transition weight. For example, the transition strength, T̂(s_i, a_2, s_j), will be adapted during learning to reflect the probability that state s_j follows state s_i under action a_2. Note that the direction of the transition connections is from the state-action units back to the state units. Additionally, each state maintains an estimate of the immediate reward value associated with that state, R̂(s ∈ S). Although this value will be immediately available in r(s ∈ S), the former represents the estimate learned by the internal model, while the latter represents the actual value coming in from the environment. In theory, r(s ∈ S) could be a distribution, for example, while R̂(s ∈ S) is always a scalar value. The reward values and transition connections are adapted during learning so that an explicit internal model of the environment is approximated. This model will be used to estimate future reward and to drive motivation and action selection. From an anatomical perspective, we speculate that the states themselves are represented in cortical areas (possibly including the OFC and amygdala), while the transition connections are routed through the ventral striatum (see the discussion).
trial is ended. Over a number of trials, the agent learns to represent an explicit internal model of its environment that consists of an estimated reward function and an estimated transition function. Step 2 just redistributes the weights from the relevant state-action unit (see Figure 2) back to each state unit according to a learning rate, β. All transition strengths are initialized to 1/n, and the redistribution rule ensures that Σ_{i=1}^{n} T̂(x, y, s_i) = 1 for all x ∈ S and y ∈ A at all times.
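Steps 1 and 2 above can be sketched in a few lines of Python (a sketch under assumed data layouts, not the authors' implementation):

```python
def update_internal_model(R_hat, T_hat, s_old, a_old, s_new, r, alpha, beta):
    """One update of the internal model on entering state s_new.

    R_hat[j] holds the reward estimate R-hat(s_j); T_hat[s][a][y] holds
    the transition strength T-hat(s, a, y) as nested lists."""
    # Step 1: move the reward estimate of the new state toward the
    # observed reward r(s_new).
    R_hat[s_new] += alpha * (r - R_hat[s_new])
    # Step 2 (equation 3.1): shrink the whole (s_old, a_old) row by
    # (1 - beta) and give the freed mass to s_new, so the row still
    # sums to 1 (the normalization property noted in the text).
    row = T_hat[s_old][a_old]
    for y in range(len(row)):
        row[y] *= (1.0 - beta)
    row[s_new] += beta
```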
Action selection is performed by using a look-ahead process to generate the expected future reward of each action in turn and then selecting the best action. The look-ahead process involves playing through the consequences of taking each action inside the internal model. We assume that this look-ahead can be performed without disturbing the modeling process described above. If actually implemented in neural circuitry, it might be necessary to maintain two copies of the internal model: one for keeping track of the environment and one for performing look-ahead. We abstract over this detail.

We now propose a role for dopamine in modulating the efficacy of the transition connections. First, let z_{ij} denote the state-action unit corresponding to taking action j in state i. Next, let the activations of s_i and z_{ij} be denoted by ξ(s_i) and ξ(z_{ij}), respectively. Also, let FutRew(a ∈ A) be an internal register used for accumulating the total future reward of taking action a in the current state, s_new. The action selection step is fleshed out as follows:

3. Action selection: If s_new represents the onset of an external stimulus:

   (a) For each action, a_i ∈ A:

       i. FutRew(a_i) := 0.

       ii. For all j ∈ {1, ..., n}: ξ(s_j) := 1 if s_j = s_new, 0 otherwise.

       iii. For some fixed number of iterations, q:

           A. Propagate activation from state units to state-action units. For all j ∈ {1, ..., n} and all k ∈ {1, ..., m}: ξ(z_{jk}) := ξ(s_j) if k = i, 0 otherwise.

           B. Generate the hypothetical next state. For all j ∈ {1, ..., n}: ξ(s_j) := Σ_{k=1}^{n} Σ_{l=1}^{m} ξ(z_{kl}) × T̂(s_k, a_l, s_j) × DAtonic.

           C. Collect rewards for this hypothetical state: FutRew(a_i) := FutRew(a_i) + Σ_{j=1}^{n} ξ(s_j) × R̂(s_j).

           D. Return to 3(a)iiiA.

   (b) Select and take action: a_new := argmax_{a ∈ A} FutRew(a) with probability p, or a random action with probability 1 − p.
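The look-ahead loop of step 3(a) can be sketched as follows (our illustration; it assumes T_hat[k][a][j] holds T̂(s_k, a, s_j) and R_hat[j] holds R̂(s_j) as nested lists, and it hard-codes the simplification that the evaluated action is re-selected at every hypothetical step):

```python
def evaluate_action(a_i, s_new, T_hat, R_hat, da_tonic=1.0, q=50):
    """Accumulate FutRew(a_i) by playing the internal model forward
    for q hypothetical steps, always re-selecting action a_i."""
    n = len(R_hat)
    # Step 3(a)ii: activation starts concentrated on the current state.
    xi = [1.0 if j == s_new else 0.0 for j in range(n)]
    fut_rew = 0.0
    for _ in range(q):
        # Steps 3(a)iiiA-B: propagate activation through the transition
        # weights for action a_i, attenuated by tonic dopamine.
        xi = [da_tonic * sum(xi[k] * T_hat[k][a_i][j] for k in range(n))
              for j in range(n)]
        # Step 3(a)iiiC: collect rewards for this hypothetical state.
        fut_rew += sum(xi[j] * R_hat[j] for j in range(n))
    return fut_rew
```

After t propagations the activation vector carries a factor of DAtonic^t, so FutRew(a) approximates the DAtonic-discounted return Σ_{t=1}^{q} DAtonic^t E[R̂(s_t)]; with DAtonic = 1 the activations remain a probability distribution over hypothetical states.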
In the above, 0 ≤ DAtonic ≤ 1 is the level of tonic dopamine (default = 1); q is the maximum depth of the look-ahead process, which we arbitrarily set at 50; and p controls exploration, where p = (1 + exp(5 − 0.07 · trial))^{−1}. Exploration starts at 1 and is reduced to 0 over successive trials. Exploration is an important part of behavior acquisition, since in general the initial state of the internal model will not accurately reflect the environmental contingencies. The approach we have taken is very simple. Activity cycles around the circuit of Figure 2 in order to simulate the environmental consequences of each action in sequence. The action that accumulates the greatest future reward during this process is actually selected, generating a new real sequence of environmental stimuli. A role for DAtonic as a modulator of the transition connections is proposed in step 3(a)iiiB. This will allow us to account for the finding that actions motivated by distal rewards are more vulnerable to dopamine manipulation than those motivated by proximal rewards. We have made the simplifying assumption that during the look-ahead process, the action currently being evaluated is always selected (step 3(a)iiiA). We consider a more flexible alternative to this look-ahead policy later, although the qualitative nature of the results is not contingent on this feature of the model because of the simple environments used. Providing that DAtonic = 1, at any stage in the look-ahead process, Σ_{j=1}^{n} ξ(s_j) = 1. This is guaranteed by the starting condition in 3(a)ii and the fact that Σ_{j=1}^{n} T̂(x, y, s_j) = 1 for all x ∈ S and y ∈ A. More important, the
activation of each state unit corresponds to the probability of that state indeed being the current state at that future time, under the look-ahead policy. However, for DAtonic < 1, these activations will decay during the look-ahead process, although the activations will still be in proportion to these probabilities. As the look-ahead process continues, FutRew(a) converges on the estimated future reward of taking action a, modulated by the online discount factor, DAtonic. Expected future reward, or the return, is the standard quantity to maximize in reinforcement learning problems (Sutton & Barto, 1998; Kaelbling, Littman, & Moore, 1996). Note that the agent is naturally parallel in design and relies mainly on local learning and activation rules.

4 Modeling Conditioned Avoidance

A prerequisite of any model of delusions or disordered thoughts is a model of ordered thoughts. Since a model of ordered thoughts represents the holy grail of a number of disciplines, it is as well that we have a relatively simple yet highly relevant animal model of dopaminergic action at our disposal. We present the model in conjunction with a generalized version of the CA paradigm presented in Maffii (1959) (reviewed with other classic APD studies in Dews & Morse, 1961). The standard CA finding that APDs disrupt avoidance before escape is observed within Maffii's paradigm, but Maffii
also makes an interesting additional observation that we wish to model. He found that after sufficient training, the rats began producing the avoidance response as soon as they were placed in the cage and before they were even presented with the tone. He termed this response to environmental context the secondary avoidance response, and the standard response to the tone the primary avoidance response. He then found that not only was the primary avoidance response more vulnerable to dopamine blockade than the escape response itself (the standard finding), but that the secondary avoidance response was more vulnerable than the primary response (see Figure 5, left). Apparently, the more distal the cue from the shock, the more vulnerable that cue was to dopamine blockade in terms of its ability to elicit a response. We can describe the standard CA environment with the finite state model of Figure 3a (left) and Maffii’s experiment with Figure 3a (right). The finite state environments defined in Figure 3 are a necessary assumption required to formalize the problem, and they effectively take on the role of an animal’s environment as well as its sensory processing. These environment descriptions will be different for each experiment that we consider. In contrast, the learning rules described above are defined once and represent our model of the behavioral processes, which we believe to pertain to the ventral striatal D2 receptor. Figure 4a shows the internal model after 100 learning trials. The transition connections reflect the environmental contingencies of Figure 3a (right). This internal model can then be used to generate the expected future reward of taking different actions in each environment state. For example, Figure 4e shows how activation is propagated through the internal model of Figure 4c during the look-ahead process for the “Do Nothing” action at trial onset with DAtonic = 1.
The return is generated by summing the reward estimates of each active unit in proportion to its activation. Setting DAtonic < 1 after acquisition of the internal model results in the decay of this activation during the look-ahead process, and therefore also the attenuation of the impact of increasingly distal outcomes (not shown). Figure 5 compares expected future reward thus generated with the performance of Maffii’s rats. Here we are assuming that expected future reward is a suitable basis for motivation. In order to convert this value into the probabilistic behavior of the rats as shown in Figure 5 (right), a softmax action selection mechanism would be required in place of the ε-greedy mechanism of step 3(b). Although not shown, the model acquires avoidance behavior within a number of trials that is consistent with the amount of experience required by an actual rat (Wadenberg et al., 2001). However, this behavior is rather trivial, and the real point of interest is the selective effect of DAtonic on secondary avoidance versus primary avoidance versus escape. For interest, Figures 4c and 4e give an example of the model’s behavior in noisy or stochastic environments.
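Such a softmax mechanism might look like the following (our sketch; the temperature parameter and function names are assumptions, not part of the model):

```python
import math
import random

def softmax_probs(fut_rews, temperature=1.0):
    """Map expected future rewards onto choice probabilities; a
    hypothetical replacement for the epsilon-greedy rule of step 3(b)."""
    mx = max(fut_rews)  # subtract the max for numerical stability
    weights = [math.exp((f - mx) / temperature) for f in fut_rews]
    total = sum(weights)
    return [w / total for w in weights]

def softmax_choice(fut_rews, temperature=1.0):
    # Sample an action index in proportion to its softmax probability.
    return random.choices(range(len(fut_rews)),
                          weights=softmax_probs(fut_rews, temperature))[0]
```

Lowering the temperature makes the rule approach greedy selection; raising it flattens the choice probabilities, which is one way to map graded motivation onto graded response rates.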
[Figure 3 appears here: (a) MDP for modeling conditioned avoidance; (b) MDP for the discounting task of Wade et al. (2000). Recoverable node labels: (a) Environment cue, Tone, Shock (reward −1), and Safety (reward 0), with 5 s and 0 s transition delays and the actions Run and Do Nothing; (b) Start, Tone, Immediate reward (100), Fixed reward (150), and Terminal, with the actions Press lever 1 and Press lever 2.]

Figure 3: (a, left) A simple and abstract environment model for the classic CA task. Each circle represents a different environment state an agent can be in, and the arrows denote the outcome of each of the two possible actions. The arrows are labeled with the time between the action being taken and the transition actually occurring. For example, if the Do Nothing action is selected when the tone is presented, the shock will be delivered after a delay of 5 s (see Wadenberg et al., 2001, for an experimental example). We make the simplifying assumption that the agent must take one of the available actions on entry into a state, and must then wait for the ensuing transition before another action may be taken. The number inside each state represents the reward, r, associated with that state. The Safety state is the terminal state, which always transitions to itself, and the trial is terminated when the first such transition is made. Note that if the agent selects Do Nothing in the Shock state, the shock lasts only for 5 s, after which the trial is automatically terminated. This is consistent with a typical experimental setup (Wadenberg et al., 2001). (a, right) A generalized version of (left) that represents the experimental setup of Maffii (1959). (b) An abstract definition of the environment for the discounting task of Wade et al. (2000; see section 5). Trials begin in the left-most state.
Figure 4: (a) The state of the internal model after learning the environment of Figure 3a (right). The secondary stimulus refers to the Environment cue in Figure 3a (right) and the primary stimulus to the Tone. All learned transition connections, T̂, are shown. The relationship to Figure 2 is as follows: The units of Figure 2 are plotted here in weight space rather than physical or anatomical space. Also, the state-action units are removed, and the transition connections from each of the state-action units are collapsed onto the relevant state so that the connections from state to state can be seen more clearly. As a result of visualizing the T̂(s, a, s′) function as a simpler T̂(s, s′) function, it is not possible in this figure to know which action causes a given transition. However, the transition connections associated with taking the Run action in response to the onset of each external stimulus always transition to the Safety unit. The remaining transitions show the consequences of Do Nothing. (b) The state of the internal model after learning the environment of Figure 3b. (c) As in a, except that temporal random noise is added to the identification of the current state. The bolder connections denote a stronger transition weight. (d) The state of the internal model after learning the T-maze environment of Figure 8a. (e) The propagation of activation through the internal model of c during a typical look-ahead process for the Do Nothing action at the beginning of a trial, with DAtonic = 1. The process is shown for every alternate iteration.
Figure 5: A comparison of model performance with Maffii’s data. (a) Number of secondary avoidance responses, primary avoidance responses, and escape responses under increasing doses of the neuroleptic chlorpromazine (a drug with a particularly high affinity for the D2 receptor), as a percentage of the number of responses without the drug (adapted from Maffii, 1959). (b) The change in FutRew(Do Nothing) for each of the three important states (solid line = environment cue, dashed line = tone, gray line = shock) as DAtonic is decreased. Since we use DAtonic as an abstract representation of dopamine and the model does not attempt to address the underlying neurochemical processes, the relationship between the model parameter, DAtonic , and chlorpromazine dose is uncertain. However, it is the qualitative nature of the results that is of interest, and in particular, the selective effect of dopamine blockade on secondary avoidance versus primary avoidance versus escape. For the comparison to be meaningful, we assume that FutRew(Do Nothing) can be used as a direct analogy of motivation, since the alternative action, Run, always yields an estimated future reward of zero. The horizontal line suggests an example escape cost that could be used to threshold motivation. This could explain why many studies find that low doses of neuroleptics disrupt avoidance but not escape.
5 Impulsivity, Delayed Rewards, and ADHD

Attention deficit/hyperactivity disorder (ADHD) is a developmental disorder affecting 3% to 7% of school-age children, characterized by inattention, hyperactivity, and impulsivity (Frances, 2000). The single most effective treatment for the disorder is medication with psychostimulants such as methylphenidate (Phares, 2003). Psychostimulants (of which amphetamine is one) are known to increase extracellular concentrations of dopamine (Seeman & Madras, 1998), and indeed ADHD is believed to be primarily a dopaminergic/noradrenergic disorder (see Phares, 2003). Decreased blood flow has also been observed in the striatum of ADHD patients, a deficit
reversible by treatment with methylphenidate (reviewed in Schneider, Sun, & Roeltgen, 1994). As with psychosis, striatal dopamine seems to be of importance to ADHD. However, unlike psychosis, it may be a deficit rather than an excess that is to blame. There are no laboratory tests, neurological assessments, or attentional assessments that have been established as a diagnostic in the clinical assessment of ADHD (Frances, 2000). However, Solanto et al. (2001) have suggested that a desire to avoid delay (the delay aversion hypothesis) is one of the best characterizations of impulsivity with respect to ADHD, and Catania (in press) has argued specifically that many of the symptoms of ADHD can be accounted for by assuming too steep a discounting gradient. Studies such as de Wit et al. (2002) have confirmed that certain measures of impulsivity are indeed reduced in healthy volunteers by amphetamine. For example, subjects were asked questions such as, “Would you prefer ten dollars in thirty days or two dollars at the end of the session?” The amphetamine-treated subjects showed an increased preference for the larger but delayed reward when compared with a placebo group. If blocking dopamine can disrupt avoidance responding in rats, enhancing dopamine seems to decrease impulsivity as measured by this kind of delay discounting (Wade et al., 2000; Richards et al., 1999; de Wit et al., 2002). Furthermore, both phenomena may be mediated by the same part of the brain: the striatum (see Phares, 2003, for more on striatal involvement in ADHD). However, as with psychosis, the underlying behavioral and psychological processes are unclear. In a bid to better understand the role of dopamine in impulsivity and ADHD, Richards et al. (1999) and Wade et al. (2000) have investigated the effects of both amphetamine and dopamine-blocking drugs on delay discounting in rats. Their experimental paradigm can be summarized as follows. A thirsty rat is trained to press one of two levers.
One lever yields an immediate reward (for example, 100 µl of water), and the other yields a fixed reward (150 µl of water) but only after a delay of 4 s. If the animal selects the delayed reward, a tone is presented between the lever press and the water to make the task easier. The immediate reward is then adjusted from trial to trial depending on which lever the animal selected on the previous trial. If the rat chose the immediate reward, the immediate reward is reduced by 15%, and if the rat chose the delayed reward, the immediate reward is increased by 15%. In this way, the immediate reward is varied until the rat has no particular preference. The amount of immediate reward elicited at this indifference point is then interpreted as the animal’s “value” of the fixed, delayed reward. The rats are trained over a period of many weeks, on a variety of different starting conditions, until they become familiar with and adept at exploring the two alternatives and achieving the indifference point that suits their preference. As an additional aid, if the rat chooses the same lever twice in a row, then the immediately following trial is a forced exploration trial in which only the other lever yields a reward.
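The adjusting-amount procedure above can be sketched as a short simulation. This is a minimal sketch under assumed numbers: we give the simulated rat a fixed subjective value for the delayed lever (a hyperbolic discount, which is our illustrative assumption, not the authors' model) and titrate the immediate amount by ±15% exactly as described.

```python
def subjective_value(amount, delay, k=0.2):
    # Assumed hyperbolic valuation of a delayed reward (illustrative only).
    return amount / (1.0 + k * delay)

def adjusting_amount(n_trials=200, immediate=100.0, delayed=150.0, delay=4.0, step=0.15):
    # Titrate the immediate lever until the rat is indifferent (Richards/Wade paradigm).
    v_delayed = subjective_value(delayed, delay)
    for _ in range(n_trials):
        if immediate > v_delayed:
            immediate *= 1.0 - step   # rat chose the immediate lever: make it worse
        else:
            immediate *= 1.0 + step   # rat chose the delayed lever: make it better
    return immediate                  # oscillates around the indifference point

# The titrated amount settles near subjective_value(150, 4), i.e., about 83 µl here.
print(round(adjusting_amount(), 1))
```

The amount returned at the end of a session is then read off as the animal’s “value” of the delayed reward, exactly as in the text.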
Modeling the Functional Role of Dopamine in the Ventral Striatum
Figure 6: The impact of various doses of (a) a dopamine-enhancing drug (amphetamine) and (b) a D2-blocking drug (raclopride), on the indifference point. The indifference point indicates the value of the delayed 150 µL alternative. Error bars indicate SEM, and asterisks denote a significant difference from dose = 0. Adapted from Wade et al. (2000).
After the rats have been trained on this procedure, they are tested under various systemic doses of both amphetamine and raclopride (a dopamine D2 receptor blocker; see Figure 6). A dose-dependent effect of both drugs is observed on the indifference point. The authors claim that amphetamine effectively reduced impulsivity (by increasing the value of the delayed reward), and raclopride increased impulsivity (by decreasing the value of the delayed reward). Because of the incrementally adjusting procedure used, it is impossible to rule out a learning effect completely. However, the effects of these drugs were very quick compared with the weeks of training required for acquisition of the basic task. In the case of raclopride in particular, the lower indifference point was reached as quickly as the paradigm allowed—after just a few trials under the drug. It therefore seems likely that a significant part of the change in the animals’ behavior was due to the effect of the drug on performance (rather than, or in addition to, its effect on learning). Wade et al. (2000) conclude that “this pattern of results indicates that blocking D2-receptors may have a selective effect on the value of delayed rewards” (p. 197). We can now use the model described in the previous section to propose an explanation that is consistent with data from CA. Figure 4b shows the result of training the agent on the environment of Figure 3b. Acquisition is again trivial (for DAtonic = 1, the agent learns to select the greater, delayed reward), but Figure 7 shows the effect of manipulating dopamine after acquisition. The delayed reward is discounted more in
the look-ahead process as DAtonic is reduced. In order to capture the effects of both dopamine blockade and amphetamine, we assume a baseline DAtonic < 1. The model is then able to provide a qualitative account of all drug-induced changes in impulsivity. Note that the vertical bars shown in Figure 7 can be moved apart (or together) by reducing (or increasing) the temporal resolution in Figure 1, thereby modifying the steepness of discounting. For reference, the largest dose of raclopride used (120 µg) would reduce avoidance responding in CAR by around 20% (Wadenberg, Kapur, Soliman, Jones, & Vaccarino, 2000). Notionally, the model also captures five additional observations made by Wade et al. (2000). First, independent of drug treatments, they vary the delay to the fixed reward (2 s and 8 s). They find that a delay of 2 s increases the indifference point (value of the delayed alternative) and that a delay of 8 s decreases the indifference point. Second, they find that the initial amount of water on the immediate alternative does not affect the eventual indifference point. Third, they find that the deprivational state of the animal (i.e., its thirst) does not affect the indifference point. In our model, a logical role for thirst would be to affect the perception of the two rewards equally (i.e., by a common factor), which could be achieved, for example, by substituting step 3(a)iiiC with:
3(a)iiiC) Collect rewards for this hypothetical state:
FutRew(a_i) := FutRew(a_i) + Σ_{j=1..n} ξ(s_j) × R̂(s_j) × “Thirst”.
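Because the thirst factor multiplies every collected reward, it scales both actions' estimates equally and cancels in the comparison, leaving the indifference point unchanged. A toy check of this point (the activation and reward numbers are hypothetical, and `fut_rew` is our shorthand for the accumulation in step 3(a)iii):

```python
def fut_rew(activations, rewards, thirst=1.0):
    # Accumulate hypothetical-state rewards, scaled by a common "Thirst" factor
    # as in the modified step 3(a)iiiC. All numbers below are illustrative.
    return sum(x * r for x, r in zip(activations, rewards)) * thirst

imm_trace = ([1.0], [100.0])            # immediate lever: 100 µl now
del_trace = ([0.9, 0.8], [0.0, 150.0])  # delayed lever: activation decays across the delay

for thirst in (0.5, 1.0, 2.0):
    v_imm = fut_rew(*imm_trace, thirst=thirst)
    v_del = fut_rew(*del_trace, thirst=thirst)
    # The ratio (hence the preference and the indifference point) is thirst-independent.
    print(thirst, round(v_del / v_imm, 2))
```

Whatever the deprivational state, the relative ordering of the two levers is the same, which is the invariance claimed in the text.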
This would not affect the relative indifference point. Fourth, our model predicts that blocking dopamine should generally reduce the motivation to press either lever, and enhancing dopamine should generally increase the motivation to press a lever. Wade et al. (2000) examined the tendencies of the rats to complete each trial and found that raclopride reduced but amphetamine increased the mean number of trials completed. Admittedly, completed trials is only an informal measure of motivation. Fifth, in a related and recent study, J. B. Richards (personal communication, 2003) finds that if rats are trained on a task in which levers yield immediate rewards, but one is certain and one is uncertain, then amphetamine does not affect the indifference point. This behavior would be produced by the model too because uncertainty is represented by the transition connection strengths, and dopamine has an equal effect on all such connections. A complete analysis of the model’s performance under these five conditions is outside the scope of the account presented here. It should be noted that Wade et al. (2000) observed the effects described above only with dopamine D2-blocking drugs. D1 receptor blockers had no significant effect. Interestingly, Cardinal et al. (2001) demonstrate that accumbens (core) lesions in rats cause exactly the same kinds of effects, leading them to conclude that the accumbens is involved in the pathogenesis of impulsive choice, a finding they suggest can shed light on ADHD, addiction, and other impulse control disorders. Cardinal, Robbins, and Everitt (2000)
Figure 7: The effect of varying the DAtonic parameter after learning on the estimated future reward associated with each of the two actions from the Start state. The solid curve shows the estimated future reward of pressing lever 1 (delayed reward = 140 µL), while the horizontal dashed line shows the estimated future reward of pressing lever 2 (immediate reward = 100 µL). The gray curved line shows the hypothetical value of the immediate reward necessary to balance the estimated future reward of the immediate alternative with that of the delayed alternative (i.e., the indifference point). This gray line can be used to match the model performance with data from Wade et al. (2000). The thick vertical bar shows the simulated level of dopamine that corresponds to the performance of the undrugged rats in Wade et al. (2000). The vertical bars to the left show the simulated dopamine level corresponding to the indifference point of the amphetamine-treated rats, and the vertical bars to the right show the simulated dopamine level corresponding to the performance of the raclopride-treated rats (see Figure 6). In agreement with the animal study, the model predicts that reducing DAtonic reduces the indifference point, while increasing dopamine increases the indifference point.
also find the same kinds of impulsive behavior with systemic administration of amphetamine and combined D1/D2 blockers. It has been suggested that accumbens dopamine is necessary for producing anticipatory responses (e.g., avoidance, or lever pressing for food or water), but not consummatory responses (e.g., escape, or feeding or drinking itself) (Ikemoto & Panksepp, 1999). It is therefore noteworthy that in the impulsivity studies discussed above, two equally anticipatory responses with
equal motoric requirements were dissociated by dopamine manipulation. Therefore, arguments that rely solely either on anticipatory-consummatory distinctions or on motor deficits may need to be adjusted to address delay-discounting paradigms. Finally, two important caveats are considered. First, in contrast to the data discussed above, some studies have observed that amphetamine actually increases impulsivity rather than decreases it, leading to the suggestion that the finer experimental details, such as whether a cue is presented during the delay, may be important (Cardinal et al., 2000). Perhaps one problem is that amphetamine injections directly into the accumbens increase general locomotor activity (reviewed in Pennartz, Groenewegen, & Silva, 1994), as well as an animal’s motivation to work for reinforcers (Taylor & Robbins, 1984), as predicted by the internal model (since expected future reward equals motivation). Therefore, it may be difficult for impulsivity studies to separate the absolute increases in motivation due to amphetamine from the relative decreases in choice tasks. The way in which the animal internally models its environment is likely to play a large role. Second, it must be acknowledged that ADHD is a complex and multidimensional disorder, and the delay aversion hypothesis (see Solanto et al., 2001) and delay discounting hypothesis (see Catania, in press) represent only one line of argument. However, we conclude by contrasting our proposed role for dopamine in gating a look-ahead process with one of the standard characterizations of impulsivity within ADHD offered by DSM-IV (emphasis our own): “Impulsivity may lead to accidents and to engagement in potentially dangerous activities without consideration of possible consequences” (Frances, 2000, p. 86). 6 A T-Maze Experiment In a fascinating study by Cousins et al. (1996), rats were presented with a choice between two arms of a T-maze: one leading to four food pellets, and the other to two.
However, the arm leading to four pellets was obstructed with a barrier that had to be climbed (see Figure 8b). Once trained, the rats chose the arm containing four pellets on almost 100% of trials. However, after bilateral intra-accumbens injections of 6-hydroxydopamine, a treatment that destroys dopaminergic projections to the accumbens, the rats changed their behavior, selecting the unobstructed but lesser reward instead (80% of the time). A second experiment was then performed in which different rats were trained (untreated, as before) on a different version of the T-maze in which there was no reward in the unobstructed arm. Subsequent dopamine lesions only slightly reduced responding for the obstructed arm (from 100% of the time to 80%). Therefore, Cousins et al. (1996) were able to rule out the possibility that the animals in the first experiment were unable to cross the barrier because of motoric deficits, for example. Rather, it appears that dopamine lesions were influencing the motivational and choice processes of the animal.
Figure 8: (a) A formal environment model capturing the salient features of the T-maze task. For simplicity we can assume negligible delays between all state transitions. This is not a necessary assumption. (b) The T-maze experimental setup used in Cousins et al. (1996). (c) The effect of varying DAtonic, after acquisition, on the estimated future reward associated with taking the Left (solid) and Right (dotted) actions from the Start state. The point at which the graphs cross denotes the point at which the model will switch from seeking the obstructed reward to seeking the easily available alternative.
The internal model can be used to capture these results by training it on the environment of Figure 8a. Figure 4d illustrates the resulting internal model, and Figure 8c demonstrates the effect of modulating DAtonic , after acquisition, on the estimated future reward associated with moving left or right from the start state. The model accounts for the change in animal behavior from the obstructed to the unobstructed arm under dopamine disruption. It also notionally accounts for the observation that if the rats are trained with no food in the unobstructed arm, then they continue to select
the obstructed arm, although apparently with less motivation. The model would continue to select the obstructed arm in this case until the solid line failed to exceed the cost of climbing the wall. The results of this study are particularly striking because the nucleus accumbens is specifically targeted with a direct dopamine challenge, eliciting a qualitative change in behavior. This result is not an isolated experiment either. It is duplicated in essence in Salamone et al. (1991, 1994) with both direct accumbens dopamine depletion and systemic D2 blockers, and Salamone et al. (1997) provides a review of many other related experiments by Salamone and colleagues that adhere to the same general pattern. 7 Instrumental Responding for Rewards A generally accepted consequence of dopamine blockade in rats is a reduction in lever pressing (also other responses) for various types of reward (Ettenberg, Koob, & Bloom, 1981; Evenden & Robbins, 1983; Fibiger et al., 1976; Fowler et al., 1986; Rolls et al., 1974; Salamone et al., 1993; Wise & Schwartz, 1981; Wise et al., 1978). Moreover, lever pressing for food or water is often found to be reduced at doses that have little or no impact on free feeding or free drinking where no lever press is required (Rolls et al., 1974; Salamone et al., 1993; see Figure 9a). However, recall that Berridge and Robinson (1998) were able to disrupt free feeding, even if the food was under the rats’ noses, by reducing accumbens and neostriatal dopamine by around 95%. We can use the current model to account for these findings by assuming the environment model of Figure 9b. In order to estimate future reward (and therefore generate motivation) for lever pressing, all three of the transitions are required. In contrast, free feeding requires only Approach and Consume, and in Berridge’s rats’ case, only the Consume transition is required. 
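Under the reading that tonic dopamine multiplicatively gates each transition in the look-ahead (our simplification of the model), the vulnerability ordering falls out directly: the more transitions separating the response from its reward, the faster its value collapses as DAtonic drops. A sketch with hypothetical numbers:

```python
def chain_value(n_transitions, reward=1.0, da_tonic=1.0):
    # Expected future reward of an outcome n transitions away, assuming DAtonic
    # acts as a per-transition gain (a simplification of the look-ahead model).
    return (da_tonic ** n_transitions) * reward

# Figure 9b's abstraction: lever pressing = Press + Approach + Consume (3 transitions),
# free feeding = Approach + Consume (2), consumption alone = 1.
for da in (1.0, 0.7, 0.4):
    press, feed, eat = (chain_value(n, da_tonic=da) for n in (3, 2, 1))
    print(f"DAtonic={da}: press={press:.2f}  free-feed={feed:.2f}  consume={eat:.2f}")
```

At full dopamine all three motivations are equal; as DAtonic falls, lever pressing is disrupted first, then free feeding, then consumption, matching the ordering described in the text.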
We do not show the model’s performance on this task because of the qualitative similarity between Figures 9a and 5a. However, it should be evident that lever pressing is disrupted before free feeding and free feeding before consumption in the model under the assumption of Figure 9b. With respect to this and the T-maze experiment reviewed earlier, it is an open question as to whether it is the “instrumental” or temporal distance (or both) between an action and its rewarding outcome that renders that action vulnerable to dopamine blockade. 8 Uniting Existing Hypotheses and Related Work We have presented a computational model intended to address a corpus of experimental data that point to a selective role for dopamine in the expression of previously acquired behaviors. More specifically, the D2-receptor subtype within the ventral striatum (particularly the accumbens) has emerged as a common neuroanatomical thread throughout the studies we have considered. If our model of the transition connections passing
Figure 9: (a) Lever pressing for food is attenuated at neuroleptic doses that do not drastically affect free feeding. Free drinking may be even more robust than free feeding to dopamine challenge (results adapted from Rolls et al., 1974). Note that spiroperidol primarily blocks the D2 receptor. (b) An abstract environment for instrumental responding that, in combination with the model presented, accounts for the selective vulnerability of instrumental responding to dopamine challenge.
through the ventral striatum is plausible, then we would expect to see this area being activated in response to conditioned stimuli as the look-ahead process is invoked. A number of fMRI studies in human subjects confirm this for both appetitive (Gottfried, O’Doherty, & Dolan, 2002; O’Doherty, Deichmann, Critchley, & Dolan, 2002; O’Doherty, Dayan, Friston, Critchley, & Dolan, 2003) and aversive (Jensen et al., 2003) tasks. Also, using electrophysiological recording techniques, Schultz, Tremblay, & Hollerman (2000) have found a variety of activations in the ventral striatum that are consistent with its role in the expectation of reward across a delay. In addition to the ventral striatum, a range of fMRI, electrophysiological, and behavioral data points to a central role for the amygdala (Cador, Robbins, & Everitt, 1989; Everitt, Morris, O’Brien, & Robbins, 1991; Everitt & Robbins, 1989; Cardinal et al., 2002; Nishijo, Ono, & Nishijo, 1988; Robbins, Cador, Taylor, & Everitt, 1989; Tremblay & Schultz, 2000a, 2000b) and the orbito-frontal cortex (Arana et al., 2003; Bechara, Damasio, Tranel, & Anderson, 1998; Cardinal et al., 2002; Iversen & Mishkin, 1970; Masterman & Cummings, 1997; O’Doherty et al., 2003) in the representation of environmental reward contingencies of the kind relevant to the current discussion. It has been suggested that the OFC in particular is crucially involved in the motivational control of goal-directed behavior (Schultz et al., 2000), learning about rewards (Dias, Robbins, & Roberts, 1996), and evaluating alternatives
(Arana et al., 2003; Schultz et al., 2003) via a common neural currency (Montague & Berns, 2002). We therefore speculate that the states themselves may be represented cortically (originating presumably in the sensory cortex), along with the return being represented with the reward estimates, R̂, in the OFC and amygdala. The proposed model can be used to unite a number of existing cognitive hypotheses of dopamine function within a formal framework. For example, Berridge and Robinson (1998) suggest that mesolimbic dopamine mediates the wanting component of reward as distinct from the liking, and in our model, following McClure et al. (2003), we interpret expected future reward as precisely this wanting. Salamone et al. (1997) suggest that “accumbens dopamine is important for responding to stimuli that are spatially and temporally distant from the organism” (p. 353). This statement precisely summarizes the psychological value of dopamine in our model. Ikemoto and Panksepp (1999) have also noted the relevance of the proximal-distal distinction to mesolimbic dopamine function. They argue that nucleus accumbens dopamine is important for invigorating flexible approach responses (distal), as distinct from consummatory responses (proximal). With respect to our model, and particularly section 7, it should be noted that consummatory behaviors such as free feeding may not be disrupted with only accumbens dopamine depletions. For example, Berridge and Robinson (1998) achieved aphagia with both accumbens and neostriatal dopamine reduction. The striatum may therefore be functionally differentiated, with the ventral region specializing in distal motivation (e.g., approach) and the dorsal region in proximal motivation (e.g., consumption). Modeling this distinction must constitute future work.
Our approach is different from the TD prediction-error hypothesis (see Houk et al., 1995; Schultz et al., 1997; Montague et al., 1996), which suggests that the phasic dopamine response signals the difference (error) between the future reward predicted by the animal and the actual reward received. This error is then used to drive the learning process in a biologically plausible fashion (Waelti, Dickinson, & Schultz, 2001). In contrast, we have proposed a role for tonic dopamine in the generation of expected future reward that is independent of the acquisition process. The advantages of our approach are that we can model the effect of dopamine manipulation not only on the expression of previously acquired behaviors, but also the sensitivity of this effect to the relationship between action (or CS) and outcome (or US). A weakness of our internal model is that it fails to address the role of dopamine in the acquisition process or the phasic response of dopamine neurons themselves. We therefore suggest that future work should be oriented around hybrid approaches aimed at achieving a more comprehensive account of the neuromodulator. Toward this end, a number of discussions and models have been proffered that extend TD-based representations with
explicit internal model representations (Daw, Courville, & Touretzky, 2004; Dayan, 2002; Dayan & Balleine, 2002; Suri & Schultz, 1998; Suri et al., 2001; Suri, 2001, 2002). However, a significant challenge remains in bridging the gap between models of dopamine neuron firing and models of behavioral and psychological phenomena in which dopamine may play a pivotal role. It is particularly important that we achieve a better understanding of whether explicit internal model representations or a cached value function (as in TD) is most appropriate for modeling the brain reward system. To conclude this section, we speculate as to why the brain might need DAtonic. Some suggestions include (1) constraint of the look-ahead process via its action as an online discount factor; (2) adaptation of the trade-off between proximal and distal rewards in response to environment cues (Wilson & Daly, 2004) or deprivational states (Giordano et al., 2002); or (3) global control of general motivated activity. 9 Model Predictions and Future Work A model is only as good as the useful predictions it makes, and so we are actively engaged in a program of experimental validation. In the account presented above, we allowed the model to make an action choice only in response to the onset of an external stimulus. However, within the context of conditioned avoidance, we have also looked at the predictions made by the model if an action can be selected in any state, including the internal timing states. These novel model-driven predictions were subsequently validated with experimental data (Smith et al., 2004), leading us to argue against the preeminent motor deficit hypothesis (Aguilar et al., 2000; Ogren & Archer, 1994) in favor of a motivational hypothesis of APD-induced avoidance disruption in rats. We are looking for interactions of dopamine manipulation with CS-US interval within CA, with a view to further testing the internal model account of motivated behavior.
One of the primary motivations for undertaking this work is to create a computational dopamine hypothesis that can be used to shed light on schizophrenia. We argue that many of the negative symptoms of schizophrenia can be notionally captured by reducing DAtonic, including reduced motivation, flat affect, and anhedonia (given the suggestion that many of the rewards of life may actually be conditioned stimuli; Wise, 2002). Grace (1991) and Moore, West, and Grace (1999) have argued that a tonically hypo-dopaminergic state may be a key step to the development of psychosis, with the latter perceived as a hypersensitivity to phasic dopamine caused as a result of homeostasis to the former. With respect to modeling psychosis, we suggest that an aberrant dopamine signal could lead to the construction or modulation of an aberrant internal model, and an aberrant internal model seems to be an excellent starting place to model delusions. Further research into combining a phasic (learning) role for dopamine with the internal model would be expected to flesh out this hypothesis.
One model assumption deserves a brief revisit. We enforced a very simple policy in the look-ahead process (step 3(a)iiiA), in which the current action being evaluated was always selected at each subsequent hypothetical state. This approach was simple and adequate given the way in which the problems were formulated. A more general and flexible look-ahead process, however, would search different possible actions, as one might in a game of chess, for example. Since a branching search process is potentially costly, an alternative is to represent a Q-value at each state-action unit and then use this value to guide the search process during look-ahead. Such a value could be constructed using either a TD or Monte Carlo method (see Sutton & Barto, 1998). This would allow the agent to trade off the benefits of being able to explicitly simulate the consequences of actions (using the internal model) against the efficiency of simply using a precalculated value (an action-value function). The former is particularly important for being able to modulate behavior after acquisition based on an internal drive (e.g., salt deprivation) that was not present during conditioning. Berridge and Schulkin (1989) have demonstrated just such an ability in animals. There are already open lines of investigation into when and where internal model versus TD-like value functions influence motivated behavior (Dayan, 2002; Dayan & Balleine, 2002). However, using an action-value function to guide the online look-ahead process does not have a direct impact on the current discussion of dopamine function, which is kept as simple as possible. In conclusion, we have demonstrated that an internal model approach is able to account for a range of experimental evidence that suggests that ventral striatal dopamine D2-receptor manipulation selectively modulates motivated behavior for distal versus proximal outcomes.
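The alternative described above, using a cached action value to steer a single rollout instead of a branching search, can be sketched as follows. The data structures, the greedy choice, and the use of DAtonic as a per-step discount are all our illustrative assumptions, not the paper's exact algorithm:

```python
def guided_lookahead(q, transitions, rewards, state, depth, da_tonic=1.0):
    # Roll out one trajectory, picking at each step the action with the highest
    # cached Q-value rather than branching over all actions (chess-style search).
    # q: {state: {action: value}}, transitions: {(state, action): next_state}.
    total, gain = 0.0, 1.0
    for _ in range(depth):
        if state not in q:
            break                       # terminal state: nothing left to simulate
        action = max(q[state], key=q[state].get)
        state = transitions[(state, action)]
        gain *= da_tonic                # tonic dopamine as an online discount factor
        total += gain * rewards.get(state, 0.0)
    return total

# Tiny two-step chain: Start --press--> Wait --wait--> Food (hypothetical values).
q = {"Start": {"press": 1.0, "groom": 0.1}, "Wait": {"wait": 1.0}}
t = {("Start", "press"): "Wait", ("Wait", "wait"): "Food"}
r = {"Food": 1.0}
print(guided_lookahead(q, t, r, "Start", depth=2, da_tonic=0.8))
```

With full dopamine the rollout returns the undiscounted reward; at DAtonic = 0.8 the two gated transitions yield 0.8² of it, reproducing the discounting effect of reduced tonic dopamine while visiting only one trajectory per candidate action.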
Whether an internal model or the cached values of the TD algorithm are better placed to model both a tonic and phasic dopamine response in this brain region is likely to have important implications for understanding a number of human disorders, including schizophrenia and ADHD. Acknowledgments This work was primarily supported by an OMHF Special Initiative grant and a NET grant from the Canadian Institutes of Health Research. S.K. is additionally supported by a Canada Research Chair. Thanks also to Ming Li and Jimmy Jensen for assisting in the review of the conditioned avoidance and fMRI literature (respectively) and to Christopher Fiorillo for reviewing an early draft of this work. References Ader, R., & Clink, D. W. (1957). Effects of chlorpromazine on the acquisition and extinction of an avoidance response in the rat. J. Pharmacol. Exp. Ther., 131, 144–148.
Aguilar, M. A., Mari-Sanmillan, M. I., Morant-Deusa, J. J., & Minarro, J. (2000). Different inhibition of conditioned avoidance response by clozapine and DA D1 and D2 antagonists in male mice. Behav. Neurosci., 114(2), 389–400. Anisman, H., Irwin, J., Zacharko, R. M., & Tombaugh, T. N. (1982). Effects of dopamine receptor blockade on avoidance performance: Assessment of effects on cue-shock and response-outcome associations. Behavioral and Neural Biology, 36, 280–290. Arana, F. S., Parkinson, J. A., Hinton, E., Holland, A. J., Owen, A. M., & Roberts, A. C. (2003). Dissociable contributions of the human amygdala and orbitofrontal cortex to incentive motivation and goal selection. Journal of Neuroscience, 23(29), 9632–9638. Arnt, J. (1982). Pharmacological specificity of conditioned avoidance response inhibition in rats: Inhibition by neuroleptics and correlation to dopamine receptor blockade. Acta Pharmacol. Toxicol. (Copenh.), 51(4), 321–329. Balleine, B. W., Garner, C., Gonzalez, F., & Dickinson, A. (1995). Motivational control of heterogeneous instrumental chains. Journal of Experimental Psychology: Animal Behavior Processes, 21, 203–217. Bechara, A., Damasio, H., Tranel, D., & Anderson, S. W. (1998). Dissociation of working memory from decision making within the human prefrontal cortex. Journal of Neuroscience, 18(1), 428–437. Beck, A. T., & Rector, N. A. (2003). A cognitive model of hallucinations. Cognitive Therapy and Research, 27(1), 19–52. Bell, D. S. (1973). The experimental reproduction of amphetamine psychosis. Archives of General Psychiatry, 29, 35–40. Beninger, R. J. (1989). Dissociating the effects of altered dopaminergic function on performance and learning. Brain Research Bulletin, 23, 365–371. Beninger, R. J., & Hahn, B. L. (1983). Pimozide blocks establishment but not expression of amphetamine-produced environment-specific conditioning. Science, 220, 1304–1306. Beninger, R. J., Mason, S. T., Phillips, A. G., & Fibiger, H. C. (1980a).
The use of conditioned suppression to evaluate the nature of neuroleptic-induced avoidance deficits. J. Pharmacol. Exp. Ther., 213(3), 623–627. Beninger, R. J., Mason, S. T., Phillips, A. G., & Fibiger, H. C. (1980b). The use of extinction to investigate the nature of neuroleptic-induced avoidance deficits. Psychopharmacology (Berl.), 69(1), 11–18. Berridge, K. C., & Robinson, T. E. (1998). What is the role of dopamine in reward: Hedonic impact, reward learning, or incentive salience? Brain Research Reviews, 28, 309–369. Berridge, K., & Robinson, T. E. (2003). Parsing reward. Trends in Neurosciences, 26(9), 507–513. Berridge, K. C., & Schulkin, J. (1989). Palatability shift of a salt-associated incentive during sodium depletion. Quarterly Journal of Experimental Psychology, 41B(2), 121–138. Black, A. H. (1963). The effects of CS-US interval on avoidance conditioning in the rat. Canadian Journal of Psychology, 17(2), 174–182. Blackburn, J. R., & Phillips, A. G. (1989). Blockade of acquisition of one-way conditioned avoidance responding by haloperidol and metoclopramide but not
by thioridazine or clozapine: Implications for screening new antipsychotic drugs. Psychopharmacology, 98, 453–459. Cador, M., Robbins, T. W., & Everitt, B. J. (1989). Involvement of the amygdala in stimulus-reward associations: Interaction with the ventral striatum. Neuroscience, 30(1), 77–86. Cardinal, R. N., Parkinson, J. A., Hall, J., & Everitt, B. J. (2002). Emotion and motivation: The role of the amygdala, ventral striatum, and prefrontal cortex. Neuroscience and Biobehavioral Reviews, 26(3), 321–352. Cardinal, R. N., Pennicott, D. R., Sugathapala, C. L., Robbins, T. W., & Everitt, B. J. (2001). Impulsive choice induced in rats by lesions of the nucleus accumbens core. Science, 292(5526), 2499–2501. Cardinal, R. N., Robbins, T. W., & Everitt, B. J. (2000). The effects of d-amphetamine, chlordiazepoxide, alpha-flupenthixol and behavioural manipulations on choice of signalled and unsignalled delayed reinforcement in rats. Psychopharmacology, 152, 362–375. Catania, A. C. (in press). Attention-deficit/hyperactivity disorder (ADHD): Delay-of-reinforcement gradients and other behavioral mechanisms. Behavioral and Brain Sciences. Connell, P. H. (1958). Amphetamine psychosis. London: Chapman and Hall. Cook, L., & Catania, A. C. (1964). Effects of drugs on avoidance and escape behavior. Federation Proceedings, 23, 818–835. Cook, L., & Weidley, E. (1957). Behavioral effects of some psychopharmacological agents. Ann. N.Y. Acad. Sci., 66, 740–752. Courvoisier, S. (1956a). Pharmacodynamic basis for the use of chlorpromazine in psychiatry. Quarterly Review of Psychiatry and Neurology, 17(1), 25–37. Courvoisier, S. (1956b). Pharmacodynamic basis for the use of chlorpromazine in psychiatry. Journal of Clin. Exp. Psychophathol. Quart. Rev. Psychiat. Neurol., 17, 25–37. Cousins, M. S., Atherton, A., Turner, L., & Salamone, J. D. (1996). Nucleus accumbens dopamine depletions alter relative response allocation in a T-maze cost/benefit task. Behavioural Brain Research, 74, 189–197.
Davidson, A. B., & Weidley, E. (1976). Differential effects of neuroleptic and other psychotropic agents on acquisition of avoidance in rats. Life Sci., 18(11), 1279–1284. Davis, W. M., Capehart, J., & Llewellin, W. L. (1961). Mediated acquisition of a fear-motivated response and inhibitory effects of chlorpromazine. Psychopharmacologia, 2, 268–276. Daw, N. D., Courville, A. C., & Touretzky, D. S. (2004). Timing and partial observability in the dopamine system. In S. Still, W. Bialek, & L. Bottou (Eds.), Advances in neural information processing systems, 16. Cambridge, MA: MIT Press. Dayan, P. (2002). Motivated reinforcement learning. In T. G. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems, 14. Cambridge, MA: MIT Press. Dayan, P., & Balleine, B. W. (2002). Reward, motivation and reinforcement learning. Neuron, 36, 285–298.
Modeling the Functional Role of Dopamine in the Ventral Striatum
389
de Wit, H., Enggasser, J. L., & Richards, J. B. (2002). Acute administration of damphetamine decreases impulsivity in healthy volunteers. Neuropsychopharmacology, 27(5), 813–825. Dews, P. B., & Morse, W. H. (1961). Behavioral pharmacology. Annual Review of Pharmacology, 1, 145–174. Dias, R., Robbins, T. W., & Roberts, A. C. (1996). Disssociation in prefrontal cortex of affective and attentional shifts. Nature, 380(6569), 69–72. DiChiara, G. (1999). Drug addiction as dopamine-dependent associative learning disorder. European Journal of Pharmacology, 375, 13–30. Dickinson, A. (1980). Contemporary animal learning theory. Cambridge: Cambridge University Press. Dickinson, A. (1987). Instrumental performance following saccharin prefeeding. Behavioural Processes, 14, 147–154. Dickinson, A., Nicholas, D. J., & Adams, C. D. (1983). The effect of the instrumental training contingency on susceptibility to reinforcer devaluation. Quarterly Journal of Experimental Psychology, 35B, 35–51. Dickinson, A., Smith, J., & Mirenowicz, J. (2000). Dissociation of Pavlovian and instrumental incentive learning under dopamine antagonists. Behavioral Neuroscience, 40, 468–483. Ettenberg, A. (1989). Dopamine, neuroleptics and reinforced behavior. Neuroscience and Biobehavioral Reviews, 13, 105–111. Ettenberg, A., Koob, G. F., & Bloom, F. E. (1981). Response artifact in the measurement of neuroleptic-induced anhedonia. Science, 213, 357–359. Evenden, J. L., & Robbins, T. W. (1983). Dissociable effects of d-amphetamine, chlordiazepoxide and alpha-flupenthixol on choice and rate measures of reinforcement in the rat. Psychopharmacology, 79, 180–186. Everitt, B. J., Morris, K. A., O’Brien, A., & Robbins, T. W. (1991). The basolateral amygdala-ventral striatal system and conditioned place preference: Further evidence of limbic-striatal interactions underlying reward-related processes. Neuroscience, 42, 1–18. Everitt, B. J., & Robbins, M. C. T. W. (1989). 
Interactions between the amygdala and ventral striatum in the stimulus-reward associations: Studies using a second-order schedule of sexual reinforcement. Neuroscience, 30(1), 63–75. Fibiger, H. C., Carter, D. A., & Phillips, A. G. (1976). Decreased intracranial self-stimulation after neuroleptics of 6-hydroxydopamine: Evidence for mediation by motor deficits rather than by reduced reward. Psychopharmacology, 47, 21–27. Fiorillo, C. D., Tobler, P. N., & Schultz, W. (2003). Discrete coding of reward probability and uncertainty by dopamine neurons. Science, 299, 1898–1902. Fowler, S. C., LaCerra, M. M., & Ettenberg, A. (1986). Effects of haloperidol on the biophysical characteristics of operant responding: Implications for motor and reinforcement processes. Pharmacology, Biochemistry and Behaviour, 25, 791–796. Frances, A., (Ed.). (2000). Diagnostic and statistical manual of mental disorders. Washington, DC: American Psychiatric Association. Giordano, L. A., Bickel, W. K., Loewenstein, G., Jacobs, E. A., Marsch, L., & Badger, G. J. (2002). Mild opioid deprivation increases the degree that opioid-
390
A. Smith, S. Becker, and S. Kapur
dependent outpatients discount delayed heroin and money. Psychopharmacology, 163, 174–182. Gottfried, J. A., O’Doherty, J., & Dolan, R. J. (2002). Appetitive and aversive olfactory learning in humans studied using event-related functional magnetic resonance imaging. Journal of Neuroscience, 22, 10829–10837. Grace, A. A. (1991). Phasic versus tonic dopamine release and the modulation of dopamine system responsivity: A hypothesis for the etiology of schizophrenia. Neuroscience, 41(1), 1–24. Grace, A. A. (2000). Gating of information flow within the limbic system and the pathophysiology of schizophrenia. Brain Research Reviews, 31, 330–341. Grilly, D. M., Johnson, S. K., Minardo, R., Jacoby, D., & LaRiccia, J. (1984). How do tranquilizing agents selectively inhibit conditioned avoidance responding? Psychopharmacology, 84, 262–267. Hollerman, J., & Schultz, W. (1998). Dopamine neurons report an error in the temporal prediction of reward during learning. Nature Neuroscience, 1, 304– 309. Horvitz, J. C. (2000). Mesolimbocortical and nigrostriatal dopamine responses to salient non-reward events. Neuroscience, 96(4), 651–656. Horvitz, J. C. (2002). Dopamine gating of glutamatergic sensorimotor and incentive motivational input signals to the striatum. Behavioural Brain Research, 137, 65–74. Houk, J. C., Adams, J. L., & Barto, A. G. (1995). A model of how the basal ganglia generate and use neural signals that predict reinforcement. In J. C. Houk, J. L. Davis, & D. G. Beiser (Eds.), Models of information processing in the basal ganglia (pp. 249–270). Cambridge, MA: MIT press. Hunt, H. F. (1956). Some effects of drugs on classical (type S) conditioning. Ann. N.Y. Acad. Sci., 65, 258–267. Ikemoto, S., & Panksepp, J. (1999). The role of nucleus accumbens dopamine in motivated behavior: A unifying interpretation with special reference to reward-seeking. Brain Research Reviews, 31(1), 6–41. Irwin, S. (1958). 
Factors influencing acquisition of avoidance behaviour and sensitivity to drugs. Fed. Proc., 17, 380. Iversen, S. D., & Mishkin, M. (1970). Perseverative interference in monkeys following selective lesions of the inferior prefrontal convexity. Experimental Brain Research, 11, 376–386. Janssen, P. A. J., Niemegeers, C. J. E., & Schellekens, K. H. L. (1965). Is it possible to predict the clinical effects of neuroleptic drugs (major tranquilizers) from animal data? Arzneimittelforschung, 15, 104–117. Jensen, J., McIntosh, A. R., Crawley, A. P., Mikulis, D. J., Remington, G., & Kapur, S. (2003). Direct activation of the ventral striatum in anticipation of aversive stimuli. Neuron, 40, 1251–1257. Joseph, M. H., Datla, K., & Young, A. M. J. (2003). The interpretation of the measurement of nucleus accumbens dopamine by vivo dialysis: The kick, the craving or the cognition. Neuroscience and Biobehavioral Reviews, 27, 527– 541. Kaelbling, L., Littman, M., & Moore, A. (1996). Reinforcement learning: A survey. Journal of Artificial Intelligence, 4, 237–285.
Modeling the Functional Role of Dopamine in the Ventral Striatum
391
Kamin, L. J. (1954). Traumatic avoidance learning: The effects of CS-US interval with a trace-conditioning procedure. Journal of Comparative and Physiological Psychology, 47, 65–72. Kandel, E. R., Schwartz, J. H., & Jessell, T. M. (1991). Principles of neural science. New York: Elsevier Science. Kauer, J. A. (2003). Addictive drugs and stress trigger a common change at VTA synapses. Neuron, 37, 549–550. Key, B. J. (1961). The effects of drugs on discrimination and sensory generalisation of auditory stimuli in cats. Psychopharmacologia, 2, 352–363. Kilts, C. D. (2001). The changing roles and targets for animal models of schizophrenia. Biological Psychiatry, 50(11), 845–855. Low, L. A., Eliasson, M., & Kornetsky, C. (1966). Effects of chlorpromazine on avoidance acquisition as a function of CS-US interval length. Psychopharmacologia, 10, 148–154. Low, L. A., & Low, H. L. (1962). Effects of CS-US interval upon avoidance responding. Journal of Comparative and Physiological Psychology, 55(6), 1059– 1061. Maffii, G. (1959). The secondary conditioned response of rats and effects of some psychopharmacological agents. Journal of Pharmacy and Pharmacology, 11, 129–139. Maher, B., & Ross, J. S. (1984). Delusions. In H. E. Adams, & P. Sutker (Eds.), Comprehensive handbook of psychopathology (pp. 383-409). New York: Plenum Press. Masterman, D. L., & Cummings, J. L. (1997). Frontal-subcortical circuits: The anatomical basis of executive, social and motivated behaviors. Journal of Psychopharmacology, 11(2), 107–114. McClure, S. M., Daw, N., & Montague, P. R. (2003). A computational substrate for incentive salience. Trends in Neuroscience, 26(8), 423–428. Miller, R. E., Murphy, J. V., & Mirsky, A. (1957). The effect of chlorpromazine on fear-motivated behavior in rats. Journal of Pharmacol. and Exper. Therap., 120, 379–387. Montague, P. R., & Berns, G. S. (2002). Neural economics and the biological substrates of valuation. Neuron, 36, 265–284. Montague, P. 
R., Dayan, P., Person, C., & Sejnowski, T. J. (1995). Bee foraging in uncertain environments using predictive hebbian learning. Nature, 377, 725–728. Montague, P. R., Dayan, P., & Sejnowski, T. J. (1996). A framework for mesencephalic dopamine systems based on predictive Hebbian learning. Journal of Neuroscience, 16(5), 1936–1947. Moore, H., West, A. R., & Grace, A. A. (1999). The regulation of forebrain dopamine transmission: Relevance to the pathophysiology and psychopathology of schizophrenia. Biological Psychiatry, 46, 40–55. Morpurgo, C. (1965). Drug-induced modifications of discriminated avoidance behavior in rats. Psychopharmacologia, 8, 90–99. Nader, K., & LeDoux, J. (1999). The dopaminergic modulation of fear: Quinpirole impairs the recall of emotional memories in rats. Behavioral Neuroscience, 113(1), 152–165.
392
A. Smith, S. Becker, and S. Kapur
Nishijo, H., Ono, T., & Nishino, H. (1988). Single neuron responses in amygdala of alert monkey during complex senosry stimulation with affective significance. Journal of Neuroscience, 8, 3570–3583. O’Doherty, J. P., Dayan, P., Friston, K., Critchley, H., & Dolan, R. J. (2003). Temporal difference models and reward-related learning in the human brain. Neuron, 28, 329–337. O’Doherty, J. P., Deichmann, R., Critchley, H. D., & Dolan, R. J. (2002). Neural responses during anticipation of primary taste reward. Neuron, 33, 815–826. Ogren, S. O., & Archer, T. (1994). Effects of typical and atypical antipsychotic drugs on two-way active avoidance: Relationship to DA receptor blocking profile. Psychopharmacology (Berl.), 114(3), 383–391. Pennartz, C. M. A., Groenewegen, H. J., & Silva, F. H. L. D. (1994). The nucleus accumbens as a complex of functionally distinct Neuronal ensembles: An integration of behavioural, electrophysiological and anatomical data. Progress in Neurobiology, 42, 719–761. Phares, V. (2003). Understanding abnormal child psychology. New York: Wiley. Phillips, P. E. M., Stuber, G. D., Helen, M. L. A. V., Wightman, R. M., & Carelli, R. M. (2003). Subsecond dopamine release promotes cocaine seeking. Nature, 422, 614–618. Ponsluns, D. (1962). An analysis of chlorpromazine-induced suppression of the avoidance response. Psychopharmacologia, 3, 361–373. Redgrave, P., Prescott, T. J., & Gurney, K. (1999). Is the short-latency dopamine response too short to signal reward error? Trends in Neurosciences, 22(4), 146– 151. Richards, J. B., Sabol, K. E., & de Wit, H. (1999). Effects of methamphetamine on the adjusting amount of procedure, a model of impulsive behavior in rats. Psychopharmacology, 146, 432–439. Rizley, R. C., & Rescorla, R. A. (1972). Associations in second-order conditioning and sensory preconditioning. Journal of Comparative and Physiological Psychology, 81(1), 1–11. Robbins, T. W., Cador, M., Taylor, J. R., & Everitt, B. J. (1989). 
Limbic-striatal interactions in reward-related processes. Neuroscience and Biobehavioural Reviews, 13, 155–162. Rolls, E. T., Rolls, B. J., Kelly, P. H., Shaw, S. G., Wood, R. J., & Dale, R. (1974). The relative attenuation of self-stimulation, eating and drinking produced by dopamine-receptor blockade. Psychopharmacology, 38, 219–230. Salamone, J. D., Cousins, M. S., & Bucher, S. (1994). Anhedonia or anergia? Effects of haloperidol and nucleus accumbens dopamine depletion on instrumental response selection in a T-maze cost/benefit procedure. Behavioural Brain Research, 65, 221–229. Salamone, J. D., Cousins, M. S., & Snyder, B. J. (1997). Behavioural functions of nucleus accumbens dopamine: Empirical and conceptual problems with the anhedonia hypothesis. Neuroscience and Biobehavioural Reviews, 21(3), 341– 359. Salamone, J. D., Kurth, P. A., McCullough, L. D., Sokolowski, J. D., & Cousins, M. S. (1993). The role of brain dopamine in response initiations: Effects of
Modeling the Functional Role of Dopamine in the Ventral Striatum
393
haloperidol and regionally-specific dopamine depletions on the local rate of instrumental responding. Brain Research, 628, 218–226. Salamone, J. D., Steinpreis, R. E., McCullough, L. D., Smith, P., Grebel, D., & Mahan, K. (1991). Haloperidol and nucleus accumbens dopamine depletion suppress lever pressing for food but increase free food consumption in a novel food-choice procedure. Psychopharmacology, 104, 515–521. Salamone, J. D., Wisniecki, A., Carlson, B. B., & Correa, M. (2001). Nucleus accumbens dopamine depletions make animals highly sensitive to high fixed ratio requirements but do not impair primary food reinforcement. Neuroscience, 105(4), 863–870. Schmajuk, N. A. (1988). The hippocampus and the classically conditioned nictitating membrane response: A real-time attentional-associative model. Psychobiology, 16(1), 20–35. Schmajuk, N. A., Cox, L., & Gray, J. A. (2001). Nucleus accumbens, entorhinal cortex and latent inhibition: A neural network model. Behavioural Brain Research, 118, 123–141. Schneider, J. S., Sun, Z. Q., & Roeltgen, D. P. (1994). Effects of dopamine agonists on delayed response performance in chronic low-dose MPTP-treated monkeys. Pharmacology, Biochemistry and Behaviour, 48(1), 235–240. Schultz, W., Dayan, P., & Montague, P. R. (1997). A neural substrate of prediction and reward. Science, 275, 1593–1599. Schultz, W., Tremblay, L., & Hollerman, J. R. (2000). Reward processing in primate orbitofrontal cortex and basal ganglia. Cerebral Cortex, 10, 272–283. Seeman, P., & Madras, B. K. (1998). Anti-hyperactivity medication: Methylphenidate and amphetamine. Molecular Psychiatry, 3(5), 386–396. Smith, A., Li, M., Becker, S., & Kapur, S. (2004). A model of antipsychotic action in conditioned avoidance: A computational approach. Neuropsychopharmacology, 29(6), 1040–1049. Solanto, M. V., Abikoff, H., Sonuga-Barke, E., Schachar, R., Logan, G. D., Wigal, T., Hechtman, L., Hinshaw, S., & Turkel, E. (2001). 
The ecological validity of delay aversion and response inhibition as measures of impulsivity in AD/HD: A supplement to the NIMH multimodal treatment study of AD/HD. Journal of Abnormal Child Psychology, 29(3), 215–228. Stark, H., Bischof, A., & Scheich, H. (1999). Increase of extracellular dopamine in prefrontal cortex of gerbils during acquisition of the avoidance strategy in the shuttle box. Neuroscience Letters, 264, 77–80. Suri, R. E. (2001). Anticipatory responses of dopamine neurons and cortical neurons reproduced by internal model. Experimental Brain Research, 140, 234–240. Suri, R. E. (2002). TD models of reward predictive responses in dopamine neurons. Neural Networks: Special Issue on Computational Models of Neuromodulation, 15(4-6), 523–533. Suri, R. E., Bargas, J., & Arbib, M. A. (2001). Modeling functions of striatal dopamine modulation in learning and planning. Neuroscience, 103(1), 65–85. Suri, R., & Schultz, W. (1998). Learning of sequential movements by neural network model with dopamine-like reinforcement signal. Experimental Brain Research, 121, 350–354.
394
A. Smith, S. Becker, and S. Kapur
Sutton, R. S. (1988). Learning to predict by the method of temporal differences. Machine Learning, 3, 9–44. Sutton, R. S., & Barto, A. G. (1981). An adaptive network that constructs and uses an internal model of its world. Cognition and Brain Theory, 4(3), 217–246. Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning. Cambridge, MA: MIT Press. Talk, A. C., Gandhi, C. C., & Matzel, L. D. (2002). Hippocampal function during behaviourally silent association learning: Dissociation of memory storage and expression. Hippocampus, 12, 648–656. Taylor, J. R., & Robbins, T. W. (1984). Enhanced behavioural control by conditioned reinforcers following microinjections of d-amphetamine into the nucleus accumbens. Psychopharmacology, 84, 405–412. Tremblay, L., & Schultz, W. (2000a). Modifications of reward expectation–related neuronal activity during learning in primate orbitofrontal cortex. Journal of Neurophysiology, 83, 1877–1885. Tremblay, L., & Schultz, W. (2000b). Reward-related neuronal activity during go–nogo task performance in primate orbitofrontal cortex. Journal of Neurophysiology, 83, 1864–1876. van der Heyden, J. A. M., & Bradford, L. D. (1988). A rapidly acquired oneway conditioned avoidance procedure in rats as a primary screening test for antipsychotics: Influence of shock intensity on avoidance performance. Behavioural Brain Research, 31, 61–67. Wade, T. R., de Wit, H., & Richards, J. B. (2000). Effects of dopaminergic drugs on delayed reward as a measure of impulsive behavior in rats. Psychopharmacology, 150, 90–101. Wadenberg, M. G., Soliman, A., Vanderspek, S. C., & Kapur, S. (2001). Dopamine D2 receptor occupancy is a common mechanism underlying animal models of antipsychotics and their clinical effects. Neuropsychopharmacology, 25(25), 633–641. Wadenberg, M. L., & Hicks, P. B. (1999). The conditioned avoidance response test reevaluated: Is it a sensitive test for the detection of potentially atypical antipsychotics? Neurosci. Biobehav. 
Rev., 23(6), 851–862. Wadenberg, M.-L. G., Kapur, S., Soliman, A., Jones, C., & Vaccarino, F. (2000). Dopamine D2 receptor occupancy predicts catalepsy and the supression of conditioned avoidance response behaviour in rats. Psychopharmacology, 150, 422–429. Waelti, P., Dickinson, A., & Schultz, W. (2001). Dopamine responses comply with basic assumptions of formal learning theory. Nature, 412, 43–48. Wilkinson, L. S., Humby, T., Killcross, A. S., Torres, E. M., Everitt, B. J., & Robbins, T. W. (1998). Dissociations in dopamine release in medial prefrontal cortex and ventral striatum during the acquisition and extinction of classical aversive conditioning in the rat. European Journal of Neuroscience, 10, 1019–1026. Wilson, M., & Daly, M. (2004). Do pretty women inspire men to discount the future? Biology letters, 271(S4), 177–179. Wise, R. A. (1982). Neuroleptics and operant behavior: The anhedonia hypothesis. Behavioural and Brain Sciences, 5, 39–87.
Modeling the Functional Role of Dopamine in the Ventral Striatum
395
Wise, R. A. (2002). Brain reward circuitry: Insights from unsensed incentives. Neuron, 36, 229–240. Wise, R. A., & Schwartz, H. V. (1981). Pimozide attenuates acquisition of leverpressing for food in rats. Pharmacology, Biochemistry and Behavior, 15, 655–656. Wise, R. A., Spindler, J., DeWit, H., & Gerber, G. J. (1978). Neuroleptic-induced ”anhedonia” in rats: Pimozide blocks reward quality of food. Science, 201, 262–264. Young, A. M. J., Ahier, R. G., Upton, R. L., Joseph, M. H., & Gray, J. A. (1998). Increased extracellular dopamine in the nucleus accumbens of the rat during associative learning of neutral stimuli. Neuroscience, 83(4), 1175–1183. Received January 9, 2004; accepted July 2, 2004.
LETTER
Communicated by Bruno Olshausen
A Hierarchical Bayesian Model for Learning Nonlinear Statistical Regularities in Nonstationary Natural Signals

Yan Karklin
[email protected]
Michael S. Lewicki
[email protected]
Computer Science Department and Center for the Neural Basis of Cognition, Carnegie Mellon University, Pittsburgh, PA 15213, U.S.A.
Capturing statistical regularities in complex, high-dimensional data is an important problem in machine learning and signal processing. Models such as principal component analysis (PCA) and independent component analysis (ICA) make few assumptions about the structure in the data and have good scaling properties, but they are limited to representing linear statistical regularities and assume that the distribution of the data is stationary. For many natural, complex signals, the latent variables often exhibit residual dependencies as well as nonstationary statistics. Here we present a hierarchical Bayesian model that is able to capture higher-order nonlinear structure and represent nonstationary data distributions. The model is a generalization of ICA in which the basis function coefficients are no longer assumed to be independent; instead, the dependencies in their magnitudes are captured by a set of density components. Each density component describes a common pattern of deviation from the marginal density of the pattern ensemble; in different combinations, they can describe nonstationary distributions. Adapting the model to image or audio data yields a nonlinear, distributed code for higher-order statistical regularities that reflect more abstract, invariant properties of the signal.

1 Introduction

The goal of many algorithms in machine learning, signal processing, and computational perception is to discover and process intrinsic structures in the data. Extracting these from real signals is a difficult problem, because often the relationships among the observable variables are complex, and there is little a priori knowledge about the types of structures that exist. When some a priori knowledge is available, specialized algorithms can be designed, but this approach is generally less desirable, as it places restrictions on the type of structure that can be learned.
Neural Computation 17, 397–423 (2005)
© 2005 Massachusetts Institute of Technology

Another difficulty is that the dimensionality of the data is often very high, and properties of interest lie in a relatively low-dimensional subspace. Because of the inherent variability of most real-world signals, intrinsic regularities are statistical in nature, which makes them that much more difficult to learn.

One approach to learning statistical regularities is to formulate a probabilistic model of how the data are generated and adapt its parameters to fit the observed distribution. The adapted parameters reflect the statistics of the data ensemble, while internal representations encode individual data patterns. These models make minimal assumptions about the data and can result in more general representations than those in algorithms tailored for specific tasks or types of data.

There are several ways in which data patterns are represented in probabilistic generative models. Distributed representations of linear componential models, such as those for principal component analysis (PCA) and independent component analysis (ICA), are particularly useful for modeling complex high-dimensional data because they can capture independent regularities with independent internal parameters (Bell & Sejnowski, 1995). This makes it possible to model a continuum of different statistical relationships and allows scaling of the algorithms to large numbers of dimensions. Current models, however, are limited in the type of structure they can represent; in order to understand these limitations, it is helpful to look at their mathematical formulation.

Linear componential models achieve a distributed representation by describing the data as a combination of linear basis functions (for a review, see Hyvärinen, Karhunen, & Oja, 2001; Cichocki & Amari, 2002). This yields a probabilistic generative model in which the data (x) are generated as a linear combination of basis functions (A) weighted by coefficients (u),

    x = Au.    (1.1)
The likelihood of the observed data under this model is

    p(x) = p(u) / |det(A)|    (1.2)

(Pearlmutter & Parra, 1996; Cardoso, 1997), and the basis function matrix A is adapted to maximize the data likelihood. The coefficients u are the unknown (latent) variables. They are assumed to be independent and identically distributed (i.i.d.),

    p(u) = ∏i p(ui).    (1.3)

The priors p(ui) are typically chosen to be fixed sparse distributions (although parameters of the prior may be adjusted to maximize data likelihood). Because basis function coefficients are assumed to be i.i.d., the dependence among the data is represented solely by the learned matrix of basis functions.
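As a concrete illustration of equations 1.1 to 1.3, the sketch below (not code from the article; the unit-scale Laplacian prior and the 4-dimensional example are arbitrary choices) evaluates log p(x) for a square basis matrix by inverting x = Au and applying the change-of-variables formula of equation 1.2.

```python
import numpy as np

def ica_log_likelihood(x, A):
    """log p(x) = sum_i log p(u_i) - log|det A|, with u = A^{-1} x
    and a unit-scale Laplacian prior p(u_i) = exp(-|u_i|) / 2."""
    u = np.linalg.solve(A, x)                  # invert the generative map x = A u
    log_prior = np.sum(-np.abs(u) - np.log(2.0))
    return log_prior - np.log(np.abs(np.linalg.det(A)))

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
x = A @ rng.laplace(size=4)                    # a sample generated by the model
print(ica_log_likelihood(x, A))
```

With A equal to the identity, the |det A| correction vanishes and the likelihood is just the product of the coefficient priors, which makes the formula easy to sanity-check.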
The obvious limitation of this model is that its inherent linearity restricts the type of structure it can capture. Even simple, low-dimensional data often exhibit statistical dependencies that cannot be captured by linear transformations. In many applications, data are complex and rich with statistical structure, and latent variables of linear models adapted to these data exhibit significant residual mutual dependence (Hyvärinen & Hoyer, 2000; Schwartz & Simoncelli, 2001; Karklin & Lewicki, 2003).

Another shortcoming of these models is that they assume that the statistical regularities in the data do not change; they describe stationary probability distributions. For example, once model parameters are adapted in ICA, both the prior and the basis functions are fixed, leading to a stationary distribution over the data. This does not depend on the form of the prior and also applies to models with adaptive or entirely nonparametric priors. In many domains, however, the statistics of the data are known to change, as the physical properties of the environment or conditions for data acquisition vary. While the stationary prior assumption gives a valid approximation of the true density over a large enough corpus of training data, it does not reflect the variation across contexts that is observed in many signals.

Figure 1 illustrates nonstationary statistics observed in images of natural scenes. ICA basis functions were adapted to 20 × 20 patches taken from an ensemble of natural images. Over the full ensemble of the training data, the basis function coefficients have marginal distributions that are consistent with the prior assumed by the model (not shown). However, computing coefficient histograms over particular image regions reveals systematic deviations from the (globally valid) stationary distribution. Patterns in the histograms suggest that basis functions of certain orientations are more active in some parts of the image (e.g., the textured, oriented surface of the log), while in other regions, different subsets tend to be activated. This is observed in other types of data as well: temporal basis functions adapted to speech also yield coefficients whose statistics vary greatly across local regions of the signal (see Figure 2).

Figures 1 and 2 give just a few examples of patterns in latent variable distributions that depend on the local context. In fact, there is a wide range of statistical regularities in complex data, a continuum of contexts that is as multidimensional as the physical properties of the environment that give rise to it. Local representations, as employed by clustering or mixture model techniques, assume that the contexts are discrete and thus cannot describe regularities that arise from a combination of different contexts. A model that captures this variation must form flexible, distributed representations of higher-order structure. Moreover, because the dimensions of the contexts are not known a priori, the model must be able to automatically discover this underlying structure. Finally, many previous models of nonstationary distributions have relied on the assumption that data statistics vary smoothly from sample to sample (Everson & Roberts, 1999; Pham & Cardoso, 2001) and computed local estimates of context-dependent variation.
Figure 1: The distribution of ICA basis function coefficients exhibits nonstationary statistics that reflect local image structure. (a) A subset of image basis functions learned from an ensemble of natural images, ordered by orientation. The small black square on the image indicates the size, relative to the image, of the learned basis functions. (b) Coefficients of independent components were computed over two regions of an image. (c, d) Histograms of the coefficients for the two regions reveal patterns in the joint distributions. Each histogram in the 10 × 10 grid in c and d corresponds to a basis function at the same grid position in a and is normalized so that the filled area sums to 1. Different types of local image structure produce different patterns in the joint activities. For example, the image region containing the log yields higher coefficient variation for basis functions oriented along the grain and matching the approximate spatial frequency of the wood texture.
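The region-dependent coefficient statistics illustrated in Figures 1 and 2 can be mimicked with a small synthetic sketch (entirely hypothetical filters and data, not the article's experiment): projecting patches from two "regions" with different covariance structure onto the same filter bank yields clearly different per-filter variances, even though each filter's pooled statistics could look stationary.

```python
import numpy as np

rng = np.random.default_rng(2)
W = rng.standard_normal((10, 25))              # hypothetical filter bank (rows = filters)
W /= np.linalg.norm(W, axis=1, keepdims=True)  # unit-norm rows

def region_variances(patches, W):
    """Variance of each filter's coefficient over a set of patches."""
    u = patches @ W.T                          # coefficients, one row per patch
    return u.var(axis=0)

# Two regions with different covariance structure (stand-ins for, say, wood
# grain vs. sky): each amplifies a different subset of input dimensions.
region1 = rng.standard_normal((5000, 25)) * np.linspace(0.2, 2.0, 25)
region2 = rng.standard_normal((5000, 25)) * np.linspace(2.0, 0.2, 25)

v1 = region_variances(region1, W)
v2 = region_variances(region2, W)
print(np.round(v1, 2))
print(np.round(v2, 2))
```

The variance patterns v1 and v2 play the role of the per-region histograms of Figure 1 and the variance traces of Figure 2: the same filters, measured over different contexts, report different scales.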
Figure 2: ICA basis functions adapted to speech data also exhibit nonstationary statistical dependencies. (a) A subset of 256 ICA-derived basis functions ordered by dominant frequency. (b) Each basis function was convolved with three different regions of a speech signal. The length of the basis function is indicated by the short bar above the start of the speech signal. (c) The variances of coefficients sampled over the three regions, with the 256 coefficients ordered by frequency as in a. Although all basis function coefficients have unit variance when sampled over the whole data ensemble, local regions show characteristic variance patterns that reflect local signal structure.
This assumption does not always hold; even spatially and temporally coherent data exhibit abrupt changes that cannot be modeled as slowly evolving processes.

Here we address the limitations of previous models with a hierarchical Bayesian model that forms a distributed code of higher-order statistical regularities and captures nonstationarities in the data distribution. The model is a generalization of ICA; thus, we begin with a standard linear componential model in which the data are generated as a combination of linear basis functions. However, instead of assuming that the basis function coefficients are independent (and their joint prior distribution is factorable; see equation 1.3), we explicitly model the dependence among hyperparameters of their priors. In order to capture variable, context-dependent activation of basis functions, the dependence is specified through the scale parameters governing the width of the prior (and hence the variance of the coefficients). This dependence is modeled with a set of density components, a distributed code that describes the shape of the joint density of the linear coefficients and captures patterns in the variances of the coefficients as observed in the motivating examples. Each density component describes a common underlying deviation from the standard assumption of independence (the i.i.d. joint prior) associated with a frequently encountered context. Using a weighted combination of density components, the model is able to represent a continuum of context-dependent changes in probability distributions. Adapting the set of density components and modeling their activation with a sparse prior yields a compact description of higher-order statistical regularities of the data ensemble. Unlike other recent methods, the model makes no assumptions of temporal or spatial coherence; it is able to infer, independently for each data sample, the higher-order code that describes the generating distribution.

Below we present the probabilistic framework for the model and describe the associated learning algorithms. Previously, we have used this model to discover higher-order structure in natural images (Karklin & Lewicki, 2003). Here, we describe the algorithm in more detail and frame it as a general method of statistical density estimation for high-dimensional nonstationary data. We verify the recovery of correct model parameters using a toy data set, apply the learning algorithm to a wider range of data types, and show how the learned higher-order code accounts for observed dependencies. We provide results and analysis for photographs of natural scenes, scanned images of newspapers, and speech waveforms.
However, the model is not tailored specifically to images or audio data and can be used to automatically learn the nonlinear statistical dependencies in any data set with sufficiently rich structure.

2 A Hierarchical Model for Nonstationary Distributions

Our model is a generalization of previous linear models. Hence, we begin by assuming that each data vector is generated as a combination of linear basis functions, x = Au. As in standard ICA models (e.g., Cichocki & Amari, 2002), basis function coefficients are assumed to be sparsely distributed. Here we use a generalized gaussian distribution with zero mean:

p(u_i) = N(0, λ_i, q_i)                     (2.1)
       = z_i exp( −|u_i/λ_i|^{q_i} ),       (2.2)
where z_i = q_i/(2 λ_i Γ(1/q_i)) is a normalizing constant. The parameter q_i determines the weight of the distribution's tails and can be estimated from the data; in many ICA applications, the coefficients tend to be sparse, making
Learning Density Components
403
their distributions supergaussian (q_i < 2). Typically, the scale parameter λ_i is fixed to a constant, since the basis functions in A can themselves scale to fit the data.

In order to capture residual dependence among the coefficients u, we must abandon the assumption of fixed, independent priors. The motivating examples suggested that intrinsic structures in the data give rise to patterns in the scales of the coefficients (similar dependencies have been observed previously in wavelet coefficients; Simoncelli, 1997). A natural way to model this is through the scale parameters of the prior, which we model as a nonlinear transformation of latent higher-order variables. Specifically, we use a matrix of density components B and density component coefficients v to describe the logarithm of the scale parameter,

log(λ/c) = Bv.       (2.3)
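To make these definitions concrete, the short sketch below (helper names are ours, not the paper's) evaluates the generalized gaussian log-density of equation 2.2 and the scale map λ = c exp(Bv), and checks numerically that the prior normalizes and has unit variance when the right side of equation 2.3 is zero.

```python
import math

def gg_logpdf(u, lam, q):
    # Log-density of the zero-mean generalized gaussian N(0, lam, q):
    # p(u) = z * exp(-|u/lam|**q), with z = q / (2 * lam * Gamma(1/q)).
    z = q / (2.0 * lam * math.gamma(1.0 / q))
    return math.log(z) - abs(u / lam) ** q

def scale(bv_i, q):
    # lam_i = c * exp([Bv]_i); the constant c = sqrt(Gamma(1/q) / Gamma(3/q))
    # makes the coefficient variance equal to 1 when [Bv]_i = 0.
    c = math.sqrt(math.gamma(1.0 / q) / math.gamma(3.0 / q))
    return c * math.exp(bv_i)

# Numerical check for a supergaussian exponent (q < 2).
q = 1.0
lam = scale(0.0, q)
du = 1e-3
mass = var = 0.0
for i in range(-40000, 40001):
    u = i * du
    p = math.exp(gg_logpdf(u, lam, q)) * du
    mass += p
    var += u * u * p
```

Both `mass` and `var` come out within numerical error of 1, as the normalizer in equation 2.2 and the choice of c require.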
If we define the constant c = √(Γ(1/q)/Γ(3/q)), the variance of the coefficients becomes 1 when the right side of the equation is 0 (this becomes convenient when a zero-centered prior is selected for the distribution of v; see below). The joint prior distribution of the coefficients u can now be expressed as

−log p(u|B, v) ∝ Σ_i ( [Bv]_i + | u_i / (c exp([Bv]_i)) |^{q_i} ),       (2.4)
where [Bv]_i represents the ith element of the vector Bv (see the appendix for the derivation). Basis function coefficients are assumed to be independent conditional on the higher-order variables, p(u|v) = Π_i p(u_i|v). This accounts for the dependence in the magnitudes of basis function coefficients.

The new form of the prior (2.4) implies that if v is 0, the model reduces to standard ICA in which the linear coefficients are independent and identically distributed with variance equal to 1. Nonzero values of v scale and combine density components (columns of B) that define patterns in the distributions of u. Because each v_i can be positive or negative, each density component represents contrast in the magnitudes of the coefficients u (see Figure 3).

We place a nongaussian, sparse prior on the latent variables v and infer their values for each data sample.¹ This means that a priori, we assume that the activity of density component coefficients is sparse, and relatively few components are needed to describe how the generating distribution associated with each data sample differs from the i.i.d. ICA model. Using this parameterization, we adapt the density components to the entire data ensemble, which produces a compact description of higher-order statistical regularities.

¹ A Laplacian prior was used in the simulations, but other distributions may be more appropriate.
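As a sanity check on this reduction, a minimal numpy sketch (function and variable names are ours) evaluates the energy in equation 2.4 and confirms that setting v = 0 recovers the i.i.d. ICA prior with unit-variance scales:

```python
import numpy as np
from math import gamma, sqrt

def neg_log_prior(u, B, v, q):
    # -log p(u | B, v) up to an additive constant (equation 2.4):
    # sum_i [Bv]_i + |u_i / (c * exp([Bv]_i))|**q
    c = sqrt(gamma(1.0 / q) / gamma(3.0 / q))
    bv = B @ v
    return np.sum(bv + np.abs(u / (c * np.exp(bv))) ** q)

rng = np.random.default_rng(0)
B = rng.normal(size=(9, 4))     # 9 coefficients, 4 density components
u = rng.laplace(size=9)
q = 1.0
c = sqrt(gamma(1.0 / q) / gamma(3.0 / q))

# With v = 0 the energy is that of independent, identically scaled priors.
e0 = neg_log_prior(u, B, np.zeros(4), q)
assert np.isclose(e0, np.sum(np.abs(u / c) ** q))
```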
404
Y. Karklin and M. Lewicki
Figure 3: Each density component defines a pattern in the joint distribution p(u). The plot at the top shows an example nine-dimensional density component B_j; below it, the distributions of the coefficients u_1, ..., u_9 are shown for v_j = 1, v_j = 0, and v_j = −1. Here we show only a single density component B_j, whereas the model adapts a set of them, B = {B_1, B_2, ..., B_M}, to obtain a compact description of common scale patterns in the data.
The full generative model is shown in graphical form in Figure 4. There are two sets of random variables that give rise to the data, v and u, and two sets of parameters adapted to the data: the linear basis functions A and the density components B. A crucial difference between this generative form and several other models that account for higher-order dependence is that here, the density components specify a distribution over the coefficients, as opposed to exact values or pooled magnitudes, which have been used in other models (Hoyer & Hyvärinen, 2002; Welling, Hinton, & Osindero, 2003). Thus, the model forms a hierarchical representation in which the lower level codes data values precisely and the higher level represents more abstract properties associated with the shape of the data distribution.

3 Inference of Density Component Coefficients

For each data sample, it is necessary to compute the higher-order representation v that best describes the pattern in the scale of the coefficients u. This transformation is nonlinear and cannot be expressed in closed form. Here, we compute the best value of v by maximizing the posterior distribution,

v̂ = argmax_v p(v|u, B)             (3.1)
  = argmax_v p(u|B, v) p(v).       (3.2)
v_i ∼ N(0, 1, q_i)
λ_j = c exp([Bv]_j)
u_j | λ_j ∼ N(0, λ_j, q_j)
x_k = Σ_j A_kj u_j

Figure 4: Schematic of the hierarchical generative model. Sparsely distributed random variables v specify (through a nonlinear transformation) the scale hyperparameters λ for the distribution of the coefficients u. The data x are a linear combination of the coefficients u. Matrices A and B are parameters that are adapted to the statistical distribution of the data.
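The generative process of Figure 4 can be sketched as an ancestral sampler (a toy illustration with invented shapes; we exploit the fact that |u/λ|^q is Gamma(1/q)-distributed to sample the generalized gaussian for arbitrary q):

```python
import numpy as np
from math import gamma, sqrt

def gg_sample(rng, lam, q):
    # Sample the zero-mean generalized gaussian p(u) ∝ exp(-|u/lam|**q):
    # |u/lam|**q ~ Gamma(1/q, 1), combined with a random sign.
    g = rng.gamma(1.0 / q, 1.0, size=np.shape(lam))
    s = rng.choice([-1.0, 1.0], size=np.shape(lam))
    return s * lam * g ** (1.0 / q)

def generate(rng, A, B, q=1.0):
    # Ancestral sampling of Figure 4: v -> lam -> u -> x.
    c = sqrt(gamma(1.0 / q) / gamma(3.0 / q))
    v = rng.laplace(size=B.shape[1])     # sparse higher-order code
    lam = c * np.exp(B @ v)              # scale hyperparameters
    u = gg_sample(rng, lam, q)           # conditionally independent coeffs
    x = A @ u                            # observed data
    return v, lam, u, x

rng = np.random.default_rng(1)
A = rng.normal(size=(16, 16))
B = 0.3 * rng.normal(size=(16, 5))
v, lam, u, x = generate(rng, A, B)
```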
We assume that the v_i's are independent (p(v) = Π_i p(v_i)) and sparsely distributed (log p(v_i) ∝ −|v_i|). For the simulations below, v̂ was derived by gradient ascent. We used second-order methods (LeCun, Bottou, Orr, & Müller, 1998) to stabilize and speed up convergence to optimal estimates. Because the prior is zero centered and sparse, only a few nonzero values will contribute to the representation of each data sample.

The inference of optimal density component coefficients is analogous to estimating a sample variance from a single observation, but the problem is further constrained by the structure of the learned density components. Because the model is constrained to describe the pattern of variance with a sparse combination of density components, the value of v for a typical pattern is usually well determined. In addition, the high dimensionality of the input facilitates the inference process, as it provides more directions of variation that make up the variance pattern.

4 Adapting Model Parameters to the Data

The linear basis functions and the density components are adapted to the data ensemble by maximizing the posterior p(A, B|X). We assume that samples in the data ensemble X = {x_1, ..., x_N} are independent, so that

p(X|A, B) = Π_{n=1}^{N} p(x_n|A, B).       (4.1)
For each data sample x, the posterior distribution is

p(A, B|x) ∝ p(x|A, B) p(A, B)          (4.2)
          = p(u|B) p(B) / |det(A)|.    (4.3)
Ideally, the marginal distribution p(u|B) would be computed by integrating over v, but evaluating this integral for equation 2.4 is intractable. Here we approximate it using the maximum a posteriori estimate v̂:

p(u|B) = ∫ p(u|B, v) p(v) dv       (4.4)
       ≈ p(u|B, v̂) p(v̂).          (4.5)
Substituting this approximation into the posterior gives

p(A, B|x) ∝ p(u|B, v̂) p(v̂) p(B) / |det A|.       (4.6)
The prior on B places a small a priori bias toward small values of B_ij and eliminates the problem of a degenerate case in which B grows without bounds while the v's rescale to be smaller. For the results here, we assumed that B_ij followed a gaussian distribution.

The matrices A and B can be optimized iteratively by maximizing p(A|X, B) and then maximizing p(B|X, A). In this case, the first step amounts to performing ICA in which the priors incorporate the scale estimates v̂. Alternatively, we can assume that the optimal linear basis functions are largely independent of the set of density components and optimize B using a fixed A. For computational efficiency, A and B were assumed to be independent and were adapted separately in the simulations described below. We confirmed the validity of this approach by training a model on data of reduced dimensionality and with fewer density components; the results were qualitatively similar to optimizing the parameters independently.

In order to verify that the learning algorithm produces a valid solution, we adapted model parameters to an artificial data set for which the optimal solution was known. The data were generated by constructing a set of density components and then sampling basis function coefficients according to p(u|B). An illustration of the process and the obtained results is shown in Figure 5. Optimizing density components from random initial values produced a matrix that was identical (up to a permutation of its columns) to the true model parameters (see Figures 5a and 5b). The patterns in the learned density components specify nonlinear dependencies among coefficient magnitudes; in fact, there are no linear correlations among basis function coefficients sampled from the model (even when the same v is used to generate the coefficients). Linear models like ICA are unable to recover these statistical regularities.
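The gradient used when optimizing B with v̂ fixed follows from equations 2.4 and 4.6; the gaussian prior on B contributes a weight-decay term. The sketch below is our own derivation (names are ours), verified against a finite-difference check:

```python
import numpy as np
from math import gamma, sqrt

def energy_and_grad(B, u, v, q=1.0, sigma=1.0):
    # E(B) = -log p(u|B,v) - log p(B) up to constants:
    #   sum_i ( [Bv]_i + |u_i/lam_i|**q ) + |B|^2 / (2*sigma**2),
    # with lam = c*exp(Bv).  dE/dB = (1 - q*|u/lam|**q) v^T + B/sigma**2.
    c = sqrt(gamma(1.0 / q) / gamma(3.0 / q))
    a = B @ v
    r = np.abs(u / (c * np.exp(a))) ** q
    E = np.sum(a + r) + np.sum(B ** 2) / (2.0 * sigma ** 2)
    g = np.outer(1.0 - q * r, v) + B / sigma ** 2
    return E, g

# Finite-difference check of the analytic gradient.
rng = np.random.default_rng(2)
B = 0.1 * rng.normal(size=(6, 3))
u, v = rng.laplace(size=6), rng.laplace(size=3)
E0, g = energy_and_grad(B, u, v)
eps = 1e-6
B2 = B.copy(); B2[0, 0] += eps
E1, _ = energy_and_grad(B2, u, v)
assert abs((E1 - E0) / eps - g[0, 0]) < 1e-3
```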
As a control, we adapted the density component model to a pure noise data set in which coefficients u were random samples from independent sparse distributions. In this case, no regularities in the magnitudes of coefficients existed, and the resulting density components consisted of small, random values.
5 Discovering Structure in Complex Data

5.1 Learned Density Components. We optimized model parameters on several data sets and analyzed the learned density components. For computational simplicity, the model was optimized in two stages in all the simulations. First, a complete linear basis A was adapted to the data using standard methods; next, the density component matrix B was optimized on the coefficients of the fixed A. Since the linear basis functions were learned using standard ICA methods, our analysis and discussion here is limited to the recovered matrix of density components. The density components were initialized to small random values, and gradient ascent was performed on stochastically sampled batches of data. The maximum a posteriori estimate v̂ was obtained using 20 steps of gradient ascent. Convergence of the gradient procedures for the optimization of B and the estimation of v̂ was tested in a number of ways, including varying the step size, the number of iterations, and the initial conditions. The given optimization parameters yielded reasonable speed and accuracy, as well as consistent solutions for different random initial conditions.

We first applied the learning algorithm to small (20 × 20) image patches sampled from a standard set of 10 gray-scale images of natural scenes (Olshausen & Field, 1996; Karklin & Lewicki, 2003). We used a complete set of 400 linear basis functions. The number of density components was set to 100 (although the algorithm is able to recover any number that yields a sparse distribution for the coefficients v). We used batches of 1000 samples for 35,000 iterations of gradient ascent with a fixed step size of 0.3.

Statistical regularities of the data ensemble are captured in the matrix of density components. In order to analyze the structure described by this matrix, we need to examine its weights as they relate to the basis functions whose distributions they affect.
(Recall that each weight in a density component vector specifies how a particular p(ui ) is rescaled). The initial ordering of basis functions in the learned matrix A is arbitrary; hence, weights in B also appear random in their original ordering. However, we can rearrange the weights in B according to some property of the linear basis functions and examine whether the learned density components capture structure related to the chosen property. For example, ICA basis functions adapted to natural images are spatially localized; arranging density component weights according to the location of corresponding basis functions within the image patch reveals patterns in their organization (see Figure 6). Thus, density components that appear structured in this arrangement specify dependence among spatially related linear basis functions. As parameters in the generative model, they describe common data distributions that reflect localized image structure. Some density components also appear random when arranged spatially, but these often show organization along other dimensions of the lower-order representation, such as orientation or spatial frequency (Karklin & Lewicki, 2003). Changing the number of density components
does not affect the type of structure captured by the hierarchical model. A larger number of density components allows the model to represent finer-scale spatial regularities, as well as other statistical structure that is not as obvious to interpret.

We also applied the model to speech data from the TIMIT database. Linear basis functions were adapted to bandpass-filtered speech segments of 256 samples (16 msec of 16 kHz sound). The number of density components was set to 100, and the parameters were optimized using stochastic learning on data batches of 1000 for 10,000 iterations. A representative set of the learned density components is shown in Figure 7. In order to display the weights in the density components as they relate to the linear code, we first computed the Wigner distributions (WDs) of the linear basis functions using the DiscreteTFDs Matlab package (O'Neill, 1999). The Wigner distribution of a basis function is a surface in the time-frequency space; we took a contour at 95% of peak value for each basis function and drew all these contours on a single time-frequency plot (time on the horizontal axis, 0 to 16 msec, and frequency on the vertical axis, 0 to 8 kHz). Because the linear basis functions adapted to speech tile most of the time-frequency space, the contours also exhibit relatively even tiling of the plots. In Figure 7, nine WD plots show the weights in nine density components to the same set of linear basis functions. Here, as in the image density components, the shading of each patch corresponds to the value of the weight. Some density components
Figure 5: Facing page. The model correctly recovers the density components used to generate synthetic data. We constructed a 50 × 10 matrix B composed of 10 cosine-shaped density components (a). After 3000 iterations, the model recovers (up to a permutation) the correct density components (b). (c) The generative and inference steps of the algorithm. (1) Three 10-dimensional density component coefficients v^(i) are drawn from a sparse distribution; (2) each v^(i) specifies a vector of scaling variables λ^(i) through the nonlinear transformation λ^(i) = c exp([Bv]^(i)). (3) The scaling variables are hyperparameters for nonstationary distributions p(u), from which data samples u are drawn. In order to emphasize that each vector of scaling variables λ^(i) specifies a distribution, not fixed values of u, we plotted several u's drawn from the distribution p(u|λ^(i)). In the actual simulation, each data point was generated independently. Using the learned density components, estimates of (4) v̂ and (5) λ̂ were obtained for each data sample. Because the inference problem involves the estimation of density parameters from single data points, v̂ and λ̂ only approximately match the true parameters. Although the complete hierarchical model includes another transformation, x = Au, the projection to data space x is linear and is not necessary for inference of v̂ when the coefficients u are known. The scatter plot of 1000 samples of u_1 and u_2 drawn from the model (d) shows that there is no linear dependence among basis function coefficients.
Figure 6: Density components optimized on an ensemble of 20 × 20 image patches drawn from natural scenes. Each column of B is represented here as a square; its weights to 400 image basis functions are plotted as dots, placed in locations corresponding to the center of each image basis function in the image patch. Each dot is colored according to the value of the weight, with white indicating positive weights, black negative weights, and gray weights that are close to zero. Most density components describe spatial relationships and capture coactivation of linear basis functions localized to a particular area of the image patch. For example, the density component in the second row, second column indicates whether contrast in the image patch is localized to the top or the bottom half. While most density components represent location, orientation, or spatial frequency regularities, the organization of some is not obvious.
describe coactivation of linear basis functions of adjacent frequency bands, while others are localized in time within the sample window. Most density components capture periodic higher-order structure and regularities across
Figure 7: A subset of density components of speech. The weights in a column of B are plotted as shaded patches in one of the nine panels. Each patch is placed according to the temporal and frequency distribution of the associated linear basis function and shaded according to the value of the weight, with white indicating positive weights, black negative weights, and gray weights that are close to zero. The axes represent time, 0 to 16 msec, horizontally, and frequency, 0 to 8 kHz, vertically. The density components form a distributed representation of the frequency of the signal and the location of energy within the sample window. Density components coding for multiple frequencies might capture harmonic regularities in the speech signal (see the text for details).
multiple frequencies or time intervals, and a few are tuned specifically to subtle shifts in dominant frequency over the sample window.

5.2 Higher-Order Code. In order to better understand the type of structure captured by the model, it is informative to look at the higher-order code, the coefficients of density components, and the statistical regularities it represents. Individual density component coefficients indicate the presence, in each data sample, of the type of structure represented in Figure 6. As a distributed code, their joint activity describes the data density whose shape reflects underlying structure in the data.

Figure 6 shows that among other statistical regularities, the higher-order code captures spatial relationships in the data. How does this representation compare to the lower-level, linear code for image structure? The activity of density component coefficients over contiguous regions of the data suggests that the higher-order representation captures more abstract properties of the data (see Figure 8). When a sliding window is applied to a natural scene image, the resulting lower-level representation changes rapidly from sample to sample, as would be expected from what are essentially outputs of linear filters. The higher-order representation varies more slowly over the image and captures more invariant properties of the data, such as overall image contrast or the dominance of certain spatial frequencies.

Also shown in Figure 8 are the values of the linear and the density component coefficients for a model trained on images of newspaper text. Here too the density component coefficients describe more abstract properties: several combine to form a distributed representation of text-line position in the image patch (the activity of one such coefficient is shown in the first panel of Figure 8f), while others represent commonly observed structures in the data, such as recurring shapes of letters or blank spaces between words.
Applied to audio data, the model also captures more abstract properties of the stimulus. In Figure 9a, we plot an example audio signal, along with the activities of three linear coefficients in Figure 9b and three density component coefficients in Figure 9c. We emphasize that, as for the images, the model is trained on segments drawn randomly from the data set, and the values of the coefficients for each sample position in the signal shown in the figure are determined independently. The higher-order representation varies more slowly than the responses of the linear filters and captures structural elements that extend well beyond the small sampling window. This may reflect a general property of natural signals: fast fluctuations in their exact values are caused by interactions of underlying physical properties, which themselves change more slowly.

5.3 Modeling Residual Dependencies. The motivating examples (see Figures 1 and 2) showed specific types of residual dependencies among the "independent" linear coefficients, such as the dependence among the scales of coefficients, which formed patterns that changed from context to context.
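The kind of residual dependence at issue can be reproduced in a few lines (a self-contained toy, not the paper's data): two coefficients that share a fluctuating scale are linearly uncorrelated, yet their magnitudes covary, and dividing out the scale removes the dependence.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000
# A shared, fluctuating scale (the "context") modulates two coefficients
# that are otherwise independent and zero mean.
lam = np.exp(0.5 * rng.normal(size=n))
u1 = lam * rng.laplace(size=n)
u2 = lam * rng.laplace(size=n)

corr = np.corrcoef(u1, u2)[0, 1]                     # near zero
mag_corr = np.corrcoef(np.abs(u1), np.abs(u2))[0, 1]  # clearly positive
norm_corr = np.corrcoef(np.abs(u1 / lam),
                        np.abs(u2 / lam))[0, 1]       # near zero again
```

This is exactly the signature the density components are meant to capture: no linear correlation, strong magnitude dependence, and stationarity after normalizing by the scale.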
Figure 8: The higher-order code captures more abstract properties of image data and therefore forms a more invariant representation than the coefficients of linear basis functions. We trained the model on natural images (a–c) and scanned newspaper clippings (d–f) and analyzed the representation formed by the model as it varied over the images. A sliding window (represented as white squares in the images) was applied over contiguous sections of the training data (a, d), and the values of three linear coefficients u_i (b, e) and three higher-order coefficients v_j (c, f) were plotted as they varied over the signal. White represents large positive values, black large negative values, and gray zeros. Although the model is trained on image patches selected randomly from the data set, the higher-order code forms a representation that changes more slowly over space and captures properties of the data that extend beyond the sampling window, such as the overall contrast in natural images or the position of the text line in newspaper images.
The adapted hierarchical density component model is able to capture these dependencies. First, drawing from the model generates data with similar
Figure 9: The higher-order representation formed by the hierarchical model trained on speech data is more invariant than simple outputs of linear filters. A sliding window was applied to a speech signal (a; size of window indicated by a short bar). At each point, the linear basis function coefficients u were computed (b) and the higher-order coefficients v were inferred (c). Values of v change slowly and represent more abstract properties, such as the presence of silence or the onset of vocalization. Only three examples for u and v are shown.
statistical regularities. Furthermore, the higher-order representation in the model defines an implicit normalization of the linear code, and the residual dependencies are no longer observed in the normalized code. Figure 10a shows the empirical joint distributions (top row) of two linear coefficients when sampled from the image regions R1 or R2 of Figure 1. In the two contexts, the shape of the distribution is different: the coefficients have high variance in one context but not in the other. The statistical properties in the two contexts are captured by the inferred density component coefficients. Fixing the density component coefficients to the values inferred in each context and sampling the linear coefficients reveals the same type of statistical structure (middle row). At the same time, it is possible to use the estimated parameters of the generating distribution to normalize the data. Dividing the linear coefficients by the estimated scale parameters λ̂ results in joint distributions that are symmetric, with uniform variance across different contexts and image regions (bottom row).

Another way to observe dependence among coefficient magnitudes is to draw a conditional histogram that plots distributions of one coefficient conditional on different values of another (Simoncelli, 1997; Schwartz & Simoncelli, 2001). While the joint histograms show that coefficient magnitudes are dependent on the sampling context, conditional histograms reveal pair-wise dependencies between coefficients across all contexts. For natural images, most linear coefficients show a positive magnitude dependence; the
Figure 10: Dependence in the magnitudes of linear basis function coefficients is captured by the density component model. (a) The joint distributions of linear coefficients are different in the two image regions from Figure 1; that is, the data distribution is not stationary. Sampling from the model under the estimated higher-order representation of each context results in similar distributions. Normalizing the image data by the estimated scale parameters, ū_i = u_i/λ̂_i, eliminates the nonstationarity. (b) Over the full data ensemble, empirical conditional histograms for pairs of coefficients show statistical dependencies in the magnitude. Sampling from the model adapted to this data ensemble produces similar dependencies, and normalizing by the estimated scale parameters removes the magnitude correlations. See the text for more details.
magnitude of one coefficient is positively correlated with the magnitude of another (e.g., the left pair in Figure 10b), but some exhibit the reverse pattern. Sampling from the model produces data with the same statistical dependencies (see Figure 10b, middle row), while normalized linear coefficients show no conditional magnitude dependence (see Figure 10b, bottom row).

Joint and conditional histograms illustrate pair-wise structure in the linear coefficients; global patterns in coefficients, such as those observed in Figures 1 and 2, are also captured by the model. In the top row of Figure 11, we replot the statistics from Figure 2 that show variance patterns in different regions of the speech signal. In the bottom row, we plot the same statistics for the coefficients normalized by the estimated scale parameters; after normalization, the statistics are stationary, and the coefficients are identically distributed. The same global normalization effect is observed for natural images (plots not shown).

Figure 11: The model accounts for nonstationary statistics of coefficients. (Top row) Log variance of u for the three regions in the speech signal from Figure 2b. Each plot shows the log variance of 256 basis functions, sorted by dominant frequency (replotted from Figure 2c). (Bottom row) Log variance of the normalized basis function coefficients ū_i = u_i/λ̂_i.

6 Discussion

Some previous work has focused on extending linear probabilistic models. Mixtures of linear ICA models have been used to describe high-dimensional, nongaussian data drawn from distinct classes (Lee, Lewicki, & Sejnowski, 2000; Lee & Lewicki, 2002). In this approach, the number of classes is specified in advance, and an optimal linear basis is learned for each class. This nonlinear generative model describes different data distributions for different classes, but its higher-order representation is fundamentally local and does not scale well in domains where the variation in higher-order structure is continuous and high-dimensional. A key problem addressed by the model presented here is the presence and interaction of multiple intrinsic structures, and this is achieved by a continuous, distributed higher-order code.

Other models have extended ICA to handle nonstationary data distributions. Everson and Roberts (1999) proposed a model in which ICA basis functions evolve with time as a first-order Markov diffusion process. Similarly, Pham and Cardoso (2001) developed and Choi, Cichocki, and Belouchrani (2002) extended algorithms for nonstationary models in which
the variances of the sources modulate slowly in time. These are also related to models of time-varying mean and variance in economics (Bollerslev, Engle, & Nelson, 1994) and typically model data whose statistics change slowly over time or space. Alternatively, one can describe the variance with a sparse but temporally coherent latent variable (Hyvärinen, Hurri, & Väyrynen, 2003). However, in many cases, real-world data are subject to both smooth and abrupt changes that do not follow diffusion dynamics or smooth amplitude modulation. In contrast to these approaches, the density component model makes no assumptions of temporal or spatial smoothness. It infers an optimal generating distribution for each data sample based on only the values of that sample, though the inference process is constrained by parameters adapted to the statistical regularities of the entire data ensemble. Thus, it is able to capture both smooth and abrupt changes in the underlying structure.

Another approach to capturing intrinsic structures in the data has been to incorporate a specific nonlinearity, such as the sum of squares (Krüger, 1998; Hoyer & Hyvärinen, 2002) or sigmoid functions (Lee, Koehler, & Orglmeister, 1997). The drawback to these models is that the type of structure learned is limited by the specific choice of the nonlinearity. Most of these methods also assume a fixed linear representation (e.g., a set of oriented, localized 2D basis functions for image models), and those that adapt the linear representation assume a more constrained form of the nonlinear dependence (see below). In the model presented here, the linear basis is adapted to the data and maximizes the statistical independence of the linear representation. This ensures that the statistical regularities captured by the higher-order code represent fundamentally nonlinear dependencies rather than residual dependence resulting from the choice of a suboptimal linear basis. Furthermore, in some applications, there is no clear choice of linear representation (such as Gabor filters or wavelets in image processing); in such cases, it is sensible to derive the linear code from the statistics of the data.

Several earlier models have explicitly represented the dependence among coefficients of linear basis functions. In the subspace ICA model (Hyvärinen & Hoyer, 2000), the linear basis functions are grouped into neighborhoods and adapted to maximize the independence of the vector norms of the neighborhoods. Basis functions within a neighborhood are no longer assumed to be independent; in fact, the energies of their coefficients are correlated. In the more generalized form of the model, called topographic ICA, the disjoint sets of dependent basis functions are replaced by a topographic arrangement that defines magnitude dependencies among basis functions (Hyvärinen, Hoyer, & Inki, 2001). The generative forms of subspace ICA and topographic ICA can be interpreted as more constrained versions of the density component model presented here. Neighborhood or topographic dependencies can be equivalently represented by density components whose weights are specified in advance to reflect tree-dependent or topographic relationships. The density component model, however, places no such constraints on the
higher-order representation; thus, density components adapted to the data can capture nontopographic dependencies as well.

A related set of work has attempted to model the dependence among coefficients of a fixed linear transform, such as a multiscale wavelet decomposition. Romberg, Choi, and Baraniuk (2001) used a set of discrete latent variables, propagated along a multiscale wavelet tree, to describe the distribution of each wavelet coefficient. The transition probabilities of the latent states were adapted to match the scale dependencies between adjacent nodes in the tree. Buccigrossi and Simoncelli (1999) computed a linear predictor of scale for each coefficient as a function of the magnitudes of its neighbors. Wainwright, Simoncelli, and Willsky (2001) extended this approach by modeling the wavelet coefficients as observed variables in a gaussian scale mixture, in which random gaussian variables are multiplied by latent scaling variables. Dependence among coefficients adjacent on the wavelet tree is captured through the structure of a gaussian process defined on the scaling variables. In addition to its reliance on a fixed linear representation (the drawbacks of this are outlined above), this model is limited in that it can describe only pairwise dependencies between variables adjacent on the wavelet tree. Adapting a model to learn global statistical regularities, as opposed to local representations of class structure or pairwise dependence, allows it to capture a wider range of intrinsic structures. Also, learning an efficient basis to describe these dependencies facilitates their interpretability and provides a better fit to the underlying structure.

7 Conclusion

We have introduced a hierarchical, generative Bayesian model that can be considered a nonlinear extension of ICA. It uses parametric density estimation to learn statistical regularities from the data and makes no assumptions about the type of structure it expects to find.
The model is general; it is not specific to any domain and can be applied to any data set with rich statistical structure. Because the model forms distributed representations at all levels of its hierarchy, it scales well to high-dimensional data. Adapted to patches from natural images or samples from speech data, the density component model was able to learn nonlinear statistical regularities. It yielded a distributed representation of context, which included higher-order spatial relationships for image data and frequency and harmonic structure for audio data. Sampling from the model produced data with the same statistical regularities observed in the training data sets, and the model's implicit normalization of the lower-order code accounted for the residual dependencies observed in various data sets. Recently, it has been argued that higher-order properties of natural signals change slowly across time or space and that this spatial and temporal coherence can be used to extract higher-order structure from the data (Foldiak, 1991; Kayser, Einhäuser, Dümmer, König, & Körding, 2001; Wiskott & Sejnowski, 2002; Hurri & Hyvärinen, 2003). We show that in some cases, simply learning higher-order statistical regularities in the data leads the model to recover more abstract properties that tend to vary slowly with time or space. This raises the possibility that the explicit computational goal of extracting coherent (slowly changing) parameters is helpful, but not necessary, for learning intrinsic structures that underlie the variation in the data. One result of learning global statistical regularities is that the learned structure is not necessarily obvious; for example, density components adapted to natural images describe a variety of statistical regularities, some of which are not easily interpreted. This is true for many unsupervised learning models that do not specify in advance the structure to be learned. For example, ICA applied to natural images yields a matrix of basis functions whose functional interpretation has ranged from edge detectors (Bell & Sejnowski, 1997) to models of biological sensory systems (van Hateren & van der Schaaf, 1998). The work presented here suggests that as more powerful unsupervised learning models are developed, the analysis of learned parameters and data representations will gain in importance. The approach taken in this work is to attack incrementally a difficult problem: capturing intrinsic regularities in complex high-dimensional data. Although the model is able to capture some nonlinear statistical regularities, the structure it learns is still quite low level. This step-wise approach stands in contrast to other computational schemes that solve specific problems, such as perceptual invariance or scene segmentation. It may prove more tractable and robust because it does not rely on preconceived notions of intrinsic structures but learns them from the data.
This approach might also give more insight into the organization of biological perceptual systems, where each processing unit performs a relatively simple computational task, and many computational goals might be achieved incrementally and in parallel.

Appendix

The value of \hat{v} for a given u was obtained by maximizing the log posterior distribution

L = \log p(v|u, B) \propto \log p(u|B, v)\, p(v). \quad (A.1)

We use the Laplace distribution for the prior on v and a generalized gaussian distribution with scale parameters \lambda for the likelihood p(u|B, v), so that

L \propto \log \prod_{i=1}^{N} z_i \exp\left(-\left|\frac{u_i}{\lambda_i}\right|^{q_i}\right) \prod_{j=1}^{M} z_j \exp\left(-\left|\frac{v_j}{c}\right|^{q_j}\right) \quad (A.2)

\propto \sum_{i=1}^{N} \left[ \log \frac{q_i}{2\lambda_i \Gamma(1/q_i)} - \left|\frac{u_i}{\lambda_i}\right|^{q_i} \right] + \sum_{j=1}^{M} \left[ \log \frac{q_j}{2c\,\Gamma(1/q_j)} - \left|\frac{v_j}{c}\right|^{q_j} \right] \quad (A.3)

\propto \sum_{i=1}^{N} \left( -\log \lambda_i - \left|\frac{u_i}{\lambda_i}\right|^{q_i} \right) - \sum_{j=1}^{M} \left|\frac{v_j}{c}\right|^{q_j}, \quad (A.4)

where z = q/(2\lambda \Gamma(1/q)) is the normalization term, \lambda_i = c\, e^{[Bv]_i}, and c = \sqrt{\Gamma(1/q)/\Gamma(3/q)}. For a given data sample, u is the N × 1 vector of linear basis function coefficients and v the M × 1 vector of density component coefficients. A is the N × N matrix of linear basis functions, and B is the N × M matrix of density components. We use [Bv]_i to denote the ith element of the vector Bv, and B_i to denote the ith row of the matrix B. The MAP estimate \hat{v} was obtained by gradient ascent,

\frac{\partial L}{\partial v_j} = \frac{\partial}{\partial v_j} \left[ \sum_{i=1}^{N} \left( -\log \lambda_i - \left|\frac{u_i}{\lambda_i}\right|^{q_i} \right) - \sum_{j=1}^{M} \left|\frac{v_j}{c}\right|^{q_j} \right] \quad (A.5)

= \sum_{i=1}^{N} \left( -B_{ij} + q_i B_{ij} \left|\frac{u_i}{c\, e^{[Bv]_i}}\right|^{q_i} \right) - \mathrm{sign}(v_j)\, q_j \frac{|v_j|^{q_j - 1}}{c^{q_j}}. \quad (A.6)

The gradient ascent procedure was sensitive to initial conditions and in some cases did not converge to a solution. We tried several alternatives, including a closed-form approximation to the MAP estimate. Ultimately, the most effective learning method was to adjust the step size by the stochastic estimate of the Hessian over each batch of data (LeCun et al., 1998):

\eta_j = \frac{\epsilon}{\left| \left\langle \partial^2 L / \partial v_j^2 \right\rangle \right| + \mu}, \quad (A.7)

where µ is a small constant that improves stability when the second derivative is very small. The second derivative for a data sample is given by

\frac{\partial^2 L}{\partial v_j^2} = -\sum_{i=1}^{N} q_i^2 B_{ij}^2 \left|\frac{u_i}{c\, e^{[Bv]_i}}\right|^{q_i} - q_j (q_j - 1) \frac{|v_j|^{q_j - 2}}{c^{q_j}}. \quad (A.8)

The density component matrix B was estimated by maximizing the posterior over the data ensemble,

\log p(B|x_1, \ldots, x_N, A) \propto \log p(x_1, \ldots, x_N|A, B)\, p(B) \quad (A.9)

\propto \sum_n \log p(x_n|A, B)\, p(B) \quad (A.10)

\propto \sum_n \log p(u_n|B, \hat{v}_n)\, p(\hat{v}_n)\, p(B) / |\det A|. \quad (A.11)

Let \hat{L}_n = \log p(u_n|B, \hat{v}_n)\, p(\hat{v}_n)\, p(B). We place a gaussian prior on B and implement gradient ascent \Delta B_{ij} = \frac{1}{N} \sum_n \partial \hat{L}_n / \partial B_{ij}, where the gradient for each data sample x_n is

\frac{\partial \hat{L}}{\partial B_{ij}} = \frac{\partial}{\partial B_{ij}} \left[ \log p(u_n|B, \hat{v}_n) + \log p(\hat{v}_n) + \log p(B) \right] \quad (A.12)

= \frac{\partial}{\partial B_{ij}} \left[ \sum_{i=1}^{N} \left( -\log \lambda_i - \left|\frac{u_i}{\lambda_i}\right|^{q_i} \right) - \sum_{j=1}^{M} \left|\frac{v_j}{c}\right|^{q_j} - \sum_{i=1,j=1}^{N,M} \frac{B_{ij}^2}{2} \right] \quad (A.13)

= -v_j + v_j q_i \left|\frac{u_i}{c\, e^{[Bv]_i}}\right|^{q_i} - B_{ij}. \quad (A.14)
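The inference equations above can be checked numerically. The sketch below (Python/NumPy; the function names and the choice of a Laplace prior exponent q_j = 1 are ours, not the paper's) implements the log posterior of equation A.4, the gradient of equation A.6, the diagonal Hessian of equation A.8, and gradient ascent with the Hessian-scaled step of equation A.7:

```python
import numpy as np

def log_posterior(u, v, B, q, c):
    # L of equation A.4 with a Laplace prior on v (q_j = 1); lambda_i = c * exp([Bv]_i)
    lam = c * np.exp(B @ v)
    return np.sum(-np.log(lam) - np.abs(u / lam) ** q) - np.sum(np.abs(v / c))

def grad_v(u, v, B, q, c):
    # equation A.6: sum_i B_ij (q_i |u_i/lam_i|^{q_i} - 1) - sign(v_j)/c
    lam = c * np.exp(B @ v)
    r = np.abs(u / lam) ** q
    return B.T @ (q * r - 1.0) - np.sign(v) / c

def hess_diag_v(u, v, B, q, c):
    # equation A.8; the Laplace prior term (q_j = 1) drops out of the second derivative
    lam = c * np.exp(B @ v)
    r = np.abs(u / lam) ** q
    return -(B ** 2).T @ (q ** 2 * r)

def map_estimate(u, B, q, c, eps=0.1, mu=0.05, n_iter=200):
    # gradient ascent with the per-coordinate step size of equation A.7
    v = np.full(B.shape[1], 0.01)
    for _ in range(n_iter):
        eta = eps / (np.abs(hess_diag_v(u, v, B, q, c)) + mu)
        v = v + eta * grad_v(u, v, B, q, c)
    return v
```

With the Laplace prior, the second term of equation A.8 vanishes, which is why `hess_diag_v` carries only the likelihood term.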
Acknowledgments

We thank Bruno Olshausen and Eero Simoncelli for helpful discussions. This work was supported by a Department of Energy Computational Science Graduate Fellowship to Y.K. and National Science Foundation grant no. 0238351 to M.S.L.

References

Bell, A. J., & Sejnowski, T. J. (1995). An information maximization approach to blind separation and blind deconvolution. Neural Computation, 7(6), 1129–1159.
Bell, A. J., & Sejnowski, T. J. (1997). The "independent components" of natural scenes are edge filters. Vision Res., 37(23), 3327–3338.
Bollerslev, T., Engle, R. F., & Nelson, D. B. (1994). ARCH models. In R. F. Engle & D. L. McFadden (Eds.), Handbook of econometrics. Amsterdam: Elsevier.
Buccigrossi, R. W., & Simoncelli, E. P. (1999). Image compression via joint statistical characterization in the wavelet domain. IEEE Transactions on Image Processing, 8(12), 1688–1701.
Cardoso, J.-F. (1997). Infomax and maximum likelihood for blind source separation. IEEE Signal Processing Letters, 4, 109–111.
Choi, S., Cichocki, A., & Belouchrani, A. (2002). Second order nonstationary source separation. Journal of VLSI Signal Processing, 32(1–2), 93–104.
Cichocki, A., & Amari, S.-I. (2002). Adaptive blind signal and image processing: Learning algorithms and applications. New York: Wiley.
Everson, R., & Roberts, S. (1999). Non-stationary independent component analysis. In Proceedings of the 8th International Conference on Artificial Neural Networks (pp. 503–508). Berlin: Springer-Verlag.
Foldiak, P. (1991). Learning invariance from transformation sequences. Neural Computation, 3(2), 194–200.
Hoyer, P. O., & Hyvärinen, A. (2002). A multi-layer sparse coding network learns contour coding from natural images. Vision Res., 42, 1593–1605.
Hurri, J., & Hyvärinen, A. (2003). Simple-cell-like receptive fields maximize temporal coherence in natural video. Neural Computation, 15(3), 663–691.
Hyvärinen, A., & Hoyer, P. (2000). Emergence of phase- and shift-invariant features by decomposition of natural images into independent feature subspaces. Neural Computation, 12, 1705–1720.
Hyvärinen, A., Hoyer, P. O., & Inki, M. (2001). Topographic independent component analysis. Neural Comput., 13, 1527–1558.
Hyvärinen, A., Hurri, J., & Väyrynen, J. (2003). Bubbles: A unifying framework for low-level statistical properties of natural image sequences. Journal of the Optical Society of America A, 20(7), 1237–1252.
Hyvärinen, A., Karhunen, J., & Oja, E. (2001). Independent component analysis. New York: Wiley.
Karklin, Y., & Lewicki, M. (2003). Learning higher-order structures in natural images. Network: Computation in Neural Systems, 14, 483–499.
Kayser, C., Einhäuser, W., Dümmer, O., König, P., & Körding, K. (2001). Extracting slow subspaces from natural videos leads to complex cells. Artificial Neural Networks, 2130, 1075–1080.
Krüger, N. (1998). Collinearity and parallelism are statistically significant second order relations of complex cell responses. Neural Processing Letters, 8, 117–129.
LeCun, Y., Bottou, L., Orr, G., & Müller, K. (1998). Efficient backprop. In G. Orr & K. Müller (Eds.), Neural networks: Tricks of the trade. New York: Springer.
Lee, T.-W., Koehler, B., & Orglmeister, R. (1997). Blind source separation of nonlinear mixing models. In Proceedings of IEEE International Workshop on Neural Networks for Signal Processing.
Lee, T.-W., & Lewicki, M. S. (2002). Unsupervised classification, segmentation and de-noising of images using ICA mixture models. IEEE Trans. Image Proc., 11(3), 270–279.
Lee, T.-W., Lewicki, M. S., & Sejnowski, T. J. (2000). ICA mixture models for unsupervised classification of non-Gaussian sources and automatic context switching in blind signal separation. IEEE Trans. on Pattern Analysis and Machine Intelligence, 22(10), 1078–1089.
Olshausen, B. A., & Field, D. J. (1996). Emergence of simple-cell receptive-field properties by learning a sparse code for natural images. Nature, 381, 607–609.
O'Neill, J. C. (1999). DiscreteTFDs time-frequency analysis software. Available online at: http://tfd.sourceforge.net/.
Pearlmutter, B. A., & Parra, L. C. (1996). A context-sensitive generalization of ICA. In International Conference on Neural Information Processing (pp. 151–157). New York: Springer.
Pham, D.-T., & Cardoso, J.-F. (2001). Blind separation of instantaneous mixtures of non stationary sources. IEEE Trans. Signal Processing, 49(9), 1837–1848.
Romberg, J., Choi, H., & Baraniuk, R. (2001). Bayesian tree-structured image modeling using wavelet domain hidden Markov models. IEEE Transactions on Image Processing, 10(7), 1056–1068.
Schwartz, O., & Simoncelli, E. P. (2001). Natural signal statistics and sensory gain control. Nat. Neurosci., 4, 819–825.
Simoncelli, E. P. (1997). Statistical models for images: Compression, restoration and synthesis. In Proc. 31st Asilomar Conference on Signals, Systems and Computers. Pacific Grove, CA.
van Hateren, J. H., & van der Schaaf, A. (1998). Independent component filters of natural images compared with simple cells in primary visual cortex. Proc. Royal Soc. Lond. B, 265, 359–366.
Wainwright, M. J., Simoncelli, E. P., & Willsky, A. S. (2001). Random cascades on wavelet trees and their use in analyzing and modeling natural images. Applied Computational and Harmonic Analysis, 11, 89–123.
Welling, M., Hinton, G. E., & Osindero, S. (2003). Learning sparse topographic representations with products of Student-t distributions. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15. Cambridge, MA: MIT Press.
Wiskott, L., & Sejnowski, T. J. (2002). Slow feature analysis: Unsupervised learning of invariances. Neural Computation, 14(4), 715–770.

Received December 23, 2003; accepted July 2, 2004.
LETTER

Communicated by Aapo Hyvärinen

Extended Gaussianization Method for Blind Separation of Post-Nonlinear Mixtures

Kun Zhang
[email protected]
Lai-Wan Chan
[email protected]
Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong
The linear mixture model has been investigated in most articles tackling the problem of blind source separation. Recently, several articles have addressed a more complex model: blind source separation (BSS) of post-nonlinear (PNL) mixtures. These mixtures are assumed to be generated by applying an unknown invertible nonlinear distortion to linear instantaneous mixtures of some independent sources. The gaussianization technique for BSS of PNL mixtures emerged from the assumption that the distribution of a linear mixture of independent sources is gaussian. In this letter, we review the gaussianization method and then extend it to apply to PNL mixtures in which the linear mixture is close to gaussian. Our proposed method approximates the linear mixture using the Cornish-Fisher expansion. We choose mutual information as the independence measure to develop a learning algorithm that separates PNL mixtures. This method provides better applicability and accuracy. We then discuss a sufficient condition for the method to be valid. The characteristics of the nonlinearity do not affect the performance of this method. With only a few parameters to tune, our algorithm has a comparatively low computational cost. Finally, we present experiments to illustrate the efficiency of our method.

1 Introduction

Blind source separation (BSS) aims at identifying and separating signals from their mixtures by linear or nonlinear transformations. It has received more and more attention due to its potential applications in signal processing areas such as speech recognition, telecommunications, and medical signal processing. Most BSS algorithms are based on the theory of independent component analysis (ICA).
As a generative model, ICA aims to find the independent components from the mixture of statistically independent sources by optimizing different criteria (for review, see Hyvärinen, 1999b; Hyvärinen & Oja, 2000; Cardoso, 1998; Mansour, Barros, & Ohnishi, 2000;

Neural Computation 17, 425–452 (2005)
© 2005 Massachusetts Institute of Technology
426
K. Zhang and L.-W. Chan
Jutten & Taleb, 2000). In many cases, the terms independent component analysis and blind source separation mean essentially the same thing. ICA for linear instantaneous mixtures has been solved in many different ways (Jutten & Hérault, 1991; Amari & Cardoso, 1997; Oja, 1997; Hyvärinen, 1999a; Cardoso & Souloumiac, 1993; Pham & Garat, 1997; Comon, 1994; Pham, 2002; Bell & Sejnowski, 1995). Recently, some extensions of classical ICA have been investigated. These extensions are more realistic for some situations in the real world. One important extension is nonlinear ICA. As shown in Hyvärinen and Pajunen (1999), for general nonlinear ICA, there always exists an infinite number of solutions if the space of the nonlinear mixing functions is unlimited. The independent components extracted from the observations are not necessarily the true source signals. Furthermore, in general, nonlinear ICA suffers from high computational complexity. Solving the nonlinear BSS problem appropriately with only the independence assumption of the sources is possible only in some special cases, for example, when the mixtures are post-nonlinear (PNL) with some weak assumptions (Taleb & Jutten, 1999b). Due to the practicability and relative uniqueness of the solutions, blind separation of PNL mixtures plays an important role in the nonlinear ICA area. The nonlinear model in Yang, Amari, and Cichocki (1998) is similar to the PNL mixing model. Lee, Koehler, and Orglmeister (1997) proposed a set of algorithms for BSS of PNL mixtures using parametric nonlinear functions. They focused on a parametric sigmoidal nonlinearity and on high-order polynomials. In a series of papers (Taleb & Jutten, 1997, 1999a, 1999b; Jutten & Taleb, 2000), Taleb and Jutten approximated the inverse nonlinear function with multilayer perceptrons (MLPs).
Taleb and Jutten (1999b) exhibited their results on this topic and gave a general learning procedure based on the minimization of mutual information, given any parametric model for the inverse nonlinear functions. In Ziehe, Kawanabe, Harmeling, and Mueller (2001), alternating conditional expectation (ACE), a technique from nonparametric statistics, is used to approximately invert the nonlinear function. Because of the large residual nonlinear distortion that cannot be eliminated using this technique in some cases, the performance of this method is sensitive to the ICA algorithm in the linear stage. In fact, in this method, the linear mixture of independent sources is assumed to be gaussian (Breiman & Friedman, 1985). Under this assumption, the inverse nonlinear function can be constructed directly using the gaussianization technique (Ziehe, Kawanabe, Harmeling, & Mueller, 2003). Almost all existing methods for BSS of PNL mixtures impose a parametric model, such as an MLP or a given functional space, directly on the inverse of the nonlinear function generating the observations. Although multilayer feedforward networks are universal approximators (Hornik, Stinchcombe, & White, 1989), they may behave poorly for some peculiar nonlinearities. These methods are sensitive to the actual nonlinear functions used. In principle, when we use a parametric model to fit an arbitrary nonlinear function on which
Extended Gaussianization for Separation of PNL Mixtures
427
we have very little prior knowledge, either many parameters are needed or the constraint imposed on the actual function must be restrictive. In this letter, we consider the BSS problem for PNL mixtures when the distributions of the latent linear mixtures are close to gaussian. We do not construct the inverse nonlinearity directly, but recover the nonlinearity from the distributions of the linear mixtures. In the gaussianization method for BSS of PNL mixtures (Solé-Casals, Babaie-Zadeh, Jutten, & Pham, 2003; Ziehe et al., 2003), the distributions of the linear mixtures are simply approximated by the gaussian distribution. We first briefly review this method, discussing its foundation, its limitation, and some practical considerations in its implementation. Next, we propose an extended version of gaussianization to tackle the case when the distributions of the linear mixtures are not exactly gaussian but close to gaussian. Assuming the distributions of the linear mixtures are close to gaussian, this method models the linear mixtures by the Cornish-Fisher expansion and gives their distributions freedom to adapt in the vicinity of the standard gaussian distribution. In noiseless cases, its performance is not affected by the actual nonlinearities; it depends only on how close the estimated distributions of the latent linear mixtures are to their actual distributions. With few parameters to tune, this method is comparatively practical: it should be less prone to getting stuck at local optima, and it avoids the demanding computation other nonlinear ICA algorithms require in applications. The rest of this letter is organized as follows. In section 2, we give a brief introduction to the PNL mixing model and discuss its identifiability. The gaussianization method is reviewed in section 3. In section 4, we develop an extended version of gaussianization. In section 5, experimental results illustrate the performance of our method. Section 6 concludes the article.
2 Post-Nonlinear Mixing Model

Blind separation of PNL mixtures, which was introduced by Taleb and Jutten (1997), is a special case of nonlinear ICA. In this model, the sources s_i are first mixed linearly according to the basic ICA model, and after that, a nonlinear function f_i is applied to the linear mixture t_i to generate the observation x_i,

x_i = f_i(t_i) = f_i\left( \sum_{j=1}^{n} a_{ij} s_j \right), \quad (2.1)
where s_i are statistically independent sources, f_i are unknown invertible nonlinear functions, a_{ij} denote the entries of a regular m × n mixing matrix A, and t_i are the latent linear mixtures. The PNL assumption can be used to model a nonlinear sensor distortion in some signal processing applications.
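To make the model concrete, equation 2.1 can be simulated in a few lines. This is only an illustrative sketch: the Laplacian sources, the particular matrix A, and the distortion f(t) = t + tanh(t) are our choices, not the letter's.

```python
import numpy as np

rng = np.random.default_rng(0)
n, T = 2, 5000
S = rng.laplace(size=(n, T))              # independent nongaussian sources s_i
A = np.array([[1.0, 0.6],
              [0.5, 1.0]])                # regular n x n mixing matrix
T_lin = A @ S                             # latent linear mixtures t_i
f = lambda t: t + np.tanh(t)              # invertible nonlinear distortion f_i
X = f(T_lin)                              # observations x_i = f_i(t_i)
```

Since f'(t) = 1 + sech²(t) > 0, each distortion is strictly increasing and hence invertible, as the model requires.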
Figure 1: PNL mixing system and PNL demixing system. The mixing system consists of a linear mixing stage (with the mixing matrix A) and a nonlinear transform stage (applying f_i to the linear mixtures t_i). The demixing system does the inverse operation: a nonlinear transform stage followed by a linear demixing stage.
For simplicity, we have assumed there is no additive noise in this system. In the following discussion, the number of independent sources, n, is assumed to be equal to the number of observations, m. As a counterpart of the mixing model 2.1, the separation of PNL mixtures is also a two-stage procedure: a nonlinear stage followed by a linear demixing stage. Figure 1 shows the PNL mixing system and the demixing system, where the g_i are the elements of g used to invert the nonlinear mapping f, z has n elements z_i serving as estimates of the latent linear mixtures t_i, and W is a linear demixing matrix transforming z_i to y_i, the estimates of the independent sources s_i. Taleb and Jutten (1999b) showed that when A has at least two nonzero entries per row or per column and each s_i admits a density function that vanishes at at least one point, the output y has mutually independent components if and only if each g_i ∘ f_i is linear and W is a linear separating matrix for z. In other words, under these assumptions, the sources can be separated up to the same indeterminacies as in the linear ICA model: permutation and scaling. (In some work, for instance, Hyvärinen, Karhunen, & Oja, 2001, it is said that there is one more indeterminacy, named sign indeterminacy, which can be considered a special case of the scaling indeterminacy.) In fact, if the mean of s_i is unknown, one more indeterminacy emerges in the BSS problem of PNL mixtures, which we call the mean indeterminacy.
If the sources s_i are not zero mean, let s_i = s_i^{new} + E(s_i) and t_i = t_i^{new} + E(t_i). We have

x_i = f_i(t_i) = f_i\left( \sum_{j=1}^{n} a_{ij} (s_j^{new} + E(s_j)) \right) = f_i(t_i^{new} + E(t_i)) = f_i^{new}(t_i^{new}), \quad (2.2)

where f_i^{new} is a new function obtained by shifting the nonlinear function f_i (to the left if E(t_i) > 0, and to the right if E(t_i) < 0) by |E(t_i)| units. We can change the mean of t_i and at the same time shift the nonlinear function f_i by the corresponding number of units, while the observations x_i do not change. In other words, in the problem of blind separation of PNL mixtures, the mean of the latent random vector s, as well as the actual functions f_i, cannot be recovered without additional prior knowledge. In some cases, this indeterminacy may be trivial. Moreover, in practice, if the actual nonlinear distortion functions f_i or the actual values of s_i are required for further investigation, we can sometimes find hints, such as the intersection of the nonlinear function f_i with the x-axis, or prior knowledge of the distributions of s_i, to eliminate this indeterminacy. Even if the s_i are not zero mean while the y_i are zero mean, each g_i ∘ f_i is still affine. This is because the nonlinear block g in the demixing system is used to invert the unknown nonlinear function f^{new},

z_i = (g_i ∘ f_i)(t_i) = (g_i ∘ f_i^{new})(t_i^{new}) = k · t_i^{new} = k · t_i − k · E(t_i), \quad (2.3)
where k is a constant. In the following, in order to eliminate the scaling indeterminacy, we assume each function g_i is strictly monotonically increasing and that z_i and y_i are of unit variance. In order to eliminate the mean indeterminacy, the z_i are assumed to be zero mean.

3 Gaussianization for BSS of PNL Mixtures

Gaussianization (Chen & Gopinath, 2001) transforms a random variable into a standard gaussian random variable. In BSS of PNL mixtures, by approximating the distributions of the linear mixtures as gaussian, gaussianization can be used to recover the linear mixtures directly from the observations.

3.1 Review of the Central Limit Theorem. The central limit theorem (CLT), as one of the fundamentals of probability theory, provides an imprecise but intuitive way of measuring independence between nongaussian variables using their individual nonnormality. This has been investigated in the ICA literature. The CLT also inspires the application of gaussianization to BSS of PNL mixtures.
The CLT provides sufficient conditions for the asymptotic normality of normalized sums of random variables (Neuts, 1995). Assume X_i, i = 1, 2, ..., are independent variables with zero mean and variance \sigma_i^2, respectively, and let S_n = X_1 + X_2 + \cdots + X_n. The classical CLT says that if the X_i are independent and identically distributed, then S_n/(\sigma_1 n^{1/2}) is asymptotically normally distributed as n \to \infty (note that \sigma_1 = \sigma_2 = \cdots = \sigma_n). In another form of the CLT, it is claimed that S_n/s_n is asymptotically normally distributed under Lyapunov's condition that the E(|X_i|^3) are finite and that

\frac{1}{s_n^3} \sum_{i=1}^{n} E(|X_i|^3) \to 0 \quad \text{as } n \to \infty,

where s_n^2 = \sigma_1^2 + \sigma_2^2 + \cdots + \sigma_n^2. Lindeberg's condition is much weaker; it states that S_n/s_n is asymptotically normal if

\frac{1}{s_n^2} \sum_{j=1}^{n} \int_{|x| \ge \varepsilon s_n} x^2 \, dF_j(x) \to 0

for any \varepsilon > 0, where F_j(x) is the cumulative distribution function (cdf) of X_j (Neuts, 1995). In the latter two conditions, the constraint that the X_i are identically distributed is not necessary. We can roughly say that, in general, sums of independent variables tend to be close to gaussian. Of course, there are many exceptions, for instance, when the independent variables are very nongaussian and the number of independent variables is very small, or when one strong nongaussian source dominates the sum. Here, we focus on the case that linear mixtures of independent sources are comparatively close to gaussian.
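This tendency is easy to see numerically. In the sketch below (our construction, not from the letter), each X_i is uniform on (−1, 1), with excess kurtosis −1.2; for the normalized sum of n such terms, the excess kurtosis decays as −1.2/n:

```python
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(1)

def normalized_sum(n, T=200_000):
    # S_n / s_n for n iid Uniform(-1, 1) terms (zero mean, variance 1/3 each)
    X = rng.uniform(-1.0, 1.0, size=(n, T))
    return X.sum(axis=0) / np.sqrt(n / 3.0)

k1 = kurtosis(normalized_sum(1))   # near -1.2: far from gaussian
k8 = kurtosis(normalized_sum(8))   # near -0.15: much closer to gaussian
```

Already at n = 8, the sum is nearly indistinguishable from gaussian by its fourth-order statistics, which is the regime the gaussianization method exploits.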
3.2 BSS of PNL Mixtures with Gaussianization. Now we approximate the distributions of the recovered linear mixtures z_i by the standard gaussian distribution. Then the nonlinearity g_i is used to transform the variable x_i into the standard gaussian variable z_i. Assume the cdf of x_i is F_{x_i}(x_i), which can be estimated from observed samples. Let G be the cdf of the standard gaussian distribution:

G(z) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{z} e^{-t^2/2} \, dt.

BSS of PNL mixtures with gaussianization is a two-stage procedure (Solé-Casals et al., 2003; Ziehe et al., 2003):

1. Construct each inverse nonlinear function by gaussianization: g_i = G^{-1} ∘ F_{x_i}. The latent linear mixtures are estimated by z_i = g_i(x_i).
2. Use a linear ICA algorithm to extract independent components y_i from z: y = Wz. Then y is the estimate of s.

We address the following practical issues:

• The estimation of the cdf F_{x_i} can be done in a variety of ways. For instance, it can be constructed from the pdf estimated by kernel estimation. For simplicity, we use the empirical cumulative distribution function (ecdf) \hat{F}_{x_i}(x) instead, which is the proportion of samples less
than or equal to x; that is, \hat{F}_{x_i}(x) = \frac{1}{N+1} \sum_{i=1}^{N} I(x(i) \le x), where x(i) is the ith
sample of variable x, N is the total number of samples, and I(x(i) ≤ x) is an indicator function with value 1 if x(i) ≤ x and value 0 otherwise.

• The inverse of the cdf of the standard gaussian distribution, G^{-1}, can be calculated numerically. Or we can use an analytical form to approximate it (Abramowitz & Stegun, 1968). For 0.5 ≤ p < 1, G^{-1}(p) can be approximated by

z = G^{-1}(p) \approx u - \frac{c_0 + c_1 u + c_2 u^2}{1 + d_1 u + d_2 u^2 + d_3 u^3} \quad (0.5 \le p < 1),

where u = \sqrt{\ln[1/(1-p)^2]}, and c_0 = 2.515517, c_1 = 0.802853, c_2 = 0.010328, d_1 = 1.432788, d_2 = 0.189269, d_3 = 0.001308. The absolute error in this approximation is less than 4.5 × 10^{-4}. For 0 < p ≤ 0.5, z = G^{-1}(p) can first be rewritten as z = -G^{-1}(1 - p) and then evaluated using the same approximation.

• If necessary, parametric estimates of g_i can be computed by a least-squares fit to the points \{x_i^{(k)}, G^{-1}(\hat{F}_{x_i}(x_i^{(k)}))\}_{k=1}^{N}, with a Levenberg-Marquardt search algorithm (Gill, Murray, & Wright, 1981).

• In principle, any linear ICA algorithm is valid in the linear demixing stage, such as JADE (Cardoso & Souloumiac, 1993), FastICA (Hyvärinen, 1999a), the extended Infomax algorithm (Lee, Girolami, & Sejnowski, 1999), and others. However, in some cases the latent linear mixtures are far from gaussian, so the residual distortion in the recovered linear mixtures is considerable, and hence these methods may give different results. We found that methods that do not enforce perfect decorrelation between independent components, such as the Infomax algorithm (Bell & Sejnowski, 1995; Lee et al., 1999) and the natural gradient algorithm (Amari, Cichocki, & Yang, 1996), exhibit better separability. And if there is time correlation in the independent sources, methods using the time structure, such as SOBI (Belouchrani, Abed-Meraim, Cardoso, & Moulines, 1997), are more robust to the residual distortion.
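Step 1 of the procedure and the rational approximation of G^{-1} quoted above can be sketched as follows (Python; the helper names are ours, and `scipy.stats.norm.ppf` serves as a reference G^{-1}):

```python
import numpy as np
from scipy.stats import norm

def ecdf(x):
    # empirical cdf: rank/(N+1), which stays strictly inside (0, 1)
    ranks = np.argsort(np.argsort(x)) + 1.0
    return ranks / (len(x) + 1)

def gaussianize(x):
    # g_i = G^{-1} composed with the ecdf: maps one observed channel to ~N(0, 1)
    return norm.ppf(ecdf(x))

def norminv_approx(p):
    # Abramowitz & Stegun rational approximation of G^{-1}, |error| < 4.5e-4
    p = np.asarray(p, dtype=float)
    q = np.where(p >= 0.5, 1.0 - p, p)            # reflect into the lower tail
    u = np.sqrt(np.log(1.0 / q ** 2))
    c0, c1, c2 = 2.515517, 0.802853, 0.010328
    d1, d2, d3 = 1.432788, 0.189269, 0.001308
    z = u - (c0 + c1 * u + c2 * u ** 2) / (1.0 + d1 * u + d2 * u ** 2 + d3 * u ** 3)
    return np.where(p >= 0.5, z, -z)
```

For distinct samples, `gaussianize` is a strictly increasing function of the data, as required of g_i, and `norminv_approx` matches G^{-1} to within the stated tolerance.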
In general, the separation result with the gaussianization method is not accurate, because the method involves a rough approximation: it simply approximates the distribution of the linear mixtures by the gaussian distribution. However, its computational cost is very low. Solé-Casals et al. (2003) describe this method as a technique providing a pertinent initialization of the demixing system and improving the learning speed. We aim to find
a reasonable parametric model for the linear mixtures to provide a better estimate of them, and the gaussianization method will be used as an initialization step.

4 Extended Gaussianization

When the linear mixtures t_i are not so close to gaussian, we denote by g^{(1)} the nonlinear mapping constructed by gaussianization and introduce another nonlinear mapping g^{(2)} into the demixing system, as shown in Figure 2B. In this system, v = (v_1, v_2, \ldots, v_n)^T and the v_i follow the standard gaussian distribution. In other words, in the extended gaussianization method, the nonlinear stage in the demixing system consists of two parts. First, we use an invertible nonlinear mapping g^{(1)} to obtain v from the observed vector x by gaussianization:

v_i = g_i^{(1)}(x_i) = G^{-1}(F_{x_i}(x_i)). \quad (4.1)
After v has been obtained, we consider v as a new observation. The latent linear mixture z is then recovered from v by another invertible nonlinear mapping g^{(2)}, so that

g_i = g_i^{(2)} ∘ g_i^{(1)}. \quad (4.2)

Like g_i, the mappings g_i^{(1)} and g_i^{(2)} are strictly monotonically increasing. Once the nonlinear mapping g^{(2)} is estimated, we can construct the overall inverse nonlinear functions by equation 4.2.
4.1 Parametric Model for the Functions g_i^{(2)} Using the Cornish-Fisher Expansion. The functions g_i^{(2)} transform the standard gaussian variables v_i into the variables z_i. Since the distributions of z_i are assumed to be close to gaussian, the Cornish-Fisher (C-F) expansion helps to construct a simple and reasonable parametric model for g_i^{(2)}.

4.1.1 Cornish-Fisher Expansion. We aim to express the quantiles of a distribution close to gaussian in terms of those of the gaussian distribution. In fact, it is possible to obtain explicit polynomial expansions for standardized quantiles of a general distribution in terms of its normalized cumulants and the corresponding quantiles of the standard gaussian distribution (Johnson & Kotz, 1970). The basis of this kind of expansion is that if p(x) is a pdf with cumulants \kappa_1, \kappa_2, \ldots, then the function

g(x) = \exp\left( \sum_{j=1}^{\infty} \epsilon_j \frac{(-D)^j}{j!} \right) p(x) = \sum_{i=0}^{\infty} \frac{1}{i!} \left( \sum_{j=1}^{\infty} \epsilon_j \frac{(-D)^j}{j!} \right)^{i} p(x)
Figure 2: (A) The separating system with gaussianization. (B) The separating system used in the extended gaussianization method. In A, the components of z and the nonlinear transform g are constructed by gaussianization. In B, we do not construct the nonlinear transform g directly, but construct g^{(2)} ∘ g^{(1)} instead. g^{(1)} and v are constructed by gaussianization. g^{(2)} and W are tuned to minimize the mutual information between the y_i.
will have cumulants κ_1 + ε_1, κ_2 + ε_2, ..., where D is the differentiation operator and D^j p(x) = d^j p(x)/dx^j. Using this principle, we can obtain a useful approximate representation of a distribution with known cumulants in terms of a known distribution. The Edgeworth expansion and the Gram-Charlier expansion arise from choosing the standard gaussian distribution as this known distribution. Furthermore, quantiles of a distribution are totally determined by its pdf, and hence one can approximate the quantiles of a distribution close to gaussian using those of the standard gaussian distribution and known cumulants. Cornish and Fisher (1937) propounded a form of expansion in which the terms are polynomial functions of the appropriate standard gaussian quantile, with functions of known cumulants as coefficients. The four-term C-F expansion for the α-quantile is

x(v_α) ≅ v_α + (1/6)(v_α^2 - 1)κ_3 + (1/24)(v_α^3 - 3v_α)κ_4 - (1/36)(2v_α^3 - 5v_α)κ_3^2,   (4.3)

where κ_3 and κ_4 are the third-order and fourth-order cumulants (or skewness and kurtosis) of x, respectively, v_α is the α-quantile for the standard gaussian distribution, and x(v_α) is zero mean and of unit variance. Figure 3 shows the pdf's of x(v_α) obtained by equation 4.3 and the mapping functions from quantiles for the standard gaussian distribution to those for distributions determined by equation 4.3 with different values of κ_3 and κ_4.

K. Zhang and L.-W. Chan

Jaschke (2002) showed that the C-F approximation is a competitive technique if the target distribution is relatively close to normal. According to the CLT, the higher-order statistics of the sum of independent variables tend to vanish, so the sum can be modeled well with only the third- and fourth-order statistics. Indeed, the most prominent use and motivation of this expansion is to model the distribution of the normalized sum of independent variables.

4.1.2 Parametric Model for g_i^(2). Assume the distributions of the linear mixtures t_i are close to gaussian and that each t_i can be modeled by the four-term C-F expansion. According to equation 4.3, the recovered linear mixtures z_i can be expressed in terms of the gaussian signals v_i:

z_i = g_i^(2)(v_i | θ_i) = v_i + (1/6)(v_i^2 - 1)κ_{3,i} + (1/24)(v_i^3 - 3v_i)κ_{4,i} - (1/36)(2v_i^3 - 5v_i)κ_{3,i}^2.
(4.4)
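As a concrete sketch, the four-term map of equations 4.3 and 4.4 is just a cubic polynomial in the gaussian quantile:

```python
def cf_quantile(v, k3, k4):
    """Four-term Cornish-Fisher map (eqs. 4.3/4.4): send a standard
    gaussian quantile v to the quantile of a nearby distribution with
    third-/fourth-order cumulants k3 (skewness) and k4 (kurtosis)."""
    return (v
            + (v * v - 1.0) * k3 / 6.0
            + (v**3 - 3.0 * v) * k4 / 24.0
            - (2.0 * v**3 - 5.0 * v) * k3 * k3 / 36.0)
```

With k3 = k4 = 0 the map reduces to the identity, recovering plain gaussianization.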
Parameters in the nonlinear stage are θ_i = {κ_{3,i}, κ_{4,i}}, i = 1, 2, ..., n.

4.2 Learning Rule for Extended Gaussianization. Now we derive the learning rules for the unknown parameters in Figure 2B based on the minimization of the mutual information between the y_i.

4.2.1 Mutual Information: An Independence Criterion. Mutual information is a measure of the information that members of a set of random variables carry about the other random variables in the set.¹ In information theory, the mutual information I between n random variables y_i, i = 1, 2, ..., n, is defined as follows:

I(y_1, y_2, ..., y_n) = Σ_{i=1}^n H(y_i) - H(y),   (4.5)
where y = (y_1, y_2, ..., y_n)^T, and H(·) denotes the differential entropy.² Alternatively, mutual information can be interpreted as the Kullback-Leibler (KL) divergence between the joint pdf p_y(y) and the product of marginal densities Π_{i=1}^n p_{y_i}(y_i):

δ( p_y(y), Π_{i=1}^n p_{y_i}(y_i) ) = ∫ p_y(u) log [ p_y(u) / Π_{i=1}^n p_{y_i}(u_i) ] du.

Mutual information is always nonnegative, and it is zero if and only if the variables are independent (Cover & Thomas, 1991).

¹ Mutual information was originally defined for two variables. It has more than one multivariate generalization. For details, see Bell (2003).
² Shannon entropy is defined for a discrete variable. Differential entropy is the extension of Shannon entropy to continuous random variables.
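The identity I = Σ H(y_i) − H(y) of equation 4.5 also holds for discrete variables with Shannon entropy, which gives a quick numerical check of the nonnegativity and independence properties (a toy sketch of ours, not part of the paper's continuous formulation):

```python
from math import log2

def entropy(p):
    """Shannon entropy (in bits) of a discrete distribution given as probabilities."""
    return -sum(q * log2(q) for q in p if q > 0)

def mutual_information(joint):
    """I(X; Y) = H(X) + H(Y) - H(X, Y) for a joint table joint[x][y] (discrete eq. 4.5)."""
    px = [sum(row) for row in joint]           # marginal of X
    py = [sum(col) for col in zip(*joint)]     # marginal of Y
    pxy = [q for row in joint for q in row]    # flattened joint
    return entropy(px) + entropy(py) - entropy(pxy)
```

For an independent joint table the result is zero; any dependence makes it strictly positive.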
[Figure 3 plots omitted: both panels show curves for the standard normal and for (skewness, kurtosis) = (0, 2), (0.6, 0), (-1, 2), (-0.3, -0.7).]

Figure 3: Quantiles obtained by the C-F expansion with different cumulants κ_3 and κ_4. (A) Distributions of the quantiles obtained. (B) Quantiles obtained versus quantiles for the standard gaussian distribution, that is, the mapping x(·) defined in equation 4.3.
Mutual information is a reliable measure of independence between random variables. It has been widely used for constructing contrast functions in the ICA problem (Comon, 1994; Pham, 2000; Taleb & Jutten, 1999b).

4.2.2 Blind Separation of PNL Mixtures by Minimization of Mutual Information: A General Procedure. Let θ_i be the parameter set of the nonlinear function g_i in Figure 1. According to Figure 1, we have

z_i = g_i(x_i | θ_i),
(4.6)
y = Wz.
(4.7)
Since the demixing system for PNL mixtures has a cascade architecture, and the linear demixing stage minimizes the mutual information between the y_i given the variables z_i as inputs, the learning rule for W does not involve the parameters θ_i explicitly. It is the same as in the linear ICA model, except that the inputs are not the original observations but the estimated linear mixtures. Therefore, any learning rule for the linear ICA model based on minimization of mutual information can be used, for instance, the Bell-Sejnowski algorithm (Bell & Sejnowski, 1995; Cardoso, 1997),

ΔW ∝ [W^T]^{-1} + E[ψ(y) z^T],
(4.8)
where ψ is a component-wise vector function consisting of the so-called score functions ψ_{y_i} of the distributions of y_i = w_i^T z (w_i^T is the ith row of W), defined as

ψ_{y_i}(u) = (log p_{y_i}(u))' = p'_{y_i}(u) / p_{y_i}(u).   (4.9)
Multiplying the right-hand side of equation 4.8 by W^T W, we obtain

ΔW ∝ (I + E[ψ(y) y^T]) W,
(4.10)
which is known as the natural gradient, proposed by Amari (1998) and Cardoso and Laheld (1996). In order to make the y_i of unit variance when the algorithm converges, we modify equation 4.10 as follows:

ΔW ∝ ( I - diag{E(y_1^2), ..., E(y_n^2)} + E[ψ(y)y^T] - diag{E[ψ(y)y^T]} ) W,   (4.11)

where diag{E(y_1^2), ..., E(y_n^2)} denotes the diagonal matrix with E(y_1^2), ..., E(y_n^2) as its diagonal entries, and diag{E[ψ(y)y^T]} denotes the diagonal matrix whose diagonal entries are those on the diagonal of E[ψ(y)y^T]. Taleb and Jutten (1999b) derived the mutual information-based learning rule for the parameters θ_i in the nonlinear stage:

Δθ_i ∝ E[ ∂ log |g_i'(x_i | θ_i)| / ∂θ_i ] + E[ (∂g_i(x_i | θ_i) / ∂θ_i) Σ_{k=1}^n ψ_{y_k}(y_k) w_{ki} ].   (4.12)
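A minimal sketch of the variance-normalized natural-gradient step of equation 4.11, assuming NumPy, samples of z stored as columns of Z, and ψ(y) = −tanh(y) as a stand-in score function (the paper instead estimates ψ from a gaussian mixture, as described in section 4.3):

```python
import numpy as np

def natural_gradient_step(W, Z, lr=0.1, psi=lambda y: -np.tanh(y)):
    """One update of the demixing matrix W (cf. eq. 4.11).
    W: (n, n) demixing matrix; Z: (n, T) samples of the recovered
    linear mixtures z; psi: component-wise score function."""
    n, T = Z.shape
    Y = W @ Z                                  # current outputs y = Wz
    Ey2 = np.mean(Y**2, axis=1)                # E[y_i^2], drives unit variance
    C = (psi(Y) @ Y.T) / T                     # sample estimate of E[psi(y) y^T]
    dW = (np.eye(n) - np.diag(Ey2) + C - np.diag(np.diag(C))) @ W
    return W + lr * dW
```

At a stationary point the outputs have unit variance (so I − diag{E[y_i²]} vanishes) and E[ψ(y)y^T] is diagonal (so the last two terms cancel), which is exactly the fixed-point condition built into equation 4.11.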
4.2.3 Learning Rule for Extended Gaussianization. The learning rule for extended gaussianization is obtained by applying the general learning rule for separating PNL mixtures (Taleb & Jutten, 1999b) to the specific demixing procedure (g_i^(2), W). Let D_i be the derivative of g_i^(2)(v_i | θ_i) with respect to v_i:

D_i = g_i^(2)'(v_i | θ_i) = 1 + (1/3)κ_{3,i} v_i + (1/8)(v_i^2 - 1)κ_{4,i} - (1/36)κ_{3,i}^2 (6v_i^2 - 5).   (4.13)

Let φ = (φ_1, φ_2, ..., φ_n)^T = W^T ψ. Equation 4.12 becomes

Δκ_{3,i} ∝ E[ ( v_i - (1/6)κ_{3,i}(6v_i^2 - 5) ) / (3D_i) + (φ_i/6)( v_i^2 - 1 - (1/3)κ_{3,i}(2v_i^3 - 5v_i) ) ],   (4.14)

Δκ_{4,i} ∝ E[ (v_i^2 - 1) / (8D_i) + (φ_i/24)(v_i^3 - 3v_i) ].
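A sketch of the sample versions of equations 4.13 and 4.14 for one channel, where `vs` holds gaussianized samples of v_i and `phis` the corresponding values of φ_i (both names are ours):

```python
def D(v, k3, k4):
    """Derivative of the C-F map g^(2) (eq. 4.13)."""
    return (1.0 + k3 * v / 3.0 + (v * v - 1.0) * k4 / 8.0
            - k3 * k3 * (6.0 * v * v - 5.0) / 36.0)

def kappa_gradients(vs, phis, k3, k4):
    """Sample estimates of the gradient directions in eq. 4.14."""
    n = len(vs)
    g3 = sum((v - k3 * (6.0 * v * v - 5.0) / 6.0) / (3.0 * D(v, k3, k4))
             + phi / 6.0 * (v * v - 1.0 - k3 * (2.0 * v**3 - 5.0 * v) / 3.0)
             for v, phi in zip(vs, phis)) / n
    g4 = sum((v * v - 1.0) / (8.0 * D(v, k3, k4))
             + phi / 24.0 * (v**3 - 3.0 * v)
             for v, phi in zip(vs, phis)) / n
    return g3, g4
```

In an actual run these estimates would be scaled by a learning rate and added to κ_{3,i} and κ_{4,i} after each sweep.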
These two rules, together with equation 4.11, are the update rules for all the unknown parameters in the demixing system of PNL mixtures. Of course, suitable learning rates are also essential for the convergence speed and stability of the learning procedure. Estimation of the score function ψ is treated later.

4.2.4 Performance Analysis. The gaussianization technique is very simple, and its computational cost is very low. It is treated as a preprocessing step in the extended gaussianization method. After this step, the inverse of the remaining nonlinearity in each channel is modeled using only two unknown parameters (κ_{3,i} and κ_{4,i}). When the linear mixtures are not far from gaussian, our method is therefore more stable and computationally lighter than methods that exploit generic structures, such as MLPs and polynomials, to model the nonlinearity (Taleb & Jutten, 1999b; Lee et al., 1997).

In the extended gaussianization method, the component-wise function g^(1) ∘ f has elements that are strictly increasing functions mapping t_i to standard gaussian variables v_i. Given the same latent mixture t, different nonlinear functions f do not affect the realized value of v. Therefore, the performance of the extended gaussianization method is not affected by the choice of nonlinearities f_i.³

The extended gaussianization method can be considered a specific application of the general learning rule derived in Taleb and Jutten (1999b), with gaussianization as a preprocessing step and the C-F expansion representing the inverse of the remaining nonlinearities. The performance of this method depends on how well the C-F expansion can fit the actual distributions of the linear mixtures. Hence, in order to analyze the performance of our method, we first investigate the properties of the C-F expansion.

³ This is true only in the noiseless situation, and only on the training set. We thank one of the reviewers for pointing this out.
To ensure that the C-F expansion corresponds to the quantile of a distribution, that is, that the mapping x(v_α) (see equation 4.3) is one-to-one, equation 4.3 should be a strictly monotonic function of v_α. However, this is not satisfied for all possible values of (κ_3, κ_4). We can identify the region in the (κ_3, κ_4)-plane guaranteeing the strict monotonicity of the mapping x(v_α). For values of (κ_3, κ_4) in this region, the derivative

dx(v_α)/dv_α = 1 + (1/3)κ_3 v_α + (1/8)(v_α^2 - 1)κ_4 - (1/36)κ_3^2 (6v_α^2 - 5)
(4.15)
should be positive. Since equation 4.15 is a polynomial in v_α of degree at most 2, we can easily find the domain of (κ_3, κ_4) in which dx(v_α)/dv_α is positive for every value of v_α, as shown in Figure 4A. Values in this region always result in super-gaussian distributions of x(v_α). In practice, we can concentrate on a bounded range of v_α because the value of v_α is bounded; in this way, the monotonicity constraint is relaxed. For instance, we can enforce the monotonicity constraint for v_α between the 0.001 quantile (-3.09) and the 0.999 quantile (3.09) of the standard gaussian distribution. After some straightforward but tedious derivations, the domain for κ_3 and κ_4 is obtained, as sketched in Figure 4B. The region in this figure shows the feasibility of approximating z_i with the C-F expansion when z_i is close to gaussian, especially when it is super-gaussian. When (κ_3, κ_4) lies in the lower part of Figure 4B, the C-F expansion may become less and less reliable as α → 0 (or α → 1).

Due to the limits of approximating a distribution by the C-F expansion, the performance of the extended gaussianization method varies with the characteristics of the distribution of the linear mixture:

• When the linear mixtures t_i are close to gaussian and their statistics (κ_{3,i}, κ_{4,i}) lie in the domain shown in Figure 4A, the C-F expansion models the linear mixtures t_i well, and hence good performance is achieved.

• When the t_i are close to gaussian, (κ_{3,i}, κ_{4,i}) lie in the domain indicated by the lower part of Figure 4B, and the monotonicity of the C-F expansion is guaranteed for all samples of v_i, the linear mixtures t_i can still be modeled well by the C-F expansion, which also results in good performance.

• When the t_i are close to gaussian, (κ_{3,i}, κ_{4,i}) lie in the domain indicated by the lower part of Figure 4B, and some samples of v_i are so large as to violate the monotonicity of the C-F expansion, we can drop these samples, and the extended gaussianization method still works. This is because mutual information, as an independence measure, is robust to outliers. But inevitably the performance declines compared to the first two cases.
Figure 4: The domain of (κ_3, κ_4) guaranteeing monotonicity of the four-term C-F expansion, equation 4.3. (A) Theoretical solution. (B) The domain guaranteeing monotonicity for v_α ∈ [-3.09, 3.09]. It consists of two parts; the upper part is the same as the domain in A.
• When the distributions of the linear mixtures are far from the gaussian distribution, for instance, when they are multimodal, this method may fail.

4.3 Estimation of the Score Function. In ICA algorithms for linear mixtures, estimation of the score functions (see equation 4.9) does not have to be very accurate. In practice, just two fixed nonlinearities (one for super-gaussian signals and the other for sub-gaussian signals) are sufficient to approximate the score functions (Lee et al., 1999). However, separation performance for nonlinear mixtures depends greatly on how well the score functions are estimated (Taleb & Jutten, 1999b).

Currently, there are three main ways to estimate the score function. The first is to estimate the pdf p_{y_i} directly, compute its derivative, and form the score function from the two. Many techniques, such as kernel density estimation, can be used to estimate the pdf. Although kernel estimation may provide a noisy estimate, its simplicity makes it popular when the number of samples is small; however, its heavy memory demand makes it unsuitable when the number of samples is large. The second way is to use the Gram-Charlier expansion to approximate the pdf p_{y_i} and to derive the score function from this expansion. The performance of this method is good when the nonlinearities are not strong, but it is poor in the case of strong nonlinearities, owing to the limited accuracy of the Gram-Charlier expansion (Taleb & Jutten, 1999b). The third way is to use nonlinear models, such as nonlinear regressors and MLPs, to estimate the score function directly. But for nonlinear regressors, it is hard to choose suitable projection basis functions, and for MLPs, many parameters need to be learned.

Here we propose to use a mixture of gaussians to model each output density. The mixture density model is widely used to model an arbitrary density.
In Xu, Cheung, Yang, and Amari (1997), the cdf of a mixture of densities is used to model the nonlinearity in Bell and Sejnowski's information-theoretic ICA framework. In the independent factor analysis model (Attias, 1999), each source density is described by a mixture of gaussians. In our experiments, the density of output y_i is modeled as a mixture of n_i gaussians, where the q_i-th gaussian has mean m_{i,q_i}, variance v_{i,q_i}, and mixing weight π_{i,q_i}:

p_{y_i}(u) = Σ_{q_i=1}^{n_i} π_{i,q_i} G(u | m_{i,q_i}, v_{i,q_i}).   (4.16)
Then the score function can be obtained analytically:

ψ_{y_i}(u) = p'_{y_i}(u) / p_{y_i}(u) = -Σ_{q_i=1}^{n_i} p(q_i | u) (u - m_{i,q_i}) / v_{i,q_i},   (4.17)
where p(q_i | u) denotes the posterior probability that u, as a realization of y_i, is generated by the q_i-th gaussian. During the learning process, all parameters in the mixture model are tuned to track the current outputs based on maximum likelihood. The expectation-maximization (EM) algorithm is used to maximize the likelihood. After each update of the demixing system, the parameters in the mixture model are updated according to the EM algorithm (N denotes the number of samples, and the subscript t in y_{i,t} denotes the tth sample of y_i):

p^(k)(q_i | y_{i,t}) = π^(k)_{i,q_i} G(y_{i,t} | m^(k)_{i,q_i}, v^(k)_{i,q_i}) / Σ_{q_i=1}^{n_i} π^(k)_{i,q_i} G(y_{i,t} | m^(k)_{i,q_i}, v^(k)_{i,q_i}),   (4.18)

π^(k+1)_{i,q_i} = (1/N) Σ_{t=1}^N p^(k)(q_i | y_{i,t}),   (4.19)

m^(k+1)_{i,q_i} = Σ_{t=1}^N p^(k)(q_i | y_{i,t}) y_{i,t} / Σ_{t=1}^N p^(k)(q_i | y_{i,t}),   (4.20)

v^(k+1)_{i,q_i} = ( 1 / Σ_{t=1}^N p^(k)(q_i | y_{i,t}) ) Σ_{t=1}^N p^(k)(q_i | y_{i,t}) (y_{i,t} - m^(k+1)_{i,q_i})^2.   (4.21)
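A compact sketch of the mixture score (cf. equation 4.17) and one EM update (cf. equations 4.18-4.21) for a single output channel; `params` is our name for a list of (weight, mean, variance) triples:

```python
from math import exp, pi, sqrt

def gauss(u, m, v):
    """Gaussian density G(u | m, v) with mean m and variance v."""
    return exp(-(u - m) ** 2 / (2.0 * v)) / sqrt(2.0 * pi * v)

def gmm_score(u, params):
    """Score of a gaussian mixture, psi(u) = p'(u)/p(u) (cf. eq. 4.17)."""
    dens = [w * gauss(u, m, v) for (w, m, v) in params]
    total = sum(dens)
    return -sum((d / total) * (u - m) / v for d, (_, m, v) in zip(dens, params))

def em_step(samples, params):
    """One EM update of the mixture parameters (cf. eqs. 4.18-4.21)."""
    # E-step: responsibilities p(q | y_t), eq. 4.18
    resp = []
    for y in samples:
        dens = [w * gauss(y, m, v) for (w, m, v) in params]
        s = sum(dens)
        resp.append([d / s for d in dens])
    # M-step: weight, mean, variance updates, eqs. 4.19-4.21
    n = len(samples)
    new_params = []
    for q in range(len(params)):
        r = [resp[t][q] for t in range(n)]
        sr = sum(r)
        w = sr / n
        m = sum(rt * y for rt, y in zip(r, samples)) / sr
        v = sum(rt * (y - m) ** 2 for rt, y in zip(r, samples)) / sr
        new_params.append((w, m, v))
    return new_params
```

For a single standard gaussian component the score reduces to ψ(u) = −u, as expected for a gaussian density.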
When the nonlinear stage (with parameters θ_i, i = 1, ..., n), the linear stage (with parameter W), and the EM stage for modeling the output densities (with parameters π_{i,q_i}, m_{i,q_i}, v_{i,q_i}, i = 1, ..., n and q_i = 1, ..., n_i) have all converged, the learning process terminates. We observe that with this method, all parameters exhibit a small oscillation during learning. The reason is that the parameters in the nonlinear stage are very sensitive to the estimated output densities, and the parameters in the linear stage and the EM stage depend crucially on those in the nonlinear stage. We have two ways to eliminate the oscillation. The first is to let the EM algorithm iterate several times, instead of just once, after each update of the demixing system, so as to reach a stable distribution. The second is to reduce the learning rate in the nonlinear stage as training goes on.

5 Experiments

In this section, we present some experiments to show the validity of our methods. When the number of independent sources increases, their linear mixtures are generally closer to gaussian, so the accuracy of our methods improves. In our experiments, we use only two sources, for illustration purposes. In order to compare the performance of different methods quantitatively, the classical residual cross talk C(y, s) = 10 log_10 E[(y - s)^2] is used as the performance index, in which y is an estimate of s; by multiplying by constants, y and s are made to be of unit variance and positively correlated.
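The performance index can be sketched as follows, normalizing both signals to zero mean and unit variance and flipping the sign of the estimate if needed so the correlation is positive (the zero-mean step is our assumption; the paper only states the unit-variance and positive-correlation normalization):

```python
from math import log10, sqrt

def crosstalk_db(y, s):
    """Residual cross talk C(y, s) = 10 log10 E[(y - s)^2], after scaling
    y and s to zero mean, unit variance, and positive correlation."""
    def standardize(x):
        n = len(x)
        m = sum(x) / n
        sd = sqrt(sum((xi - m) ** 2 for xi in x) / n)
        return [(xi - m) / sd for xi in x]
    y, s = standardize(y), standardize(s)
    if sum(a * b for a, b in zip(y, s)) < 0:  # enforce positive correlation
        y = [-a for a in y]
    mse = sum((a - b) ** 2 for a, b in zip(y, s)) / len(s)
    return 10.0 * log10(mse)
```

Lower (more negative) values indicate a better recovery; the measure is invariant to the sign and scale of the estimate.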
In the first experiment, the two independent sources are an amplitude-modulated signal, s_1(t) = cos(0.0157t) sin(0.2t), and gamma(3,2)-distributed noise. They have been made zero mean and of unit variance. The linear mixing matrix is chosen randomly:

A = [ 0.0691  0.1257
      0.8821  0.9395 ].
The nonlinearities chosen are ones said to be "hard" in Taleb and Jutten's work:

f_1(t_1) = (5t_1)^3,   f_2(t_2) = t_2 + 2 tanh(t_2).
Figure 5 shows the independent sources s_i, their linear mixtures t_i, and the distributions of the PNL observations x_i. We can see that the t_i are skewed and super-gaussian. The learning rule for the nonlinear stage is described in equation 4.14. The parameters in the nonlinear stage are {κ_{3,i}, κ_{4,i}}, i = 1, 2; we initialized all of them to 0. The demixing matrix W in the linear demixing stage was initialized as the identity matrix. The learning rate in the nonlinear stage is 0.5, and that in the linear demixing stage is 0.2. For simplicity, the empirical cdf F̂_{x_i} is used as the estimate of the cdf F_{x_i}. In the estimation of the score function, the gaussian mixture model consists of five gaussians.

After about 800 iterations, the learning process converges. At convergence, the residual distortion in each channel, the independent components y_i, and their joint distributions are given in Figure 6. The performance index is C(s_1, y_1) = -27.8 dB, C(s_2, y_2) = -27.7 dB. Figure 6A compares the residual distortion obtained by the extended gaussianization method with that obtained by gaussianization. The independent components obtained by gaussianization are given in Figure 7; they are shown for comparison with the separation result of the extended gaussianization method. The performance index of the gaussianization method is C(s_1, y_2) = -11.1 dB, C(s_2, y_1) = -10.5 dB. We can see that the separation result of the extended gaussianization method is much better than that obtained by gaussianization.

The pdf's generated by the C-F expansion with the parameters {κ_{3,i}, κ_{4,i}} at convergence are shown in Figure 8 by the solid line, together with the distribution histograms of the linear mixtures t_i (which have been normalized to unit variance). These pdf's fit the actual distributions of the t_i well.
However, due to the limited freedom of the C-F expansion, and more important, due to the particularities of the actual distributions of the mixtures t_i (e.g., both independent sources are bounded or left bounded), the distributions of the linear mixtures estimated by the C-F expansion might not approach their actual distributions so well. This can also be seen by comparing the cumulants of the true mixtures and those of the estimated ones. The skewness and kurtosis of t_1 are 0.92 and 1.37, and those of t_2 are 0.50 and 0.62, respectively, while z_1 has skewness 0.79 and kurtosis 0.98, and z_2 has 0.41 and 0.20. (The parameters at convergence are κ_{3,1} = 0.77, κ_{4,1} = 1.20, κ_{3,2} = 0.43, and κ_{4,2} = 0.35.) Figure 9 shows the output densities estimated by the gaussian mixture model, as well as their normalized distribution histograms. We can see that the gaussian mixture model has adapted to the output densities.

Figure 5: (A) Independent sources used in the first experiment. The column on the right shows their normalized histograms. For comparison, the gaussian distribution with the same mean and variance is given by the dashed curve. (B) Linear mixtures and their histograms. (C) Scatter plots of the sources, linear mixtures, and PNL observations in the first experiment (left to right).

Figure 6: Separation result by extended gaussianization. (A) Residual distortion g_i ∘ f_i in each channel. (For comparison, that obtained by gaussianization is shown by the dash-dot line.) (B) Recovered independent components (the original sources are shown by the dash-dot line). (C) Scatter plots of the recovered linear mixtures and independent components. Performance index: C(s_1, y_1) = -27.8 dB, C(s_2, y_2) = -27.7 dB.

Figure 7: Separation result (independent components) obtained by gaussianization (shown for comparison). The original sources are shown by the dash-dot line. Performance index: C(s_1, y_2) = -11.1 dB, C(s_2, y_1) = -10.5 dB.

Figure 8: Distributions constructed by the C-F expansion with parameter values at convergence (solid line) and distribution histograms of the linear mixtures t_i (normalized to unit variance). Gaussian distributions are given by the dash-dot line for comparison.
Figure 9: Output distribution histograms and their densities learned by the gaussian mixture model (five gaussians are used).
We also compare our result with that obtained by the method described in Taleb and Jutten (1999b), which uses an MLP to model the nonlinearities g_i. A two-layer MLP with five hidden neurons is used; the sigmoid activation function is used in the hidden layer, and the other neurons are linear. Even with gaussianization as an initialization step, it takes at least 2500 iterations to achieve the same performance index (-27.8 dB for the first source and -27.7 dB for the second), which is similar to the result Taleb and Jutten reported. Compared to the extended gaussianization method, the MLP method is much more time-consuming. Moreover, its learning is more prone to local optima.

In order to demonstrate that our method also performs well in other cases, we repeat the above experiment for 20 runs, and in each run the (1, 1)th entry of the mixing matrix A is randomly chosen between 0 and 0.15. The independent sources are always successfully recovered by the extended gaussianization method. In the worst of the 20 runs, the performance index is -23.6 dB (for the first source) and -23.3 dB (for the second source). And the extended gaussianization method always outperforms the gaussianization method: its average performance index is -26.1 dB (with standard deviation 1.3) and -27.2 dB (with standard deviation 1.7), while the gaussianization method has an average performance index of -8.9 dB (with standard deviation 4.5) and -8.7 dB (with standard deviation 3.8).

In cases where the linear mixtures are far from gaussian, the assumptions used in both gaussianization and the extended gaussianization method break down. As a demonstration, we set the (2, 2)th entry of the mixing matrix A to 0 and repeat the above experiment. Now the second linear mixture t_2 is far from gaussian, as can be seen from the distribution of s_1 shown in Figure 5A. Both gaussianization and the extended gaussianization method
behave poorly. The performance index of the gaussianization method is -8.6 dB and -9.9 dB, and that of the extended gaussianization method is -9.1 dB and -9.6 dB.

In another experiment, we modify the nonlinear functions f_i:

f_1(t_1) = (5t_1)^3 if t_1 ≥ 0, and f_1(t_1) = -√(-t_1) if t_1 < 0;
f_2(t_2) = t_2 + 2 tanh(t_2) if t_2 ≥ 0, and f_2(t_2) = t_2 if t_2 < 0.

These nonlinearities are neither smooth nor symmetrical. The other settings are the same as in the first experiment. The separation result obtained by the extended gaussianization method is exactly the same as that in the first experiment, which verifies the robustness of our method to different nonlinearities.

We also test our method with natural audio sources. The two sources are a speech signal (with kurtosis 6.1) and a music signal (with kurtosis 0.5), as shown in Figure 10A. They are mixed by a randomly chosen matrix; the linear mixtures are shown in Figure 10B. The nonlinearities in both channels are the same as in the first experiment. Since the separation result is insensitive to the nonlinearities, the PNL mixtures are not shown in the figure. The learning rates in the nonlinear stage and the linear stage are both 0.2. After about 900 iterations, the learning process converges. The parameters at convergence are κ_{3,1} = -0.04, κ_{4,1} = 1.06, κ_{3,2} = -0.02, and κ_{4,2} = 0.34. The recovered sources are shown in Figure 11A, with performance index C(s_1, y_1) = -18.1 dB, C(s_2, y_2) = -25.6 dB. From the performance index, we can see that extended gaussianization achieves good separation; this can also be confirmed by listening to the recovered signals and the original sources, between which we perceive very little cross talk. Figure 11B shows the signals recovered by gaussianization, and Figure 11C compares the residual distortion obtained using the two methods. Clearly the separation result of the extended gaussianization method is much better.
6 Conclusion and Discussion

In this article, we have proposed the extended gaussianization method to tackle the problem of blind source separation of post-nonlinear mixtures when the linear mixture of independent sources is close to gaussian. The gaussianization technique assumes the linear mixtures are gaussian and estimates the inverse nonlinearities directly. Our method extends gaussianization to cases where the linear mixtures are not exactly gaussian, only not far from gaussian. Gaussianization can be considered a preprocessing step for our method; after this step, the inverse of the remaining nonlinear distortion is modeled by the Cornish-Fisher expansion. Consequently, the inverse nonlinearity can be learned with only a few unknown parameters. The region of these parameters guaranteeing the validity of the Cornish-Fisher expansion
Figure 10: Two-channel audio data set. (A) Waveforms of the original sources. (B) Linear mixtures of the original sources.
is also derived. These parameters, as well as the linear demixing matrix, are learned based on the minimization of the mutual information between the outputs. Although this region indicates that the linear mixtures should not be far from gaussian, this is a reasonable assumption according to the central limit theorem. In the noiseless case, the performance of the method is not affected by the particular form of the nonlinearity generating the observations.

Compared to the gaussianization method, our method extends the region of validity to include distributions in the vicinity of the gaussian by using parameter adaptation. Compared with other adaptive methods for blind separation of PNL mixtures, our method reduces the problem complexity and simplifies the learning procedure by exploiting the assumption that linear mixtures of independent variables are close to gaussian. The nonlinearity is modeled with the help of the Cornish-Fisher expansion, and this specific parametric model has good flexibility and adaptability under the assumption of our method. Experimental results have been given to support the theoretical claims. In summary, our extended gaussianization method combines two major approaches to blind separation of post-nonlinear mixtures and gives good separation performance at low computational cost.
Figure 11: Separation result for PNL mixtures of audio signals. (A) Recovered sources using the extended gaussianization method. Performance index: C(s_1, y_1) = -18.1 dB, C(s_2, y_2) = -25.6 dB. (B) Recovered sources with gaussianization. Performance index: C(s_1, y_1) = -10.6 dB, C(s_2, y_2) = -17.0 dB. (C) Residual distortion g_i ∘ f_i in each channel.
Acknowledgments

The work reported here was partially supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China. We are very grateful to the anonymous referees for their valuable comments and suggestions, which led to a strong improvement of this article.

References

Abramowitz, M., & Stegun, I. A. (1968). Handbook of mathematical functions. New York: Dover.

Amari, S.-I. (1998). Natural gradient works efficiently in learning. Neural Computation, 10, 251-276.

Amari, S.-I., & Cardoso, J.-F. (1997). Blind source separation—semiparametric statistical approach. IEEE Trans. on Signal Processing, 45, 2692-2700.

Amari, S.-I., Cichocki, A., & Yang, H. H. (1996). A new learning algorithm for blind signal separation. In D. Touretzky, M. Mozer, & M. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 757-763). Cambridge, MA: MIT Press.

Attias, H. (1999). Independent factor analysis. Neural Computation, 11, 803-851.

Bell, A. J. (2003). The co-information lattice. In Proc. ICA2003 (pp. 921-926). Nara, Japan.

Bell, A. J., & Sejnowski, T. J. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7, 1129-1159.

Belouchrani, A., Abed-Meraim, K., Cardoso, J.-F., & Moulines, E. (1997). A blind source separation technique using second-order statistics. IEEE Trans. on Signal Processing, 45, 434-444.

Breiman, L., & Friedman, J. (1985). Estimating optimal transformations for multiple regression and correlation. Journal of the American Statistical Association, 80, 580-598.

Cardoso, J.-F. (1997). Infomax and maximum likelihood for source separation. IEEE Signal Processing Letters, 4, 112-114.

Cardoso, J.-F. (1998). Blind signal separation: Statistical principles. Proceedings of the IEEE, 86(10), 2009-2025.

Cardoso, J.-F., & Laheld, B. (1996). Equivariant adaptive source separation. IEEE Trans. on Signal Processing, 44, 3017-3030.

Cardoso, J.-F., & Souloumiac, A. (1993). Blind beamforming for non-gaussian signals. IEE Proceedings-F, 140, 362-370.

Chen, S. S., & Gopinath, R. A. (2001). Gaussianization. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 423-429). Cambridge, MA: MIT Press.

Comon, P. (1994). Independent component analysis—a new concept? Signal Processing, 36, 287-314.

Cornish, E. A., & Fisher, R. A. (1937). Moments and cumulants in the specification of distributions. Review of the International Statistical Institute, 5, 307-320.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley. Gill, P. E., Murray, W., & Wright, M. H. (1981). Practical optimization. New York: Academic Press. Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2, 359–366. Hyv¨arinen, A. (1999a). Fast and robust fixed-point algorithms for independent component analysis. IEEE Transactions on Neural Networks, 10(3), 626–634. Hyv¨arinen, A. (1999b). Survey on independent component analysis. Neural Computing Surveys, 2, 94–128. Hyv¨arinen, A., Karhunen, J., & Oja, E. (2001). Independent component analysis. New York: Wiley. Hyv¨arinen, A., & Oja, E. (2000). Independent component analysis: Algorithms and applications. Neural Networks, 13, 411–430. Hyv¨arinen, A., & Pajunen, P. (1999). Nonlinear independent component analysis: Existence and uniqueness results. Neural Networks, 12, 429–439. Jaschke, S. R. (2002). The Cornish-Fisher expansion in the context of deltagamma-normal approximations. Journal of Risk, 4, 33–52. Johnson, N. L., & Kotz, S. (1970). Continuous univariate distributions (Vol. 1). New York: Wiley. Jutten, C., & H´erault, J. (1991). Blind separation of sources, part I: An adaptive algorithm based on neuromimetic architecture. Signal Processing, 24, 1–10. Jutten, C., & Taleb, A. (2000). Source separation: From dusk till dawn. In 2nd International Workshop on Independent Component Analysis and Blind Signal Separation (ICA 2000) (pp. 15–26). Helsinki, Finland. Lee, T.-W., Girolami, M., & Sejnowski, T. (1999). Independent component analysis using an extended infomax algorithm for mixed sub-gaussian and supergaussian sources. Neural Computation, 11, 417–441. Lee, T.-W., Koehler, B., & Orglmeister, R. (1997). Blind source separation of nonlinear mixing models. In N. Morgan, J. Principe, E. Wilson, & L. Gile (Eds.), Neural Networks for Signal Processing, 7 (pp. 406–415). Piscataway, NJ: IEEE Press. 
Mansour, A., Barros, A., & Ohnishi, N. (2000). Blind separation of sources: Methods, assumptions and applications. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, Special Section on Digital Signal Processing in IEICE EA., E83-A, 1498–1512. Neuts, M. F. (Ed.). (1995). Advanced probability theory (2nd ed.). New York: Marcel Dekker. Oja, E. (1997). The nonlinear PCA learning rule in independent component analysis. Neurocomputing, 17, 25–45. Pham, D.-T. (2000). Blind separation of instantaneous mixture of sources based on order statistics. IEEE Trans. on Signal Processing, 48, 363–375. Pham, D.-T. (2002). Mutual information approach to blind separation of stationary sources. IEEE Trans. on Information Theory, 48, 1935–1946. Pham, D.-T., & Garat, P. (1997). Blind separation of mixture of independent sources through a quasi-maximum likelihood approach. IEEE Trans. on Signal Processing, 45, 1712–1725.
452
K. Zhang and L.-W. Chan
Sol-Casals, J., Babaie-Zadeh, M., Jutten, C., & Pham, D.-T. (2003). Improving algorithm speed in PNL mixture separation and Wiener system inversion. Proc. ICA2003 (pp. 639–644). Nara, Japan. Taleb, A., & Jutten, C. (1997). Nonlinear source separation: The post-nonlinear mixtures. In Proc. ESANN (pp.179–284). Bruge, Belgium. Taleb, A., & Jutten, C. (1999a). Batch algorithm for source separation in postnonlinear mixtures. In Proc. First Int. Workshop on Independent Component Analysis and Signal Separation (ICA’99) (pp. 155–160). Aussois, France. Taleb, A., & Jutten, C. (1999b). Source separation in post-nonlinear mixtures. IEEE Trans. on Signal Processing, 47, 2807–2820. Xu, L., Cheung, C., Yang, H., & Amari, S.-I. (1997). Independent component analysis by the information-theoretic approach with mixture of densities. Proc. of 1997 IEEE Intl. Conf on Neural Networks (IEEE-INNS IJCNN97) (pp. 1821–1826). Houston, TX. Yang, H. H., Amari, S.-I., & Cichocki, A. (1998). Information-theoretic approach to blind separation of sources in nonlinear mixture. Signal Processing, 64, 291–300. Ziehe, A., Kawanabe, M., Harmeling, S., & Mueller, K.-R. (2001). Separation of post-nonlinear mixtures using ACE and temporal decorrelation. In Proc. Int. Workshop on Independent Component Analysis and Blind Signal Separation (ICA2001) (pp. 433–438). Ziehe, A., Kawanabe, M., Harmeling, S., & Mueller, K.-R. (2003). Blind separation of post-nonlinear mixtures using Gaussianizing transformations and temporal decorrelation. InProc. ICA2003 (pp. 269–274). Received November 7, 2003; accepted July 2, 2004.
LETTER
Communicated by Stephen Roberts
Bayesian Analysis of Nonlinear Autoregression Models Based on Neural Networks

A. Menchero
[email protected]
R. Montes Diez
[email protected]
D. Ríos Insua
[email protected]
GECD, Rey Juan Carlos University, Madrid 28925, Spain

P. Müller
[email protected]
M. D. Anderson Cancer Center, University of Texas, Houston, TX 77030, U.S.A.
We show how Bayesian neural networks can be used for time-series analysis. We consider a block-based model-building strategy to model linear and nonlinear features within the time series: a linear combination of a linear autoregression term and a feedforward neural network (FFNN) with an unknown number of hidden nodes. To allow for simpler models, we also consider these terms separately as competing models to select from. Model identifiability problems arise when FFNN sigmoidal activation functions exhibit almost linear behavior or when there are almost duplicate or irrelevant neural network nodes. New reversible-jump moves are proposed to facilitate model selection, mitigating the identifiability problems. We illustrate this methodology by analyzing several time-series data examples.
1 Introduction

Many neural network models have been applied to time-series analysis and forecasting. There has been some interest in using recurrent networks such as Elman networks (Elman, 1990), Jordan networks (Jordan, 1986), and real-time recurrent learning networks (Williams & Zipser, 1989). However, the model most frequently used is the multilayer feedforward neural network (FFNN). Many articles compare FFNNs and standard statistical methods for time-series analysis (e.g., Tang, de Almeida, & Fishwick, 1991; Foster, Collopy, & Ungar, 1992; Stern, 1996; Hill, O'Connor, & Remus, 1996; Faraway & Chatfield, 1998). Several articles have found FFNNs superior to linear methods such as ARIMA models for several time-series problems and
Neural Computation 17, 453–485 (2005)
© 2005 Massachusetts Institute of Technology
comparable to other nonlinear methods like generalized additive models or projection pursuit regression. We will consider FFNNs to model nonlinear autoregressions. The net output will represent the time-series predicted value when past values of the series are given as net inputs. Fitting an FFNN model requires many choices about the model structure: activation functions, number of hidden layers, number of hidden nodes, inputs, and so on. A nonlinear optimization algorithm is typically used to estimate weights by optimizing some performance criterion, for example, minimization of mean square error with weight decay. To fix the model structure, rules of thumb are traditionally used. Remus, O'Connor, and Griggs (1998) suggest appropriate rules for FFNNs in time-series forecasting. Alternatively, a Bayesian approach to FFNN modeling (MacKay, 1992; Neal, 1996; Müller & Ríos Insua, 1998; Ríos Insua & Müller, 1998; Holmes & Mallick, 1998; Andrieu, de Freitas, & Doucet, 1999) provides a coherent framework to deal with these issues. FFNN parameters are regarded as random variables whose posterior distribution is inferred in the light of data. Most important, perhaps, we may include the number of hidden nodes as an additional parameter and model its uncertainty. Predictions are obtained by averaging over all possible models and parameter values according to their posterior distributions. In this letter, we introduce a Bayesian FFNN forecasting model for time-series data. We present an inference scheme based on Markov chain Monte Carlo (MCMC) simulation. For model selection, the standard birth and death reversible-jump moves developed by Müller and Ríos Insua (1998) result in a slowly mixing MCMC. Instead, we propose new reversible-jump moves to add or delete special kinds of nodes, characterized as linearized, irrelevant, or duplicate.
Two examples are used to illustrate the methodology: the lynx time-series data (Priestley, 1988) and the airline passengers data (Box & Jenkins, 1970).

2 Model Definition

Consider univariate time-series data {y_1, y_2, …, y_N}. We would like to model the generating stochastic process in an autoregressive fashion,

p(y_1, y_2, …, y_N) = p(y_1, …, y_q) ∏_{t=q+1}^{N} p(y_t | y_{t−1}, y_{t−2}, …, y_{t−q}).
We shall assume that each y_t is modeled by a nonlinear autoregression function of q past values plus a normal error term:

y_t = f(y_{t−1}, y_{t−2}, …, y_{t−q}) + ε_t,  ε_t ∼ N(0, σ²),  t = q + 1, …, N,

so that

y_t | y_{t−1}, y_{t−2}, …, y_{t−q} ∼ N(f(y_{t−1}, y_{t−2}, …, y_{t−q}), σ²),  t = q + 1, …, N.
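The generative assumption above is easy to simulate directly. A minimal sketch, with a made-up autoregression function f (nothing here is fitted to real data):

```python
import numpy as np

def simulate_nar(f, q, N, sigma, y_init, rng):
    """Simulate y_t = f(y_{t-1}, ..., y_{t-q}) + eps_t with eps_t ~ N(0, sigma^2).
    The first q values y_init are taken as known, matching the text."""
    y = np.empty(N)
    y[:q] = y_init
    for t in range(q, N):
        past = y[t - q:t][::-1]          # (y_{t-1}, ..., y_{t-q})
        y[t] = f(past) + rng.normal(0.0, sigma)
    return y

# Hypothetical nonlinear autoregression of order q = 2 (illustrative only).
f = lambda past: 0.5 * past[0] + np.tanh(past[1])
series = simulate_nar(f, q=2, N=200, sigma=0.1,
                      y_init=[0.0, 0.1], rng=np.random.default_rng(0))
```
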
We complete the model specification using a block-based strategy to describe f. We propose a mixed model as a linear combination of a linear autoregression term and an FFNN. An FFNN model with q input nodes, one hidden layer with M hidden nodes, one output node, and activation function ϕ is a model relating a response variable y_t and q explanatory variables, in our case x_t = (y_{t−1}, …, y_{t−q}):

y_t(x_t) = ∑_{j=1}^{M} β_j ϕ(x_t′γ_j + δ_j),

with β_j ∈ R, γ_j ∈ R^q. The biases δ_j may be assimilated into the γ_j vectors if we consider an additional input with constant value one, say x_t = (1, y_{t−1}, …, y_{t−q}), so that γ_j = (γ_{0j}, γ_{1j}, …, γ_{qj}) ∈ R^{q+1}. To be specific, in the following we will assume a logistic activation function, ϕ(z) = exp(z)/(1 + exp(z)); however, the discussion remains valid for any other sigmoidal function. In the proposed model, the linear term accounts for linear features, whereas the FFNN term takes care of nonlinear ones:

f(y_{t−1}, y_{t−2}, …, y_{t−q}) = x_t′λ + ∑_{j=1}^{M} β_j ϕ(x_t′γ_j),  t = q + 1, …, N.  (2.1)
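Equation 2.1 translates directly into a few lines of NumPy. Only the functional form follows the text; all parameter values below are made up for illustration:

```python
import numpy as np

def logistic(z):
    # phi(z) = exp(z) / (1 + exp(z)) = 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

def mixed_model(x, lam, beta, gamma):
    """Equation 2.1: f(x) = x'lambda + sum_j beta_j * phi(x'gamma_j).
    x: (q+1,) input with a leading 1 for the bias; lam: (q+1,);
    beta: (M,); gamma: (M, q+1)."""
    return x @ lam + beta @ logistic(gamma @ x)

# Illustrative (made-up) values with q = 2 inputs and M = 2 hidden nodes.
x = np.array([1.0, 0.3, -0.2])            # (1, y_{t-1}, y_{t-2})
lam = np.array([0.1, 0.5, 0.0])
beta = np.array([1.0, -0.5])
gamma = np.array([[0.0, 1.0, 0.0],
                  [0.5, 0.0, 1.0]])
f_val = mixed_model(x, lam, beta, gamma)
```

Setting `beta` to zeros recovers the linear autoregression special case, and dropping `x @ lam` recovers the pure FFNN case, mirroring the discussion around equations 2.2 and 2.3.
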
Initially, the parameters in our mixed model are the linear coefficients λ = (λ_0, λ_1, …, λ_q) ∈ R^{q+1}, the hidden-to-output weights β = (β_1, β_2, …, β_M), the input-to-hidden weights γ = (γ_1, γ_2, …, γ_M), and the error variance σ². As there will be uncertainty about the number of hidden nodes M to include, we shall model this uncertainty by considering M as an unknown parameter. Note that the mixed model, equation 2.1, embeds as particular cases the standard linear autoregression model, when M = 0, and the FFNN model, when λ = 0. We shall assume the autoregressive order q to be known in advance. If desired, one could model uncertainty about q as a problem of variable selection in a model selection context. To allow for simpler models, we also consider the linear and nonlinear terms separately as competing models to select from: a simple linear autoregression model,

y_t = f(y_{t−1}, y_{t−2}, …, y_{t−q}) = x_t′λ,  t = q + 1, …, N,  (2.2)

and a nonlinear autoregression feedforward neural net model,

y_t = f(y_{t−1}, y_{t−2}, …, y_{t−q}) = ∑_{j=1}^{M} β_j ϕ(x_t′γ_j),  t = q + 1, …, N,  (2.3)

for each value of M.
As in Ríos Insua and Müller (1998), we assume a normal/inverse gamma prior,

β_j ∼ N(µ_β, σ_β²),  λ ∼ N(µ_λ, σ_λ² I),  γ_j ∼ N(µ_γ, Σ_γ),  σ² ∼ InvGamma(a_σ, b_σ).  (2.4)

When there is nonnegligible uncertainty about the prior hyperparameters, we may extend the prior model with additional hyperpriors. We shall use the following standard conjugate choices in hierarchical models:

µ_β ∼ N(a_{µβ}, b_{µβ}),  σ_β² ∼ InvGamma(a_{σβ}, b_{σβ}),
µ_λ ∼ N(a_{µλ}, b_{µλ}),  σ_λ² ∼ InvGamma(a_{σλ}, b_{σλ}),
µ_γ ∼ N(a_{µγ}, b_{µγ}),  Σ_γ ∼ InvWishart(a_λ, b_λ).  (2.5)
Hyperparameters are a priori independent. Given hyperparameters, parameters are a priori independent. Since the likelihood is invariant with respect to relabelings, we include an order constraint to avoid trivial posterior multimodality due to index permutation. For example, we may use γ_{1q} ≤ γ_{2q} ≤ · · · ≤ γ_{Mq}.

3 MCMC Posterior Inference with a Fixed Model

Consider first the mixed model 2.1 with fixed M. The complete likelihood for a given data set D = {y_1, y_2, …, y_N} is

p(D | λ, β, γ, σ²) = p(y_1, …, y_q | λ, β, γ, σ²) p(D′ | y_1, …, y_q, λ, β, γ, σ²),

where D′ = {y_{q+1}, …, y_N} and

p(D′ | y_1, …, y_q, λ, β, γ, σ²) = ∏_{t=q+1}^{N} p(y_t | y_{t−1}, y_{t−2}, …, y_{t−q}, λ, β, γ, σ²)

is the conditional likelihood given the first q values. From here on, we will make inference conditioning on the first q values, assuming they are known without uncertainty (alternatively, we could include an informative prior over the first q values in the model and perform inference with the complete likelihood). Together with the prior assumptions 2.4 and 2.5, the joint posterior distribution is given by

p(λ, β, γ, σ², χ | D′) ∝ p(y_{q+1}, …, y_N | y_1, …, y_q, λ, β, γ, σ²) p(λ, β, γ, σ², χ) M!,  (3.1)
where

p(λ, β, γ, σ², χ) = p(µ_λ, σ_λ², µ_β, σ_β², µ_γ, Σ_γ) p(σ²) p(λ | µ_λ, σ_λ² I) p(β | µ_β, σ_β² I) ∏_{i=1}^{M} p(γ_i | µ_γ, Σ_γ)

is the joint prior distribution, χ = (µ_λ, σ_λ², µ_β, σ_β², µ_γ, Σ_γ) is the set of hyperparameters, and M! appears because of the order constraint on γ. As in Müller and Ríos Insua (1998), we propose a hybrid, partially marginalized MCMC posterior sampling scheme to implement inference in the fixed-architecture mixed model, equation 2.1. We use a Metropolis step to update the input-to-hidden weights γ_j using the marginal likelihood over (β, λ), p(D′ | y_1, …, y_q, γ, σ²), to partly avoid the random walk nature of the Metropolis algorithm:

1. Given current values of χ and σ² (β and λ are marginalized), for each γ_j, j = 1, …, M, generate a proposal γ̃_j ∼ N(γ_j, cΣ_γ) and calculate the acceptance probability

a = min{1, [p(γ̃ | µ_γ, Σ_γ) p(D′ | y_1, …, y_q, γ̃, σ²)] / [p(γ | µ_γ, Σ_γ) p(D′ | y_1, …, y_q, γ, σ²)]},

where γ̃ = (γ_1, …, γ_{j−1}, γ̃_j, γ_{j+1}, …, γ_M). With probability a, replace γ by γ̃, rearranging indices if necessary to satisfy the order constraint; otherwise, leave γ_j unchanged.

2. Generate new values for the parameters by drawing from their full conditional posteriors:
• β from p(β | D′, γ, λ, σ², χ), a multivariate normal distribution;
• λ from p(λ | D′, γ, β, σ², χ), a multivariate normal distribution;
• σ² from p(σ² | D′, γ, β, λ), an inverse gamma distribution.

3. Given current values of γ, β, λ, σ², generate a new value for each hyperparameter by drawing from its complete conditional posterior distribution:
• µ_β from p(µ_β | D′, β, σ_β²), a normal distribution;
• σ_β² from p(σ_β² | D′, β, µ_β), an inverse gamma distribution;
• µ_λ from p(µ_λ | D′, λ, σ_λ²), a multivariate normal distribution;
• σ_λ² from p(σ_λ² | D′, λ, µ_λ), an inverse gamma distribution;
• µ_γ from p(µ_γ | D′, γ, Σ_γ), a multivariate normal distribution;
• Σ_γ from p(Σ_γ | D′, γ, µ_γ), an inverse Wishart distribution.

A similar sampling scheme can be used for the neural net model without the linear term, 2.3, but the likelihood marginalization would be just over β. Posterior inference with the normal linear autoregression model 2.2 is straightforward (see Gamerman, 1997).

4 Modeling Uncertainty About the Architecture

We now extend the model to include inference about model uncertainty. In fact, the posterior distribution 3.1 should be written including a reference to the model k considered, p(λ, β, γ, σ², χ | D′, k) = p(θ_k | D′, k), where θ_k = (λ, β, γ, σ², χ) represents the parameters and hyperparameters in model k. We index models with a pair of indices m_{hk}, where h = 0 (1) indicates absence (presence) of the linear term, and k = 0, 1, … indicates the number of hidden nodes in the NN term. Therefore, m_{10} is the linear autoregression model (see equation 2.2), m_{0k} is the FFNN model (see equation 2.3) with k hidden nodes, and m_{1k} is the mixed model (see equation 2.1) with k hidden nodes. Note that the models m_{0k}, k ≥ 1, are nested, as are the m_{1k}, k ≥ 1. It would be possible to think of the linear model m_{10} as a degenerate case of the mixed model m_{11} when β = 0, and of the models m_{0k}, k ≥ 1, as degenerate m_{1k}, k ≥ 1, when λ = 0, and finally to consider all the models above as nested. However, given our model exploration strategy outlined below, we prefer to view them as nonnested. We wish to design moves between models to get good coverage of the model space (see Figure 1). When dealing with nested models, it is common to add or delete model components, using add/delete or split/combine move pairs (Green, 1995; Richardson & Green, 1997). Similarly, we could define two reversible-jump pairs: add/delete an arbitrary node selected at random and add/delete the linear term.
With such a strategy, described in Müller and Ríos Insua (1998), it would be possible to reach any model from any other model. However, our experience with time-series data shows that the acceptance rate of such an algorithm is low, that the model space is not adequately covered, and that there are cases of nonconvergence. This happened, for example, when we tried to model the lynx time-series and airline passengers data sets used in section 5 as numerical examples of our methodology.
Figure 1: Possible models along with moves available from each model.
To avoid this, let us introduce the notion of LID (linearized, irrelevant, and duplicate) nodes. Consider a node (β_j, γ_j) with γ_j = (γ_{0j}, γ_{1j}, …, γ_{qj}) in a model m_{0k} or m_{1k}. We call it:

Definition 1. (Approximately) linearized if its output β_j ϕ(x_t′γ_j) is nearly a hyperplane in the input-output space for the input data range.

Definition 2. (Approximately) irrelevant if β_j ≈ 0.

Definition 3. We say nodes (β_j, γ_j) and (β_k, γ_k) are (approximately) duplicate nodes if ‖γ_j − γ_k‖ ≈ 0.

Note that a model with an (approximately) irrelevant node has (almost) identical likelihood to a model without it. A model with (approximately) duplicate nodes will perform similarly to a model with a single node (β_j + β_k, γ_j) instead of them. Finally, the linear behavior of a linearized node may be assimilated into a linear term or another linearized node. The appearance of LID nodes may therefore cause problems of model identifiability and multimodality in model space. Recall that the acceptance probability of a proposed model is computed using the marginal likelihood over β and/or λ whenever possible. This marginalization accelerates convergence, but γ is not easily marginalized. Thus, any proposed model structure change is evaluated with the currently imputed value of γ. Adding irrelevant, duplicate, or linearized nodes implies less structural change, so that in our simulations it was more common to accept add moves that actually add LID nodes. Vice versa, a jump to a simpler model that proposes an arbitrary node for deletion is very unlikely to be accepted. In fact, it is usually easier for a new LID node to be added before old ones are deleted. This motivates an MCMC scheme that includes add/delete moves for LID nodes in addition to a standard reversible-jump scheme. Once moves to delete LID nodes are defined, we shall propose the corresponding add nodes
moves. Note that whereas deleting LID nodes would mitigate our problem, adding LID nodes would seem to complicate our scheme, following the above discussion. In principle, we could propose these add-LID-node moves with low probability. However, besides ensuring balance, the add moves have a useful side effect: with them, we control when and how an LID node is added. Adding an LID node has a higher probability of being accepted than adding an arbitrary node because it usually implies a smaller structural change. We expect that once a model with a new LID node becomes the current model, subsequent iterations will delete some other LID nodes. In this way, more complex models have a chance of being visited, which helps to obtain wide coverage of the model space. In a sense, this is related to global search optimization techniques, like simulated annealing (Geman & Geman, 1984), in which a worse solution is sometimes proposed to escape from a local minimum, with the hope of reaching a better, possibly global, minimum from it. As Bayesian inference implicitly embodies Occam's razor and clever delete moves are defined, we expect simpler models to be more common. In the rest of this section, we show how to characterize a linearized node so as to generate it at random or propose it for deletion. We also introduce modifications to a basic thin/seed move to deal with duplicate nodes, and we put an ad hoc distribution over nodes to increase the probability that an irrelevant node is proposed for deletion. Note that there is a decision to make about how deterministic a move should be. For example, we could look deterministically for two (approximately) duplicate nodes in the set of nodes and, if they exist, delete them; but in general, it is good to allow for some randomness in move definitions (Green, 1995). At the end of this section, a complete reversible-jump scheme is outlined. A full description of the add/delete moves is given in the appendixes.
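A rough numerical check for the three kinds of LID nodes might look as follows. The tolerance thresholds `rho` and `tol` are assumptions for illustration; the paper's moves use the characterizations developed in its appendixes rather than hard thresholds:

```python
import numpy as np

def classify_lid(beta, gamma, X, rho=0.5, tol=1e-2):
    """Flag nodes as (approximately) linearized / irrelevant / duplicate.
    beta: (M,), gamma: (M, q+1), X: (N, q+1) inputs with a leading 1."""
    Z = X @ gamma.T                                   # pre-activations, (N, M)
    linearized = np.abs(Z).max(axis=0) <= rho         # Definition 1 via 4.1
    irrelevant = np.abs(beta) <= tol                  # Definition 2
    M = len(beta)
    duplicate = np.zeros(M, dtype=bool)               # Definition 3
    for j in range(M):
        for k in range(j + 1, M):
            if np.linalg.norm(gamma[j] - gamma[k]) <= tol:
                duplicate[j] = duplicate[k] = True
    return linearized, irrelevant, duplicate
```
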
4.1 (Approximately) Linearized Nodes. To facilitate the explanation, consider an FFNN model in a regression problem:

y(x_1, …, x_q) = f(x_1, …, x_q) = ∑_{j=1}^{M} β_j ϕ(ζ_j),
ζ_j(x_1, …, x_q) = γ_{j0} + γ_{j1} x_1 + · · · + γ_{jq} x_q.

The net output y is given by a linear combination of hidden node outputs. Each hidden node output ϕ(ζ_j) = 1/(1 + exp(−ζ_j)), j = 1, …, M, represents a sigmoidal surface in R^{q+1}, which is approximately linear in its middle range (see Figure 2). For any data (x_{i1}, …, x_{iq}), i = 1, …, N, we may find weights γ_{j0}, γ_{j1}, …, γ_{jq} so that

−ρ ≤ ζ_j(x_{i1}, …, x_{iq}) ≤ ρ,  i = 1, …, N,  (4.1)
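One simple way to produce weights satisfying condition 4.1 is to draw an arbitrary direction and rescale it so that the pre-activations stay in [−ρ, ρ] over the observed inputs. This rescaling construction is our own sketch; the paper's actual generation scheme is given in its appendix A:

```python
import numpy as np

def linearized_weights(X, rho, rng):
    """Draw gamma_j with -rho <= x_i' gamma_j <= rho for all rows x_i of X,
    so the node works in the (approximately) linear zone of the sigmoid."""
    gamma = rng.normal(size=X.shape[1])
    zmax = np.abs(X @ gamma).max()
    if zmax > 0:
        gamma *= rho / zmax              # rescale so max |zeta_j| equals rho
    return gamma
```
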
Figure 2: Sigmoidal surface when the input space dimension is q = 2.
where ρ is small, so that ϕ(ζ_j) resembles a hyperplane in the input data domain. Note also that when ζ_j > η, say η ≥ 4, the sigmoid is saturated (see Figure 3) and ϕ(ζ_j) ≈ 1. It is possible to adjust the weights so that saturation is verified for the whole data set. When fitting an FFNN model, some hidden nodes will occasionally work in the linear zone, behaving approximately as linear regression terms, or in the saturation zone, behaving as a constant term. For example, in Figures 4 and 5, an FFNN with four nodes fits a cosine function. The joint output of three of the four nodes has a linear trend, but none of them is working in the linear zone. The fourth node helps to compensate for this linear trend so that the cosine is fitted. Also, when using mixed models, some nodes may add their linear behavior to the linear term. Appendix A describes computational aspects of the generation and deletion of linearized nodes.

4.2 (Approximately) Irrelevant Nodes. We let chance deal with the generation of an irrelevant node. We modify a basic birth-death move when selecting a node for deletion. We choose a node at random with probabilities

p_j = ψ_j / ∑_{k=1}^{M} ψ_k,  j = 1, …, M,  where ψ_j = exp{−β_j²/2},

so that more weight is given to those nodes whose β_j is almost zero. (Full details are given in section B.3.)
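In code, the selection probabilities might be computed as follows, with the weight function written as exp(−β_j²/2) for concreteness; the exact form and scale constant used in the paper are those of its section B.3:

```python
import numpy as np

def deletion_probs(beta):
    """Probability of proposing each node for deletion; near-zero beta_j
    gets more weight. The Gaussian-shaped weight exp(-beta_j^2 / 2) is an
    assumed concrete choice, not necessarily the paper's exact constant."""
    psi = np.exp(-0.5 * np.asarray(beta, dtype=float) ** 2)
    return psi / psi.sum()
```
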
Figure 3: When −ρ ≤ ζ ≤ ρ, the sigmoid function is well approximated by its tangent at ζ = 0.

Figure 4: Feedforward neural network with four nodes fitting a cosine function. Three of the four nodes fit the cosine waves with a global upward linear trend. The fourth node adds its linear behavior to compensate for the trend.
4.3 (Approximately) Duplicate Nodes. Consider two (approximately) duplicate nodes (β_j, γ_j) and (β_k, γ_k) in a model m_{0k} or m_{1k}. We could propose a simpler model, collapsing them into a new node (β_j + β_k, γ_j). The proposed add/delete pair is a basic thin/seed move with minor modifications.
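The delete (collapse) direction of this move can be sketched as follows; the tolerance `tol` used to detect near-duplicates is an assumed illustration, not a value from the paper:

```python
import numpy as np

def collapse_duplicates(beta, gamma, tol=1e-2):
    """Merge the first (approximately) duplicate pair (beta_j, gamma_j),
    (beta_k, gamma_k) into the single node (beta_j + beta_k, gamma_j)."""
    beta = np.asarray(beta, dtype=float)
    gamma = np.asarray(gamma, dtype=float)
    M = len(beta)
    for j in range(M):
        for k in range(j + 1, M):
            if np.linalg.norm(gamma[j] - gamma[k]) <= tol:
                new_beta = np.delete(beta, k)
                new_beta[j] = beta[j] + beta[k]   # combined output weight
                new_gamma = np.delete(gamma, k, axis=0)
                return new_beta, new_gamma
    return beta, gamma        # no duplicates found; model unchanged
```
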
Figure 5: The joint output of the four nodes in the neural net model resembles the target cosine function.
To add a duplicate node, take a node (β_j, γ_j) at random and introduce small perturbations to produce two (approximately) duplicate new nodes, substituting (β_j, γ_j) by (β̃_j, γ̃_j) and (β̃_{j+1}, γ̃_{j+1}) with

β̃_j = β_j(1 − v),  β̃_{j+1} = β_j v,  γ̃_j = γ_j,  γ̃_{j+1} = γ_j + δ.

The perturbations v and δ are generated from v ∼ Beta(2, 2) and δ ∼ N(0, cΣ_γ), where c is a small enough constant (say, c = 0.01). As we introduced an order constraint on γ, the proposed add move will be rejected if γ̃_j and γ̃_{j+1} do not satisfy the order constraint. Given the order constraint on γ, we expect that for a given (β_j, γ_j), the next node (β_{j+1}, γ_{j+1}) could be an almost duplicate version of it. We then propose removing (β_{j+1}, γ_{j+1}) and transforming (β_j, γ_j) as β̃_j = β_j + β_{j+1} and γ̃_j = γ_j. (See appendix B for full details.)

4.4 Our Reversible-Jump Scheme. We propose the following moves:

1. Add or delete the linear term (j_{1a}, j_{1d}).
2. Add or delete a linearized node (j_{2a}, j_{2d}).
3. Add or delete irrelevant nodes (j_{3a}, j_{3d}).
4. Add or delete duplicate nodes (j_{4a}, j_{4d}).

Not all of them are reachable from a given model. For example, we cannot delete a linear term when the current model is the linear model, or delete duplicate nodes if we have just one node (see Figure 1). Specifically:

• From the linear autoregression model m_{10}, the valid moves are j_{2a} and j_{3a}.
• From a feedforward neural net model with one node, m_{01}, the valid moves are j_{1a}, j_{2a}, j_{3a}, and j_{4a}.
• From a feedforward neural net model with k ≥ 2 nodes, m_{0k}, the valid moves are j_{1a}, j_{2a}, j_{2d}, j_{3a}, j_{3d}, j_{4a}, and j_{4d}.
• From a mixed model with k ≥ 1 nodes, m_{1k}, the valid moves are j_{1d}, j_{2a}, j_{2d}, j_{3a}, j_{3d}, j_{4a}, and j_{4d}.

We will assume that from a given model, all reachable moves are equally likely. Formally, unreachable moves are given zero probability. Note that the definition of each move depends on the current model from which to jump. For example, adding a linearized node when the current model is m_{10} involves proposing neural network parameters and hyperparameters absent in m_{10}. However, when the current model is m_{0k}, k ≥ 1, adding a linearized node is a jump between nested models with shared parameters and hyperparameters. Shared parameters and hyperparameters can be proposed with the same values as in the previous model, but unshared parameters and hyperparameters have to be proposed from scratch, for example, from their prior distribution. This usually implies less of a chance for the move to be accepted. It would be useful for convergence purposes to generate unshared parameters using proposals centered at the values they had the last time the model was visited. We shall not deal with this matter here and will use priors as proposals. Similar comments apply to the other moves. (See appendix B for a complete description.) The general reversible-jump posterior inference scheme is as follows:

1. Start with an initial model m_{hk}, h ∈ {0, 1}, k = 0, 1, …,
and initial values for its parameters (and hyperparameters if needed), for example, the prior means. Until convergence is achieved, iterate through steps 2 to 4:

2. With probability p_1 (say, p_1 = 0.5), decide to stay within the current model; otherwise (with probability 1 − p_1), decide to move to another model.

3. If staying in the current model, perform the MCMC scheme described in section 3.

4. If moving to another model, select at random a move from the list of reachable moves, with probabilities assigned as mentioned, and propose a new model accordingly. If accepted, the new model becomes the current model.
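The outer loop of this scheme can be sketched as follows; `within_model_step`, `propose_move`, and `accept` are placeholders standing in for the updates of section 3 and the moves of appendix B, not implementations from the paper:

```python
import random

def reversible_jump(state, n_iter, reachable_moves, within_model_step,
                    propose_move, accept, p_stay=0.5, seed=0):
    """Skeleton of the scheme in section 4.4 (steps 2-4).
    `reachable_moves(state)` lists the moves valid from the current model."""
    rng = random.Random(seed)
    for _ in range(n_iter):
        if rng.random() < p_stay:            # step 3: stay, update parameters
            state = within_model_step(state)
        else:                                # step 4: propose a model jump
            move = rng.choice(reachable_moves(state))   # equally likely
            proposal = propose_move(state, move)
            if accept(state, proposal, rng):
                state = proposal
    return state

# Toy illustration: the "state" is just a node count; every jump adds a node.
final = reversible_jump(
    state=0, n_iter=20,
    reachable_moves=lambda s: ["j2a", "j3a"],
    within_model_step=lambda s: s,
    propose_move=lambda s, m: s + 1,
    accept=lambda s, p, r: True)
```
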
Figure 6: Annual number of lynx trappings in the Mackenzie River District of Northwest Canada, 1821–1934 (Priestley, 1988).
5 Examples

We illustrate the methodology with two well-known time-series data sets and two simulated examples.

5.1 Lynx Data. First, we consider time-series data giving the annual number of lynx trappings in the Mackenzie River District of Northwest Canada for the period 1821 to 1934 (Priestley, 1988; see Figure 6). In neural network applications, the data set is often split into two subsets: one for estimation and the other for validation. We split the lynx data set into a training data set (the first 72 observations) and a test data set (the last 38 observations). Prior distributions for the unknown parameters and hyperparameters in the model are chosen as described in section 2. As for the choice of the hyper-hyperparameters, our experience shows that at that level of the hierarchical model, the results do not change much when using different values. Here,
Figure 7: Lynx data. Trace plots of the Markov chains for the number of hidden nodes, from two well-dispersed starting points, M = 0 and M = 15.
we used the following choices for the hyperparameter distributions:

µ_β ∼ N(0, 3),  σ_β² ∼ InvGamma(9, 1),
µ_λ ∼ N((0, 0, 0), 3I),  σ_λ² ∼ InvGamma(9, 1),
µ_γ ∼ N((0, 0, 0), 3I),  Σ_γ ∼ InvWishart(10, 2.5I),
σ² ∼ InvGamma(1, 1).

Also, the unknown number of nodes M is given a geometric prior distribution with parameter α = 1/4, which implies a prior mean E(M) = 4. Two runs were carried out from different starting points, M = 0 and M = 15. A burn-in of 1000 iterations was used, and then 9000 additional iterations were monitored for inference purposes. Figures 7 and 8 show trace plots of the Markov chains for the number of nodes in the FFNN and a histogram of the posterior distribution of M, respectively, suggesting M = 2 as the most likely number of nodes for the hidden layer of the FFNN term, although M = 1, 3, 4 received nonnegligible probability. In any case, the analysis clearly indicates the presence of important nonlinearity. Finally, Figure 9 shows the observed values (+) (note that the time series has been log-transformed and detrended), the one-step-ahead forecast values (solid line), and a 90% probability interval (dash-dot line), showing good performance for our LID-RJMCMC scheme. For comparison, we show the predictions obtained with the unscented particle filter algorithm proposed by van der Merwe, Doucet, de Freitas, and Wan (2000; dotted line). Their algorithm consists of a particle filter (PF) that uses an unscented Kalman filter (UKF) to generate the importance proposal distribution. Our method seems to perform similarly to the PF-UKF algorithm.

5.2 Airline Data. As a second example, we consider the well-known international airline data of Box and Jenkins (1970). This time series is one of the standard motivating examples for seasonal ARIMA models. It was modeled by
Bayesian Analysis of Nonlinear Autoregression Models
Figure 8: Lynx data. Histogram of the posterior distribution of M, suggesting M = 2 nodes for the hidden layer of the FFNN term.
Box and Jenkins as ARIMA(0,1,1)(0,1,1)₁₂; it is usually known as the airline model because of this time series. Exploratory data analysis suggests seasonal behavior of the time series, with similar properties exhibited every 12 observations (every 12 months). We shall therefore model each yt on the basis of the immediately past values yt−1, yt−2, as well as the corresponding seasonal values yt−12, yt−13. We initialized the MCMC algorithm at two well-dispersed values for the number of hidden nodes M and let the reversible-jump MCMC algorithm described run for a sufficient number of iterations: 2000 burn-in and another 18,000 iterations for making inferences. Comments similar to those about prior choices in the previous example apply here. Figure 10 shows the histogram of the posterior distribution of M; in this case, M = 3 seems to be the most likely number of nodes, but the probabilities for M = 2, 4, 5 are also quite important. It is of interest to look at the behavior of the reversible-jump algorithm. Table 1 shows the number of times each move has been proposed and accepted. Note that acceptance rates are considerably larger for irrelevant nodes than for linearized or duplicate nodes, due, perhaps, to the simplicity of the move.
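The lag structure just described (regular lags 1 and 2 plus seasonal lags 12 and 13) amounts to building a design matrix of lagged values; the sketch below uses a synthetic series as a stand-in for the 144 monthly log airline observations.

```python
import numpy as np

def lagged_design(y, lags=(1, 2, 12, 13)):
    """Row t of X holds (y[t-1], y[t-2], y[t-12], y[t-13]) for target y[t]."""
    p = max(lags)
    X = np.column_stack([y[p - lag : len(y) - lag] for lag in lags])
    return X, y[p:]

y = np.log(np.arange(1, 145, dtype=float))  # stand-in for 144 monthly observations
X, target = lagged_design(y)
print(X.shape, target.shape)  # (131, 4) (131,)
```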
A. Menchero, R. Montes Diez, D. Ríos Insua, and P. Müller
Figure 9: Predicted values for the Lynx time series (log transformed and detrended): true y (+), PF-UKF estimate, Bayes-FFNN estimate, and 95% probability interval.
For the airline time series, we also consider fitting both a classical ARIMA model, as presented by Box and Jenkins (1970), and an FFNN with three hidden nodes. Figure 11 presents prediction values (solid line) and 95% intervals (dash-dot line) for our LID-RJMCMC scheme, as well as a comparison with the ARIMA(0,1,1)(0,1,1)₁₂ model (dashed line) and FFNN(3) (dotted line), showing the superiority of neural networks in contrast to classical

Table 1: Number of Times Proposed and Accepted for the Different Moves.

Move    Times Proposed    Times Accepted
j1a            1                 1
j1d          840                 1
j2a          483               130
j2d          831                36
j3a          731               229
j3d          823               396
j4a          428                68
j4d          837                 7
Figure 10: Airline data. Histogram of the posterior distribution of M, showing M = 3 as the most likely number of nodes for the hidden layer of the FFNN term.
ARIMA models. Note that our results are very similar to those obtained from the classical FFNN with a fixed number of nodes. Yet the main point we are trying to make is that with our algorithm, model selection is performed automatically, without the need to decide on a fixed architecture in advance. Table 2 compares the one-step-ahead mean square errors (MSE) obtained with each method. Again, our method leads to slightly better performance than standard methods such as classical ARIMA and FFNN.

Table 2: One-Step-Ahead Mean Square Errors.

Method    LID       ARIMA     FFNN(3)
MSE       0.0399    0.0455    0.0438

Figure 11: Predicted values for the airline time series: true y, Bayes-NN estimate, 95% probability interval, ARIMA, and FFNN(3).

5.3 Simulated Examples. Finally, we apply our LID-RJMCMC scheme to simple linear and nonlinear simulated processes. Here, we are interested in seeing how our method performs for simple standard processes, detecting them and not producing overly complex models. We generate two simple time series: a linear stationary AR(1) process and a nonlinear process generated by an FFNN with one node in the hidden layer. Table 3 shows the posterior distributions for the indexes H (indicator of the linear term) and M (number of hidden nodes), after carrying out 20,000 MCMC iterations. Despite the complicated model, the scheme is able to identify the simpler standard autoregressive linear model (AR(1)) or neural network (FFNN(1)), according to Occam's principle of preferring simpler models.
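The two simulated processes can be generated as sketched below; the AR coefficient and the FFNN weights are illustrative assumptions (the paper does not list the values used), and tanh stands in for the sigmoidal node function.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Linear stationary AR(1): y_t = 0.6 * y_{t-1} + eps_t (coefficient assumed).
ar = np.zeros(n)
for t in range(1, n):
    ar[t] = 0.6 * ar[t - 1] + rng.normal(scale=0.5)

# Nonlinear series from a one-node FFNN: y_t = beta * tanh(g0 + g1 * y_{t-1}) + eps_t.
beta, g0, g1 = 2.0, 0.1, 1.5  # assumed weights
nn = np.zeros(n)
for t in range(1, n):
    nn[t] = beta * np.tanh(g0 + g1 * nn[t - 1]) + rng.normal(scale=0.5)
```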
Table 3: Posterior Distributions for Indexes H and M.

                  H                     M
             0      1       0      1      2      3      4
AR(1)        0.00   1.00    0.87   0.12   0.01   0.00   0.00
FFNN(1)      0.62   0.38    0      0.83   0.11   0.06   0.00
6 Conclusion

We have presented a reversible-jump algorithm for the analysis of time-series data through FFNNs viewed as nonlinear autoregression models. The advantages of the Bayesian approach outweigh its additional computational effort: as no local optimization algorithms are used, local minima issues are mitigated, and model selection is performed as part of the Bayesian methodology, without the need to decide explicitly on a range of models to select from and to use ad hoc methods to rank them. Bayesian analysis naturally embodies Occam's razor, the principle of preferring simpler models to complex ones. It is also possible to include a priori information about parameters and the number of hidden nodes in accordance with this principle. The final product of a Bayesian analysis is a predictive distribution that averages over all possible models and parameter values, so uncertainty about predictions is estimated and the risk of overfitting is reduced.

Many issues remain to be explored. An obvious extension is the application of the reversible-jump algorithm based on LID nodes to other standard neural network applications such as regression, classification, or density estimation. Also, we have confined this analysis to FFNNs, but many other neural network models, analyzed from a Bayesian point of view, might prove useful. Regarding time-series analysis, we have assumed the autoregressive order q to be known in advance. We could also model uncertainty about q as a problem of model selection.

Appendix A: Linearized Nodes: Technical Details

Comments on section 4.1 emphasize the need to add and remove linearized nodes to get good coverage of the model space. When adding a linearized node, we have to randomly generate weights γj0, γj1, …, γjq so that the node behaves in the linear zone for the given data set. To show how to achieve this, consider the case in which q = 2.
For a given node (γ0, γ1, γ2), ζ = γ0 + γ1x1 + γ2x2 belongs to a family of straight lines with slope −γ2/γ1. This node is linearized if equation 4.1 is verified. Rewrite equation 4.1 as

−ρ − γ0 ≤ z(xi1, xi2) ≤ ρ − γ0,  ∀(xi1, xi2), i = 1, …, N,   (A.1)

where z(x1, x2) = γ1x1 + γ2x2 = κ(s1x1 + v s2x2), with

si = sgn(γi), i = 1, 2,   κ = |γ1|,   v = |γ2|/|γ1|.
For some data points, z is maximized or minimized (see Figure 12), so
Figure 12: To find zmin and zmax, we force −ρ − γ0 ≤ zmin ≤ zmax ≤ ρ − γ0 over a rectangle [x1^min, x1^max] × [x2^min, x2^max] containing all input data points.
the above condition can be summarized as

−ρ − γ0 ≤ zmin ≤ zmax ≤ ρ − γ0,   (A.2)

where zmin = z(x1^zmin, x2^zmin) and zmax = z(x1^zmax, x2^zmax). To find zmin and zmax, it is easier to force equation A.2 over a rectangle [x1^min, x1^max] × [x2^min, x2^max] containing all input data points:

x1^min = min_i(xi1),  x1^max = max_i(xi1),
x2^min = min_i(xi2),  x2^max = max_i(xi2).

We may then write

xi^zmax = (1/2)(1 + sgn(γi)) xi^max + (1/2)(1 − sgn(γi)) xi^min,  i = 1, 2,
xi^zmin = (1/2)(1 + sgn(γi)) xi^min + (1/2)(1 − sgn(γi)) xi^max,  i = 1, 2,

and

zmax = γ1 x1^zmax + γ2 x2^zmax = κ(s1 x1^zmax + v s2 x2^zmax),
zmin = γ1 x1^zmin + γ2 x2^zmin = κ(s1 x1^zmin + v s2 x2^zmin).
Let

L = zmax − zmin = κ/δ,   where δ = 1 / [(x1^max − x1^min) + v(x2^max − x2^min)].

Then rewrite equation A.2 as

L ≤ 2ρ,
−ρ − γ0 + L ≤ zmax ≤ ρ − γ0.   (A.3)
We have three degrees of freedom with which to verify equation A.3. Setting v = |γ2|/|γ1| and the signs si, i = 1, 2, fixes the slope of the straight line z, which fixes the orientation of the sigmoidal surface. Once this is defined, we can take a value for κ so that L ≤ 2ρ, fixing the steepness of the sigmoidal surface. Finally, we should take γ0 so that −ρ − γ0 + L ≤ zmax ≤ ρ − γ0, which locates the sigmoidal surface. Rewrite δ = u1 / (x1^max − x1^min) for some u1, and

v = [(1 − u1)/u1] · (x1^max − x1^min) / (x2^max − x2^min),
L = (κ/u1)(x1^max − x1^min) = 2ρb,
κ = 2ρb u1 / (x1^max − x1^min),
γ0 = −ρ + L − zmax + u0(2ρ − L).

These new quantities may be interpreted as follows:

• u1 is related to the orientation of the sigmoidal surface, and therefore to the slope of the straight lines z. In fact, (1 − u1)/u1 should take values in [0, ∞).
• b is related to the width L of the interval (zmin, zmax), which is related to the steepness of the sigmoidal surface. It should take values in (0, 1).
• u0 is related to the location of the sigmoidal surface, given by how the interval (zmin, zmax) is placed within the interval (−ρ − γ0, ρ − γ0). It should take values in (0, 1).

Let us rewrite γ0, γ1, and γ2 in terms of the above quantities:

γ1 = s1 · 2ρb u1 / (x1^max − x1^min),
γ2 = s2 · 2ρb (1 − u1) / (x2^max − x2^min),
γ0 = 2ρu0(1 − b) + ρ(2b − 1) − (γ1 x1^zmax + γ2 x2^zmax).

Suppose now that we randomly generate weights (γ0, γ1, γ2) from a multivariate normal distribution, accepting those verifying equation A.3. The sample distribution of these weights has no recognizable shape, but the distributions of the other quantities above are suggested by their histograms:

p(si = +1) = p(si = −1) = 0.5;  b ∼ Beta(2, 2);  u1 ∼ U(0, 1);  u0 ∼ U(0, 1).
We can then generate random weights corresponding to linearized nodes. The generalization to q > 2 is straightforward:

γi = 2ρϕi / (xi^max − xi^min),  i = 1, …, q,
γ0 = 2ρu0(1 − ϕ0) + ρ(2ϕ0 − 1) − (γ1 x1^zmax + ⋯ + γq xq^zmax),

with ϕ0 = Σ_{i=1}^q |ϕi| and xi^zmax defined as above. To generate ϕi, i = 1, …, q, first generate

ξ1, …, ξq−1 ∼ Dirichlet(1, 1, …, 1),  ξq = 1 − Σ_{i=1}^{q−1} ξi,
b ∼ Beta(q, 2),
p(si = +1) = p(si = −1) = 1/2,  i = 1, …, q,
u0 ∼ U(0, 1),

and define ϕi = b si ξi. To propose a linearized node for deletion, we simply choose a node at random. Instead of finding out deterministically which node, if any, is a linearized one, we allow some randomness in the move definition. Once a node is selected, we recover the quantities defined previously:

ξi = |ϕi| = |γi|(xi^max − xi^min) / (2ρ),  i = 1, …, q,
u = [γ0 + (γ1 x1^zmax + ⋯ + γq xq^zmax) − ρ(2ϕ0 − 1)] / [2ρ(1 − ϕ0)],
b = ϕ0 = Σ_{i=1}^q |ϕi|.
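The weight-generation recipe for general q can be sketched as follows. By construction, the resulting node satisfies |γ0 + γ·x| ≤ ρ on the bounding rectangle of the data, which the sketch checks empirically; sampling all q simplex coordinates from a single symmetric Dirichlet is an implementation shortcut for the ξ step.

```python
import numpy as np

def linearized_node(X, rho, rng):
    """Sample weights (g0, g) so the node stays in the linear zone on the data X."""
    q = X.shape[1]
    x_min, x_max = X.min(axis=0), X.max(axis=0)
    s = rng.choice([-1.0, 1.0], size=q)        # random signs s_i
    b = rng.beta(q, 2)                         # controls steepness: L = 2*rho*b
    xi = rng.dirichlet(np.ones(q))             # point on the simplex (shortcut)
    u0 = rng.uniform()                         # locates the surface
    phi = b * s * xi
    g = 2 * rho * phi / (x_max - x_min)
    x_zmax = np.where(phi > 0, x_max, x_min)   # maximizer of g.x on the rectangle
    phi0 = np.abs(phi).sum()                   # equals b
    g0 = 2 * rho * u0 * (1 - phi0) + rho * (2 * phi0 - 1) - g @ x_zmax
    return g0, g

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3))
rho = 0.5
g0, g = linearized_node(X, rho, rng)
print(np.abs(g0 + X @ g).max() <= rho + 1e-9)  # True: node operates in the linear zone
```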
Should the selected node be approximately linearized, the following conditions would be verified:

Σ_{i=1}^q ξi = 1 and ξi < 1, ∀i = 1, …, q,
u ∈ [0, 1],
b ∈ [0, 1].

The density of any of these parameters (see appendix B) would be zero if they do not satisfy these conditions. Hence, the probability of deleting a nonlinearized node with this move is zero; the move would be rejected.

Appendix B: LID Reversible Jumps

This appendix presents the mathematical details of the reversible-jump MCMC scheme for models containing both a linear term and an FFNN term, Linear + NN(m). Throughout this appendix, we use the following notation:

cur: current model
pro: proposed model
θ: parameters in the current model
φ: parameters in the proposed model
uθ: parameters to be added by the move
uφ: parameters to be deleted by the move

According to the general reversible-jump MCMC scheme, proposed models are accepted with probability

α = min{1, [p(y | φ)/p(y | θ)] · [p(φ | pro)/p(θ | cur)] · [p(pro)/p(cur)] · [q(uφ | pro)/q(uθ | cur)] · [p_pro→cur/p_cur→pro] · J},

where

p(y | φ)/p(y | θ) is the likelihood ratio,
p(φ | pro)/p(θ | cur) is the prior ratio for parameters given the models,
p(pro)/p(cur) is the prior ratio for models,
q(uφ | pro)/q(uθ | cur) is the proposal ratio,
p_pro→cur/p_cur→pro is the ratio of transition probabilities, and
J is the Jacobian of the transformation φ = g(θ, uθ).
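In practice, the six factors of the acceptance probability are combined on the log scale for numerical stability; the sketch below is a generic helper, and the input values in the usage line are hypothetical.

```python
import math

def rj_log_accept(log_lik_ratio, log_prior_param_ratio, log_prior_model_ratio,
                  log_proposal_ratio, log_move_prob_ratio, log_jacobian):
    """alpha = min(1, product of the six factors), computed on the log scale."""
    log_r = (log_lik_ratio + log_prior_param_ratio + log_prior_model_ratio
             + log_proposal_ratio + log_move_prob_ratio + log_jacobian)
    return min(1.0, math.exp(min(log_r, 0.0)))

# Hypothetical numbers: a jump whose likelihood ratio is e**1.2 and whose
# remaining factors are all neutral (log ratio 0) is accepted with certainty.
alpha = rj_log_accept(1.2, 0.0, 0.0, 0.0, 0.0, 0.0)
print(alpha)  # 1.0
```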
Note the following:

• For simplicity, we shall assume all possible models to be equiprobable, so that p(pro)/p(cur) = 1 for all the jumps considered.
• The extension to models without the linear term, NN(m), is straightforward.
• The likelihoods p(y | φ) and p(y | θ) may be substituted by marginal likelihoods in β.
• The rearrangement of γ to satisfy the order constraint should be done once the jump between models has been accepted, to save computational time.

B.1 Add/Delete Linear Term. We propose this move with probability B1.

B.1a Add Linear Term. NN(m) → Linear + NN(m), m ≥ 1:

(θ, uθ) → (φ, uφ),
θ = {β1, β2, …, βM, γ1, γ2, …, γM, σ², µβ, σβ², µγ, Σγ},
uθ = {λ, µλ, Σλ},
φ = {β1, β2, …, βM, γ1, γ2, …, γM, σ², µβ, σβ², µγ, Σγ, λ, µλ, Σλ}.

The Jacobian of the one-to-one transformation φ = g(θ, uθ) is J = 1. The proposal is given by

µλ ∼ N(mλ, Sλ),
Σλ ∼ IWishart(wλ, Ωλ),
λ ∼ N(µλ, Σλ).

Thus,

p(φ | pro) / p(θ | cur) = p(λ | µλ, Σλ) p(µλ | mλ, Sλ) p(Σλ | wλ, Ωλ),
p_pro→cur / p_cur→pro = B1d / B1a = 1, assuming B1d = B1a = B1/2,

q(uφ | pro) / q(uθ | cur) = 1 / [N(λ | µλ, Σλ) N(µλ | mλ, Sλ) IW(Σλ | wλ, Ωλ)],

so that the acceptance probability is given by α = min{1, p(y | φ)/p(y | θ)}.

B.1b Delete Linear Term. Linear + NN(m) → NN(m), m ≥ 1:

(θ, uθ) → (φ, uφ),
θ = {β1, β2, …, βM, γ1, γ2, …, γM, σ², µβ, σβ², µγ, Σγ, λ, µλ, Σλ},
φ = {β1, β2, …, βM, γ1, γ2, …, γM, σ², µβ, σβ², µγ, Σγ},
uφ = {λ, µλ, Σλ}.

The Jacobian of the one-to-one transformation (φ, uφ) = g(θ) is J = 1, and

p(φ | pro) / p(θ | cur) = 1 / [p(λ | µλ, Σλ) p(µλ | mλ, Sλ) p(Σλ | wλ, Ωλ)],

p_pro→cur / p_cur→pro = B1a / B1d = 1, assuming B1d = B1a = B1/2,

q(uφ | pro) / q(uθ | cur) = N(λ | µλ, Σλ) N(µλ | mλ, Sλ) IW(Σλ | wλ, Ωλ),

so that the acceptance probability is given by α = min{1, p(y | φ)/p(y | θ)}.

B.2 Add/Delete Linearized Node. We propose this move with probability B2.

B.2a Add Linearized Node. Linear + NN(m) → Linear + NN(m + 1), m ≥ 1:

(θ, uθ) → (φ, uφ),
θ = {β1, β2, …, βM, γ1, γ2, …, γM, σ², µβ, σβ², µγ, Σγ},
uθ = {ε, ϕ1, ϕ2, …, ϕp, u},
φ = {β̃1, β̃2, …, β̃M, β̃M+1, γ̃1, γ̃2, …, γ̃M, γ̃M+1, σ̃², µ̃β, σ̃β², µ̃γ, Σ̃γ}.

The transformation φ = g(θ, uθ) is defined by

β̃j = βj,  γ̃j = γj,  j = 1, …, M,
β̃M+1 = ε,
γ̃M+1,i = 2ρϕi / (xi^max − xi^min),  i = 1, …, p,
γ̃M+1,0 = 2ρu(1 − ϕ0) + ρ(2ϕ0 − 1) − (γ̃M+1,1 x1^zmax + ⋯ + γ̃M+1,p xp^zmax),
σ̃² = σ²,  µ̃β = µβ,  σ̃β² = σβ²,  µ̃γ = µγ,  Σ̃γ = Σγ,

with ϕ0 = Σ_{i=1}^p |ϕi| and

xi^zmax = (1/2)(1 + sgn(ϕi)) xi^max + (1/2)(1 − sgn(ϕi)) xi^min,  i = 1, …, p.

The Jacobian of the transformation is

J = (2ρ)^p (1 − ϕ0) / Π_{i=1}^p (xi^max − xi^min).

The proposal is given by

u ∼ U(0, 1),
ε ∼ N(µβ, σβ²),
b ∼ Beta(p, 2),
ξ1, …, ξp−1 ∼ Dirichlet(1, 1, …, 1),
p(si = +1) = p(si = −1) = 1/2,  i = 1, …, p.

Then define ϕi = si b ξi, i = 1, …, p,
and it can be shown that

fϕ(ϕ1, …, ϕp) = (1/2^p)(p + 2)(1 − Σ_{i=1}^p |ϕi|),  for 0 < |ϕi| < 1 − Σ_{k≠i} |ϕk|, i = 1, …, p.

Thus,

p(φ | pro) / p(θ | cur) = p(β̃M+1 | µβ, σβ²) p(γ̃M+1 | µγ, Σγ) · (M + 1)!/M!,

p_pro→cur / p_cur→pro = [B2d / (M + 1)] / B2a = 1/(M + 1), assuming B2d = B2a = B2/2,

q(uφ | pro) / q(uθ | cur) = 1 / [(1/2^p)(p + 2)(1 − Σ_{i=1}^p |ϕi|) · N(ε | µβ, σβ²)].

B.2b Delete Linearized Node. Linear + NN(m + 1) → Linear + NN(m), m ≥ 1:

(θ, uθ) → (φ, uφ),
θ = {β1, β2, …, βM, βM+1, γ1, γ2, …, γM, γM+1, σ², µβ, σβ², µγ, Σγ},
φ = {β̃1, β̃2, …, β̃M, γ̃1, γ̃2, …, γ̃M, σ̃², µ̃β, σ̃β², µ̃γ, Σ̃γ},
uφ = {ε, ϕ1, ϕ2, …, ϕp, u}.

Select a node h = 1, …, M + 1 at random, take the nodes in θ numbered 1, …, h − 1, h + 1, …, M + 1, and relabel them as 1, …, M. The transformation (φ, uφ) = g(θ) is defined by

β̃j = βj,  γ̃j = γj,  j = 1, …, M,
σ̃² = σ²,  µ̃β = µβ,  σ̃β² = σβ²,  µ̃γ = µγ,  Σ̃γ = Σγ,
ε = βh,
ϕi = γh,i (xi^max − xi^min) / (2ρ),  i = 1, …, p,
u = [γh,0 + (γh,1 x1^zmax + ⋯ + γh,p xp^zmax) − ρ(2ϕ0 − 1)] / [2ρ(1 − ϕ0)],

with ϕ0 = Σ_{i=1}^p |ϕi| and

xi^zmax = (1/2)(1 + sgn(ϕi)) xi^max + (1/2)(1 − sgn(ϕi)) xi^min,  i = 1, …, p.

The Jacobian of the transformation is

J = Π_{i=1}^p (xi^max − xi^min) / [(2ρ)^p (1 − ϕ0)].

If γh is not a linearized node, then the acceptance probability is α = 0; otherwise,

p(φ | pro) / p(θ | cur) = [1 / (p(βh | µβ, σβ²) p(γh | µγ, Σγ))] · M!/(M + 1)!,
p(pro)/p(cur) = 1, assuming K equiprobable models, p(i) = 1/K, ∀i,
p_pro→cur / p_cur→pro = B2a / [B2d / (M + 1)] = M + 1, assuming B2d = B2a = B2/2,
q(uφ | pro) / q(uθ | cur) = Π_{i=1}^p I[0, 1 − Σ_{k≠i}|ϕk|](|ϕi|) · (1/2^p)(p + 2)(1 − Σ_{i=1}^p |ϕi|) · I[0,1](u) · N(ε | µβ, σβ²).

B.3 Add/Delete Irrelevant Node. We propose this move with probability B3.

B.3a Add Irrelevant Node. Linear + NN(m) → Linear + NN(m + 1), m ≥ 1:

(θ, uθ) → (φ, uφ),
θ = {β1, β2, …, βM, γ1, γ2, …, γM, σ², µβ, σβ², µγ, Σγ},
uθ = {βM+1, γM+1},
φ = {β1, β2, …, βM, βM+1, γ1, γ2, …, γM, γM+1, σ², µβ, σβ², µγ, Σγ}.

The proposal is given by

γM+1 ∼ N(µγ, Σγ),
βM+1 ∼ N(µβ, σβ²),

rearranging γ to satisfy the order constraint γ1,p < ⋯ < γM+1,p. The Jacobian of the one-to-one transformation φ = g(θ, uθ) is J = 1. Also,

p(φ | pro) / p(θ | cur) = p(βM+1 | µβ, σβ²) p(γM+1 | µγ, Σγ) · (M + 1)!/M!,
p(pro)/p(cur) = 1, assuming K equiprobable models, p(i) = 1/K, ∀i,
p_pro→cur / p_cur→pro = [B3d ψM+1 / Σ_{k=1}^{M+1} ψk] / B3a = ψM+1 / Σ_{k=1}^{M+1} ψk, assuming B3d = B3a = B3/2 (the weights ψk are defined in section B.3b),
q(uφ | pro) / q(uθ | cur) = 1 / [p(βM+1 | µβ, σβ²) p(γM+1 | µγ, Σγ)],

so that the acceptance probability is given by

α = min{1, [p(y | φ)/p(y | θ)] · (M + 1) ψM+1 / Σ_{k=1}^{M+1} ψk}.
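The ψ-weighted node selection used by the irrelevant-node moves can be sketched as follows; writing the kernel as a standard normal density of βj is an assumption, since the extraction leaves the kernel's scale ambiguous.

```python
import numpy as np

def deletion_probs(beta):
    """Probability of proposing each node for deletion: proportional to
    psi_j = N(beta_j | 0, 1), so nodes whose output weight beta_j is near
    zero (irrelevant nodes) are the most likely to be picked."""
    psi = np.exp(-0.5 * np.asarray(beta, dtype=float) ** 2) / np.sqrt(2 * np.pi)
    return psi / psi.sum()

p = deletion_probs([0.05, 2.0, -3.0])
print(p.argmax())  # 0: the near-zero weight dominates the selection
```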
B.3b Delete Irrelevant Node. Linear + NN(m + 1) → Linear + NN(m), m ≥ 0:

(θ, uθ) → (φ, uφ),
θ = {β1, β2, …, βM, βM+1, γ1, γ2, …, γM, γM+1, σ², µβ, σβ², µγ, Σγ},
φ = {β1, β2, …, βM, γ1, γ2, …, γM, σ², µβ, σβ², µγ, Σγ},
uφ = {ε, δ}.

Let ψj = (1/√(2π)) exp(−βj²/2), j = 1, …, M + 1. Choose a node h at random with probabilities pj = ψj / Σ_{k=1}^{M+1} ψk, j = 1, …, M + 1. Remove node h and relabel the remaining nodes as 1, …, M. The transformation (φ, uφ) = g(θ) is defined by

ε = βh,  δ = γh,

with Jacobian J = 1, and

p(φ | pro) / p(θ | cur) = [1 / (p(βh | µβ, σβ²) p(γh | µγ, Σγ))] · M!/(M + 1)!,
p(pro)/p(cur) = 1, assuming K equiprobable models, p(i) = 1/K, ∀i,
p_pro→cur / p_cur→pro = B3a / [B3d ψh / Σ_{k=1}^{M+1} ψk] = Σ_{k=1}^{M+1} ψk / ψh, assuming B3d = B3a = B3/2,
q(uφ | pro) / q(uθ | cur) = p(ε | µβ, σβ²) p(δ | µγ, Σγ),

so that the acceptance probability is given by

α = min{1, [p(y | φ)/p(y | θ)] · Σ_{k=1}^{M+1} ψk / ((M + 1) ψh)}.

B.4 Add/Delete Duplicate Node. We propose this move with probability B4.

B.4a Add Duplicate Node. Linear + NN(m) → Linear + NN(m + 1), m ≥ 1:

(θ, uθ) → (φ, uφ),
θ = {β1, β2, …, βM, γ1, γ2, …, γM, σ², µβ, σβ², µγ, Σγ},
uθ = {ν, δ},
φ = {β1, β2, …, β̃h, β̃h+1, …, βM, γ1, γ2, …, γ̃h, γ̃h+1, …, γM, σ², µβ, σβ², µγ, Σγ}.

Choose h = 1, …, M at random and propose (β̃h, γ̃h) and (β̃h+1, γ̃h+1), where

β̃h = βh(1 − ν),  γ̃h = γh,
β̃h+1 = βh ν,  γ̃h+1 = γh + δ.

Reject if (γ1, γ2, …, γ̃h, γ̃h+1, …, γM) is not ordered. The Jacobian of the transformation φ = g(θ, uθ) is

J = |∂(β̃h, γ̃h, β̃h+1, γ̃h+1) / ∂(γh, βh, ν, δ)| = |βh|.

The proposal is given by

ν ∼ Beta(2, 2),  δ ∼ N(0, cSγ).

Thus,

p(φ | pro) / p(θ | cur) = (M + 1) · [p(β̃h | µβ, σβ²) p(γ̃h | µγ, Σγ) p(β̃h+1 | µβ, σβ²) p(γ̃h+1 | µγ, Σγ)] / [p(βh | µβ, σβ²) p(γh | µγ, Σγ)],
p_pro→cur / p_cur→pro = [B4d / M] / B4a = 1/M, assuming B4d = B4a = B4/2,
q(uφ | pro) / q(uθ | cur) = 1 / [Beta(ν | 2, 2) N(δ | 0, cSγ)].

B.4b Delete Duplicate Node. Linear + NN(m + 1) → Linear + NN(m), m ≥ 1:

(θ, uθ) → (φ, uφ),
θ = {β1, β2, …, βM, βM+1, γ1, γ2, …, γM, γM+1, σ², µβ, σβ², µγ, Σγ},
φ = {β1, β2, …, β̃h, …, βM, γ1, γ2, …, γ̃h, …, γM, σ², µβ, σβ², µγ, Σγ},
uφ = {ν, δ}.

Select h = 1, …, M at random. Remove βh+1, γh+1 and propose (β̃h, γ̃h), where

β̃h = βh + βh+1,
γ̃h = γh,
ν = βh+1 / (βh + βh+1),
δ = γh+1 − γh.

Then relabel the nodes as 1, …, M. The Jacobian of the transformation (φ, uφ) = g(θ) is

J = |∂(β̃h, γ̃h, ν, δ) / ∂(γh, γh+1, βh, βh+1)| = 1 / |βh + βh+1|,

and

p(φ | pro) / p(θ | cur) = [p(β̃h | µβ, σβ²) p(γ̃h | µγ, Σγ)] / [(M + 1) p(βh | µβ, σβ²) p(γh | µγ, Σγ) p(βh+1 | µβ, σβ²) p(γh+1 | µγ, Σγ)],
p(pro)/p(cur) = 1, assuming K equiprobable models, p(i) = 1/K, ∀i,
p_pro→cur / p_cur→pro = B4a / [B4d / M] = M, assuming B4d = B4a = B4/2,
q(uφ | pro) / q(uθ | cur) = I(0,1)(ν) Beta(ν | 2, 2) N(δ | 0, cSγ).

Acknowledgments

This work has been supported by projects from CICYT, CAM, and URJC. R.M.D. gratefully acknowledges receipt of a postdoctoral fellowship from CAM (Comunidad Autónoma de Madrid). We also thank the referees for their illuminating comments.

References

Andrieu, C., de Freitas, J. F. G., & Doucet, A. (1999). Robust full Bayesian learning for neural networks (Tech. Rep. CUED/F-INFENG/TR 343). Cambridge: Cambridge University, Engineering Department.
Box, G. E. P., & Jenkins, G. M. (1970). Time series analysis: Forecasting and control. San Francisco: Holden-Day.
Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14, 179–211.
Faraway, J., & Chatfield, C. (1998). Time series forecasting with neural networks: A comparative study using the airline data. Journal of the Royal Statistical Society, Applied Statistics (Series C), 47, 231–250.
Foster, B., Collopy, F., & Ungar, L. H. (1992). Neural network forecasting of short noisy time series. Computers and Chemical Engineering, 16(4), 293–298.
Gamerman, D. (1997). Markov chain Monte Carlo: Stochastic simulation for Bayesian inference. London: Chapman & Hall.
Geman, S., & Geman, D. (1984). Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6(6), 721–741.
Green, P. (1995). Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika, 82, 711–732.
Hill, T., O'Connor, M., & Remus, W. (1996). Neural network models for time series forecasts. Management Science, 42, 1082–1092.
Holmes, C. C., & Mallick, B. K. (1998). Bayesian radial basis functions of variable dimension. Neural Computation, 10, 1217–1233.
Jordan, M. I. (1986). Attractor dynamics and parallelism in a connectionist sequential machine. In Proceedings of the Eighth Conference of the Cognitive Science Society.
MacKay, D. J. C. (1992). Bayesian methods for adaptive models. Unpublished doctoral dissertation, California Institute of Technology.
Müller, P., & Ríos Insua, D. (1998). Issues in Bayesian analysis of neural network models. Neural Computation, 10, 571–592.
Neal, R. M. (1996). Bayesian learning for neural networks. New York: Springer-Verlag.
Priestley, M. B. (1988). Non-linear and non-stationary time series analysis. London: Academic Press.
Remus, W., O'Connor, M., & Griggs, K. (1998). The impact of information of unknown correctness on the judgmental forecasting process. International Journal of Forecasting, 14, 313–322.
Richardson, S., & Green, P. J. (1997). On Bayesian analysis of mixtures with an unknown number of components. Journal of the Royal Statistical Society B, 59, 731–792.
Ríos Insua, D., & Müller, P. (1998). Feedforward neural networks for nonparametric regression. In D. Dey, P. Müller, & D. Sinha (Eds.), Practical nonparametric and semiparametric Bayesian statistics. Berlin: Springer.
Stern, H. (1996). Neural networks in applied statistics (with discussion). Technometrics, 38, 205–220.
Tang, Z., de Almeida, C., & Fishwick, P. A. (1991). Time series forecasting using neural networks versus Box-Jenkins methodology. Simulation, 57, 303–310.
van der Merwe, R., Doucet, A., de Freitas, N., & Wan, E. (2000). The unscented particle filter (Tech. Rep. CUED/F-INFENG/TR 380). Cambridge: Cambridge University, Engineering Department.
Williams, R. J., & Zipser, D. (1989). A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1, 270–280.

Received January 7, 2004; accepted July 1, 2004.
Communicated by Jiří Šíma
LETTER
Loading Deep Networks Is Hard: The Pyramidal Case David Windisch
[email protected] Bahnplatz 5, A-2371 Hinterbruehl, Austria
The question of whether it is possible to load deep neural network architectures efficiently is examined by considering the class of pyramidal architectures. This class allows only a low interaction of the nodes. Still, the loading problem is found to be NP-complete. This provides evidence that depth alone is a factor accounting for loading hardness.
1 Introduction

The loading problem for neural networks was first considered by Judd (1990). While he proved the loading problem to be NP-complete for arbitrary networks by reduction of the NP-complete 3-satisfiability problem (see Garey & Johnson, 1979), he found a polynomial-time loading algorithm for a restricted class of shallow architectures. Judd also proved NP-hardness of the loading problem for several classes of shallow architectures that do not satisfy his restriction but left the loading problem for deep (i.e., nonshallow) architectures unaddressed. This raises the question whether deep architectures can possibly be loaded in polynomial time. Deep architectures were considered by Šíma (1994), who proved NP-completeness for the class of triangular architectures, and by Hammer (1998a), who proved NP-hardness for certain classes of multilayer perceptrons. Finally, de Souto and de Oliveira (1999) introduced the restricted class of deep pyramidal architectures. These are similar to triangular architectures but allow only a minimal interaction between the nodes. de Souto and de Oliveira also proved NP-completeness of the loading problem for pyramidal architectures, but they overlooked that their proof holds only for a fan-in strictly greater than 2. Their result remains true for fan-in equal to 2 but requires a modified proof, which is given below. Also, NP-completeness is extended to another class of pyramidal architectures, where the number of hidden layers is fixed and the fan-in varies instead, and to the node function set of linearly separable functions. This result is of interest because it shows that not even for the restricted class of pyramidal architectures, in which only minimal interaction of the nodes is possible, can loading be achieved in polynomial time, provided P ≠ NP. This leads to the conclusion that depth by itself makes loading hard. Neural Computation 17, 487–502 (2005)
© 2005 Massachusetts Institute of Technology
Figure 1: A pyramidal architecture. The input nodes (empty circles) do not count as a layer of the architecture.
2 Terminology

Definition 1 (pyramidal architecture). A pyramidal architecture π(d, q) is a feedforward architecture whose nodes form a q-tree (q ≥ 2) with d + 1 layers, where all the nodes have the same fan-in q. The leaves (in layer 0) are the input nodes feeding the nodes in layer 1. The root of the tree (in layer d) is the output node. Thus, every node receives input from its q children and propagates its output to its parent. See Figure 1 for a schematic illustration and Figure 2 for an illustration of the case d = q = 3.

Definition 2 (node function sets used). The functions computed by the nodes can be chosen from either one of the following two sets:

1. AOFns = {AND, OR},
2. LSFns = {[⟨w, ·⟩ ≥ c] : w ∈ R^q, c ∈ R}, where ⟨·, ·⟩ denotes the inner product on R^q and

[⟨w, y⟩ ≥ c] = 1 if ⟨w, y⟩ ≥ c, and 0 otherwise.

By v, we denote a node of an architecture as well as the function computed by it. Also, we do not distinguish between a configured architecture and the function computed by it. This does not cause any confusion throughout this letter.
Definition 3. (configuration, item, task, performability) Let A be a class of architectures and F a set of node functions. We call a configuration of A ∈ A a mapping from the set of nodes of A to F . By a task T = {Ii }, we mean a set of correct input-output combinations Ii = (xi , αi ), called items. If the architecture A is equipped with a configuration (i.e., if A has node functions attached to its nodes), then we say that A performs T if, for any item I = (x, α) ∈ T, the configured architecture A, when given the input x specified in I, returns the output α specified in I. Definition 4. (loading problem—decision version) Let A be a class of architectures and F a set of node functions. We define the loading problem as the following decision problem: Given an architecture A ∈ A and a task T, is there a configuration of A such that A performs T? Formally, the loading problem is the problem of recognizing the following language: PerfF = {(A, T) : F allows a configuration of A such that A performs T},
(2.1)
where A denotes a network architecture in A, T a task. A pyramidal architecture (see definition 1, Figure 1) depends on two different parameters: depth d and fan-in q. In order to consider the hardness of the loading problem, we must let one of these parameters vary. We first examine the problem for variable d and fixed q. 3 Variable Depth, Fixed Fan-In We now state the theorem de Souto and de Oliveira (1999) attempted to prove and provide a modified proof that also holds for the case q = 2. Then, we explain where their proof fails. Theorem 1. PerfAOFns is NP-complete for the class of pyramidal architectures with fixed q ≥ 2 and variable d ≥ 3. Proof. The proof is accomplished by reduction of the NP-complete satisfiability problem (SAT). For an explanation of this technique, see Garey and Johnson (1979). By vi , i = 1, 2, . . . , qd−2 , we denote the nodes in layer 2, and by vik , k = 1, 2, . . . , q, the corresponding nodes of layer 1 (i.e., the input nodes of the node vi are the nodes {vik , k = 1, 2, . . . , q}; see Figure 2). As indicated previously, the variables for the nodes also denote the functions computed by them.
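The definitions above can be made concrete with a small sketch that evaluates a configured pyramidal architecture over AOFns and checks whether it performs a task; the (layer, index) bookkeeping is a hypothetical encoding, not notation from the letter.

```python
def eval_pyramid(x, d, q, node_fns):
    """Evaluate pi(d, q) bottom-up: x is a tuple of q**d input bits (layer 0),
    node_fns maps (layer, index) -> 'AND' or 'OR' for layers 1..d."""
    layer = list(x)
    for ell in range(1, d + 1):
        layer = [
            int(all(block)) if node_fns[(ell, i)] == 'AND' else int(any(block))
            for i, block in enumerate(
                layer[j : j + q] for j in range(0, len(layer), q)
            )
        ]
    return layer[0]

# A configured pi(2, 2): both layer-1 nodes compute AND, the output node OR.
cfg = {(1, 0): 'AND', (1, 1): 'AND', (2, 0): 'OR'}
task = [((1, 1, 0, 1), 1), ((1, 0, 0, 1), 0)]
print(all(eval_pyramid(x, 2, 2, cfg) == a for x, a in task))  # True
```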
Figure 2: A pyramidal architecture, where d = 3, q = 3. The empty circles are input nodes.
The SAT (satisfiability) problem can be defined as follows: Given a Boolean expression in conjunctive normal form, is there a truth assignment to its variables such that the expression is true? Formally, the SAT problem is the problem of recognizing the following language:

SAT = {(Z, C) : ∃Φ such that Φ(C) is true},

where Z = {ζ1, ζ2, …, ζr} is a set of Boolean variables, C is a Boolean expression in terms of conjunctions of disjunctions of literals, and Φ : Z → {0, 1} is some truth assignment to the variables in Z. We let (Z, C) be an arbitrary instance of the SAT problem, where C = {c1, c2, …, ce} is a conjunction of disjunctions over the set Z. For an arbitrarily chosen disjunction cj of C, we use the following notation:

cj = {ζ̃i1, ζ̃i2, …, ζ̃im},   (3.1)

where 1 ≤ i1 < i2 < ⋯ < im ≤ r, and ζ̃ip is the literal derived from the variable ζip (i.e., ζ̃ip = ζip or ζ̃ip = ¬ζip).

3.1 Reduction of SAT. We now reduce this instance (Z, C) of SAT to an instance (π, T) of the loading problem. To define the architecture π = π(d, q), we let every node vi in layer 2 represent one of the variables ζi. Hence, we need at least r nodes in layer 2. To meet this criterion, we choose

d = 2 + ⌈log_q r⌉.   (3.2)
Note that the input dimension is then qd . For any disjunction cj in C, we define a corresponding item Ij in the task, T = {I0 , I1 , . . . , Ie , J1 , J2 , . . . , Jqd−2 , K1 , K2 , . . . , Kqd−2 },
Loading Deep Networks Is Hard
as follows (and give the justification below):

I_j = ( ([ζ̃_1] 1^{q²−2q}) ([ζ̃_2] 1^{q²−2q}) ... ([ζ̃_r] 1^{q²−2q}) 0^{q^d − rq²}, 1 ),   j = 1, 2, ..., e,   (3.3)

where [ζ̃_i] denotes an encoding of the literal ζ̃_i, defined as follows:

[ζ̃_i] = (11)1^{q−2} (10)1^{q−2}   if ζ̃_i = ζ_i,
         (10)1^{q−2} (11)1^{q−2}   if ζ̃_i = ¬ζ_i,
         (10)1^{q−2} (10)1^{q−2}   if ζ̃_i is not present in c_j,   (3.4)

where the number of 1's is chosen such that half of the encoding [ζ̃_i] (i.e., (··)1^{q−2}) constitutes the whole input to the node v_{i1} and the other half the input to v_{i2}. In every bracket ([ζ̃_i] 1^{q²−2q}) in equation 3.3, the number of 1's is chosen such that the bracket ([ζ̃_i] 1^{q²−2q}) constitutes the whole input affecting the layer 2 node v_i. We define the item I_0 (corresponding to the empty disjunction) in the same way but with required output 0:

I_0 = ( ([(10)1^{q−2} (10)1^{q−2}] 1^{q²−2q})^r 0^{q^d − rq²}, 0 ).   (3.5)

The following items J_i, i = 1, 2, ..., q^{d−2}, force all nodes in layers ≥ 3 to compute OR. This is because only the ith layer 2 node will output 1 on the input of item J_i, so all nodes along the path from v_i to the output node (below v_i) must compute OR:

J_i = ( 0^{(i−1)q²} 1^{q²} 0^{q^d − iq²}, 1 ),   i = 1, 2, ..., q^{d−2}.   (3.6)

And finally, we construct the following items K_i, i = 1, 2, ..., q^{d−2}, such that K_i generates the input (100...0) to layer 2 node v_i, and thus forces v_i to compute AND:

K_i = ( 0^{(i−1)q²} 1^{q} 0^{q^d − (i−1)q² − q}, 0 ),   i = 1, 2, ..., q^{d−2}.   (3.7)
The construction can be performed in polynomial time with respect to the size of (Z, C). To show this, we note that the length of any item I is q^d + 1 = O(r) (by the definition of d; see equation 3.2), the number of items is equal to 1 + e + 2q^{d−2} = O(e + r), the number of nodes of A is 1 + q + ... + q^d = (q^{d+1} − 1)/(q − 1) = O(q^d) = O(r), and the number of edges of A is one less than the number of nodes of A.

We now give an illustrative example of the above construction for the case q = 3. Let (Z, C) be the following instance of SAT: Z = {ζ_1, ζ_2}, C = {c_1, c_2}, c_1 = {ζ_1}, c_2 = {¬ζ_1, ¬ζ_2}.
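This instance, and the depth chosen in equation 3.2, can be checked mechanically. A minimal sketch (the helper `satisfiable` and the clause encoding are ours, not the paper's):

```python
import itertools
import math

# Clauses encoded as lists of signed variable indices: +i stands for zeta_i,
# -i for NOT zeta_i. This encoding is our own illustrative convention.
def satisfiable(clauses, r):
    """Brute-force check whether some truth assignment Phi satisfies all clauses."""
    for bits in itertools.product([0, 1], repeat=r):
        phi = {i + 1: bits[i] for i in range(r)}
        if all(any(phi[abs(l)] == (1 if l > 0 else 0) for l in clause)
               for clause in clauses):
            return True
    return False

# Z = {zeta_1, zeta_2}, C = {c_1, c_2}, c_1 = {zeta_1}, c_2 = {NOT zeta_1, NOT zeta_2}
clauses = [[1], [-1, -2]]
r, q = 2, 3

print(satisfiable(clauses, r))          # True: zeta_1 = 1, zeta_2 = 0 works
print(2 + math.ceil(math.log(r, q)))    # depth d = 2 + ceil(log_q r) = 3
```

Brute force over the 2^r assignments is of course exponential in general; here it only serves to confirm the example.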
Following the above construction of a corresponding instance of PerfAOFns, we choose d = 2 + ⌈log_3 2⌉ = 3. This gives the architecture π(3, 3) depicted in Figure 2. The items in the task T are the following:

I_1 = (111101111 101101111 000000000, 1),
I_2 = (101111111 101111111 000000000, 1),
I_0 = (101101111 101101111 000000000, 0),
J_1 = (111111111 000000000 000000000, 1),
J_2 = (000000000 111111111 000000000, 1),
J_3 = (000000000 000000000 111111111, 1),
K_1 = (111000000 000000000 000000000, 0),
K_2 = (000000000 111000000 000000000, 0),
K_3 = (000000000 000000000 111000000, 0).

We now prove that (Z, C) ∈ SAT ⇔ (π, T) ∈ PerfAOFns. First, we assume that (Z, C) ∈ SAT and show that this implies that (π, T) ∈ PerfAOFns. The assumption means that there is a truth assignment Φ : Z → {0, 1} such that Φ(C) is true. We make the following choices for the node functions:
v_{i1} = AND, v_{i2} = OR,    if Φ(ζ_i) = 1,
v_{i1} = OR,  v_{i2} = AND,   if Φ(ζ_i) = 0,
v_{ik} = AND   ∀(i, k) : (i ≤ r, k ≥ 3) or i > r,
v_i = AND   ∀i (this is layer 2),   (3.8)

and all other nodes compute OR. First, we consider the item I_0 (see equation 3.5). For any i, both nodes v_{i1} and v_{i2} receive at least one 0 as input. But either v_{i1} or v_{i2} computes the AND function, so v_i receives at least one 0 (from v_{i1} or from v_{i2}) as its input. Since v_i computes AND, it follows that v_i outputs 0. Hence, all the layer 2 nodes output 0, so the final output is 0, as required. Now we consider all other items I_j, j = 1, 2, ..., e, in the task. Each of these requires output 1. Since all nodes in layers ≥ 3 compute OR, it suffices to check that there is at least one node in layer 2 that outputs 1 in each case. Here, we will use the established correspondence with the disjunctions in C. We choose an arbitrary item I_j = (x_j, α_j), say, and we denote the corresponding disjunction again by c_j = {ζ̃_{i_1}, ζ̃_{i_2}, ..., ζ̃_{i_m}} (see equation 3.1). Since we are assuming the Boolean expression C to be satisfied by the assignment Φ, it suffices to show that v_l(x_j) = Φ(ζ̃_l) for all l ∈ {i_1, i_2, ..., i_m}.
To show that this is indeed true, we need only check the possible cases:

(1) If Φ(ζ_l) = 1, then the node functions were chosen such that v_{l1} = AND and v_{l2} = OR (see equation 3.8):
  • If ζ̃_l = ζ_l, then we see from the definition of I_j (see equations 3.3 and 3.4) that the only node of all v_{lk}, k = 1, 2, ..., q, receiving a 0 as input is node v_{l2}. But v_{l2} = OR, so all inputs to node v_l are 1's, and v_l(x_j) = 1, as required.
  • If ζ̃_l = ¬ζ_l, then we see from the definition of I_j (see equations 3.3 and 3.4) that only v_{l1} receives a 0 as input. Since v_{l1} = AND, this implies that v_l receives a 0 as input, and hence v_l(x_j) = 0, as required.

(2) If Φ(ζ_l) = 0, then the node functions were chosen such that v_{l1} = OR and v_{l2} = AND (see equation 3.8). The same considerations as in case 1 show that v_l(x_j) = Φ(ζ̃_l) in this case, too.

Hence, π maps all items I_j, j = 0, 1, ..., e, correctly. If we consider the input of any item J_i, i = 1, 2, ..., q^{d−2} (see equation 3.6), we see that it generates input 1^q to the layer 2 node v_i, so that v_i must output 1. Since all nodes below layer 2 compute OR, this implies that the output of the architecture is 1, as required. Finally, the input of an item K_i, i = 1, 2, ..., q^{d−2} (see equation 3.7), generates input 0^q to all layer 2 nodes except v_i, whose input is 10^{q−1}. Since v_i computes AND, v_i outputs 0, like all other layer 2 nodes. Therefore, the output is 0, as required. We therefore conclude that we indeed have (π, T) ∈ PerfAOFns.

Now, we assume that (π, T) ∈ PerfAOFns and show that this implies that (Z, C) ∈ SAT. To this end, we define the truth assignment Φ by

Φ(ζ_i) = 1 if v_{i1} = AND,
Φ(ζ_i) = 0 otherwise,

and show that Φ(C) is true. First, we note that the set of items {J_i}_{i=1}^{q^{d−2}} requires all nodes in layers ≥ 3 to compute OR (see equation 3.6), and that therefore the items {K_i}_{i=1}^{q^{d−2}} require all layer 2 nodes v_i to compute AND (see equation 3.7). We need to choose any disjunction c_j (j ∈ {1, 2, ..., e}) and show that at least one literal in it is assigned the value 1. To do that, we consider the corresponding item I_j and compare it with I_0. We note that the outputs required by these two items differ. This difference can be accounted for only by some node functions at the nodes where the inputs of these two items differ. Comparing equations 3.3 and 3.4 with equation 3.5, we see that these nodes must be some nodes v_{i1} or v_{i2}, say, where i ≤ r. Remembering that the corresponding node v_i computes AND, we see that this implies that exactly one of the nodes v_{i1}, v_{i2} must compute the AND function, for
in the other two cases, the output of v_i would remain the same for inputs from I_0 and I_j. Equation 3.4 shows that if v_{i1} = AND (which implies that Φ(ζ_i) = 1), then ζ̃_i = ζ_i, and if v_{i1} ≠ AND (which implies that Φ(ζ_i) = 0), then (v_{i2} = AND and) ζ̃_i = ¬ζ_i. Hence, we have in both cases Φ(ζ̃_i) = 1, as required, so Φ(C) is true and (Z, C) ∈ SAT. This completes the proof that (Z, C) ∈ SAT ⇔ (π, T) ∈ PerfAOFns.

To show that PerfAOFns is in NP, we note that an algorithm checking whether a given configuration of an architecture A performs T has to evaluate all node functions once for every item in T, and hence runs in polynomial time. This completes the proof.

It should be noted that the construction of (A, T) in the last proof could be achieved with fewer items J_i, since one J_i for every layer 3 node would suffice. This, however, would complicate the notation unnecessarily.

This last proof is very similar to the one used by de Souto and de Oliveira (1999), except for two modifications. The first modification is that instead of the modified satisfiability problem (MSAT), proved to be NP-complete by Šíma (1994), we use the classical version SAT. The difference between these two problems is that MSAT requires positive and negative literals to alternate in every disjunction. In fact, even the proof by de Souto and de Oliveira (1999) can be left unchanged if MSAT is replaced by SAT. The crucial modification, however, is the extension of the task T (of the constructed instance of the loading problem) by the additional items J_i and K_i. These items force all nodes in layers ≥ 3 to compute OR and all nodes in layer 2 to compute AND. This property of a yes-instance of PerfAOFns is needed to infer that exactly one of the two layer 1 nodes v_{i1}, v_{i2} must compute AND in order to perform the task T. But this last fact is essential for the final argument that, given a performing configuration for π, the truth assignment Φ satisfies the Boolean expression C.
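The nine items of the q = 3 example above can be generated directly from equations 3.3 to 3.7. A sketch (the helper names are ours; clauses map a variable index to +1 for a positive literal and −1 for a negated one):

```python
import math

def encode_literal(var, clause, q):
    """Encoding [zeta~_i] of equation 3.4; clause maps var -> +1 or -1 if present."""
    half = "1" * (q - 2)
    sign = clause.get(var)
    if sign == +1:
        return "11" + half + "10" + half   # positive literal
    if sign == -1:
        return "10" + half + "11" + half   # negated literal
    return "10" + half + "10" + half       # variable absent from the clause

def build_task(clauses, r, q):
    """Items I_0, I_j, J_i, K_i of equations 3.3 and 3.5 to 3.7 (i here 0-based)."""
    d = 2 + math.ceil(math.log(r, q))      # equation 3.2 (fine for this example)
    n, pad = q ** d, q * q - 2 * q
    def item_I(clause):
        s = "".join(encode_literal(i, clause, q) + "1" * pad for i in range(1, r + 1))
        return s + "0" * (n - r * q * q)
    I0 = item_I({})
    Is = [item_I(c) for c in clauses]
    Js = ["0" * (i * q * q) + "1" * (q * q) + "0" * (n - (i + 1) * q * q)
          for i in range(q ** (d - 2))]
    Ks = ["0" * (i * q * q) + "1" * q + "0" * (n - i * q * q - q)
          for i in range(q ** (d - 2))]
    return I0, Is, Js, Ks

# the running example: c_1 = {zeta_1}, c_2 = {NOT zeta_1, NOT zeta_2}
I0, Is, Js, Ks = build_task([{1: +1}, {1: -1, 2: -1}], r=2, q=3)
print(Is[0])   # item I_1 above: 111101111101101111000000000
print(I0)      # item I_0 above: 101101111101101111000000000
```

The generated strings match the nine items listed for the example, which is a convenient check on the encoding of equation 3.4.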
If the fan-in q is strictly greater than 2, then even without the items J_i and K_i, an argument similar to the above allows us to conclude that exactly one of the two layer 1 nodes v_{i1}, v_{i2} must compute AND in order to perform the task, because the corresponding node v_i must compute AND to distinguish between a pair of items I_0 and I_j. However, this last argument breaks down if q = 2. To illustrate this point, we let q = 2 and provide an example of a no-instance of SAT (as well as of MSAT) that is mapped to a yes-instance of PerfAOFns if the items J_i and K_i are omitted. Let Z = {ζ_1, ζ_2}, C = {c_1, c_2, c_3}, c_1 = {¬ζ_1, ζ_2}, c_2 = {¬ζ_2}, c_3 = {ζ_1}. Obviously, C is then not satisfiable. However, using the reduction constructed in the proof, this instance gets mapped to π(3, 2) (i.e., a pyramidal architecture with three layers) with task T = {I_0, I_1, I_2, I_3}:

I_0 = (10101010, 0),
I_1 = (10111110, 1),
I_2 = (10101011, 1),
I_3 = (11101010, 1).

This task is performed by π(3, 2) with node functions (v denotes the output node; otherwise notation as in the above proof) v_{11} = AND, v_{12} = OR, v_{21} = AND, v_{22} = AND, v_1 = AND, v_2 = OR, v = OR.
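The claimed configuration can be checked by brute force. A small sketch (the evaluator is our own; the node functions are exactly those listed above):

```python
# Minimal evaluator for a pyramidal AND/OR architecture with fan-in 2,
# illustrating the q = 2 counterexample. Python's all/any compute AND/OR.
AND = all
OR = any

def eval_pyramid(bits, layer1, layer2, out):
    """bits: 8 input bits; layer1/layer2/out: node functions, fan-in 2."""
    l1 = [f(bits[2 * i: 2 * i + 2]) for i, f in enumerate(layer1)]  # v11, v12, v21, v22
    l2 = [f(l1[2 * i: 2 * i + 2]) for i, f in enumerate(layer2)]    # v1, v2
    return int(out(l2))                                             # output node v

layer1 = [AND, OR, AND, AND]   # v11, v12, v21, v22
layer2 = [AND, OR]             # v1, v2
task = [("10101010", 0), ("10111110", 1), ("10101011", 1), ("11101010", 1)]

for x, alpha in task:
    y = eval_pyramid([int(b) for b in x], layer1, layer2, OR)
    print(x, y == alpha)   # True for all four items
```

All four items are mapped correctly, confirming that this no-instance of SAT becomes a yes-instance of the loading problem once J_i and K_i are dropped.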
Figure 3: The class of architectures described in theorem 2.
It remains an open question whether the same hardness result as the one just proved also holds for the node function set LSFns. A generalization of the method used in the last proof to the LSFns case does not seem obvious.

4 Fixed Depth, Variable Fan-In

We now consider the set of pyramidal architectures with fixed d and variable q. We note that these architectures still qualify as deep networks because the support cone configuration space is unbounded. (The support cone configuration space of an output node is the set of all possible configurations of the nodes affecting it; see Judd, 1990.) We consider the node function sets LSFns and AOFns separately.

4.1 LSFns. We first restrict the set under consideration to the set of pyramidal architectures with fixed d = 2. From Blum and Rivest (1992), we know that the loading problem is NP-complete for a similar class of architectures:

Theorem 2. PerfLSFns is NP-hard for the class of architectures mapping {0, 1}^{q−1} to {0, 1}, with q nodes in one hidden layer, each connected to all input nodes and computing LSFns, and one output node computing AND. (See Figure 3.)

We denote the problem in the last theorem by Perf̂. (In fact, the proof given by Blum & Rivest, 1992, covers even the more general case, where the number of hidden nodes is bounded by a polynomial in q.) We will use theorem 2 to prove the following lemma:

Lemma 1. PerfLSFns is NP-hard for the class of pyramidal architectures with fixed d = 2 and variable q ≥ 2.
Figure 4: Illustration of the proof of lemma 1. An instance of Perf̂ (left) is reduced to the instance on the right. The input shows the construction of the first set of the task T̄.

Proof. The proof is by reduction of Perf̂, the problem proved to be NP-hard in theorem 2 (the idea is illustrated in Figure 4). We denote by (Â, T̂) an arbitrary instance of Perf̂. The input dimension of this instance is q − 1 and the number of hidden nodes is q. We construct an instance (Ā, T̄) of our problem PerfLSFns with fan-in q of Ā. Note that this implies that the input dimension of Ā is q². Let T̂ = {(x_i, α_i)}. We then choose some item of T̂ with required output 1 and redenote its input by x_1, if necessary. If no such item exists, that is, if all required outputs are 0, then (Â, T̂) is a trivial yes-instance ((Â, T̂) ∈ Perf̂), and the reduction can be completed by constructing a trivial yes-instance of PerfLSFns (it suffices to choose the node functions such that the output of Ā is constant, with a corresponding task). Otherwise, we continue by defining the task T̄ as follows:

T̄ = {((x_i 0)^q, α_i) : (x_i, α_i) ∈ T̂}
  ∪ {((x_1 1)(x_1 0)^{q−1}, 0), ((x_1 0)(x_1 1)(x_1 0)^{q−2}, 0), ..., ((x_1 0)^{q−1}(x_1 1), 0)}.

That is, T̄ is an enlargement of T̂, where the second set is to make sure that the output node of Ā is forced to compute the AND function (without loss of generality; see the argument below).

To check that this construction can be performed in polynomial time with respect to |(Â, T̂)|, we note that the number of nodes in Ā is q² + q + 1 = O(q²) = O(|Â|), that the number of edges in Ā is q² + q = O(q²) = O(|Â|), that |T̄| = |T̂| + q = O(|(Â, T̂)|), and that the length of every item in T̄ is q² + 1 = O(|Â|).

It remains to be shown that Â can perform T̂ if and only if Ā can perform T̄. We first assume that there is a configuration of Â such that Â performs T̂, and we show that this implies that there is a configuration of Ā such
that Ā performs T̄. We denote the hidden nodes (as well as the functions computed by them) by v̂_i (for Â) and v̄_i (for Ā), i = 1, ..., q, and the output node by v̂ or v̄, respectively. We denote the performing configuration of Â as v̂_i(y) = [⟨ŵ_i, y⟩ ≥ ĉ_i], with ŵ_i = (ŵ_{i1}, ŵ_{i2}, ..., ŵ_{i(q−1)}) (and v̂ = AND). We then choose the configuration of Ā as follows:

v̄_i(y) = [⟨(ŵ_{i1}, ŵ_{i2}, ..., ŵ_{i(q−1)}, θ), y⟩ ≥ ĉ_i],
v̄ = v̂ = AND (i.e., v̄(y) = [⟨(11...1), y⟩ ≥ q]),

where θ is some low enough value such that a 1 in the last input bit to any v̄_i yields a weighted sum of less than ĉ_i, and hence the output 0 (we may choose any θ < min{ĉ_i − ⟨ŵ_i, x_j⟩ : i = 1, 2, ..., q, j = 1, 2, ..., |T̂|}). The architecture Ā configured in this way then performs T̄.

Conversely, we assume that there is some configuration of Ā such that Ā performs T̄. The second set in the definition of T̄ then implies that there is exactly one possible input to the output node v̄, z say, such that v̄(z) = 1, because any change of the input to v̄ must decrease the weighted sum of the function computed by v̄ to a value below its threshold, making v̄ output 0. Thus, the function computed by v̄ is entirely determined by z, the unique vector in {0, 1}^q such that v̄(z) = 1.

The node function set LSFns is closed under negation. This is due to the fact that ¬[⟨w, ·⟩ ≥ c] = [⟨−w, ·⟩ > −c] = [⟨−w, ·⟩ ≥ −c + ε] for ε > 0 small enough. Therefore, if we replace the function computed by a node v̄_i by its negation, we still have a valid configuration of Ā. Furthermore, it is easy to see that for any z ∈ {0, 1}^q, the Boolean function from {0, 1}^q to {0, 1} that is equal to 1 only at z is in the set LSFns (in fact, this function can be written as [⟨ẑ, ·⟩ ≥ Σ_{i=1}^{q} z_i / 2], where ẑ_i = z_i − 1/2). Since v̄ computes such a function, we may therefore adjust the function computed by v̄ so that the mapping computed by the entire architecture remains the same after the previous replacement of v̄_i by its negation. Indeed, it suffices to replace z_i by its negation (i.e., to make z̃ = (z_1, z_2, ..., z_{i−1}, ¬z_i, z_{i+1}, ..., z_q) the unique vector such that v̄(z̃) = 1), since after the replacement of v̄_i by its negation, the ith input bit to v̄ is always the negation of what it was before the replacement, whereas the other input bits remain unchanged. Since the mapping is the same after this adjustment, the task is still performed by the architecture. By making the appropriate replacements just described for every i such that z_i = 0, we obtain a performing configuration under which v̄(y) = 1 only if y = (11...1), which means that v̄ computes AND.

The choice of the first set in the definition of T̄ then implies that Â performs T̂. Using the above notation for the (adjusted) node functions, we let ŵ_i = (w̄_{i1}, w̄_{i2}, ..., w̄_{i(q−1)}) (we drop the last entry w̄_{iq}) and ĉ_i = c̄_i.

Hence, Â can perform T̂ if and only if Ā can perform T̄, which completes the proof.
Figure 5: In the LS-functions case, if the fan-in of Ā is 2, the architecture A constructed in the proof of theorem 3 looks like this (with output node v_{d1} = v). The circled nodes correspond to Ā. The input nodes shown (empty circles) are the only nodes receiving nonzero input from the constructed task T.
Using another reduction, we can generalize this last result to an arbitrary number of fixed layers.

Theorem 3. PerfLSFns is NP-complete for the class of pyramidal architectures with fixed d ≥ 2 and variable q ≥ 2.

Proof. The proof is accomplished by reduction of PerfLSFns for pyramidal architectures with fixed d = 2 and variable q ≥ 2, which was proved to be NP-hard in lemma 1. We denote this last problem by Perf̄. Let (Ā, T̄) be an arbitrary instance of Perf̄ with fan-in q. We also let T̄ = {(x_i, α_i)}. Then we construct an instance (A, T) of our problem PerfLSFns with the same fan-in q and with the following task:

T = {(x_i 00...0, α_i) : (x_i, α_i) ∈ T̄}.

From this construction, it follows that the nodes receiving nonconstant input from items in T correspond exactly to Ā (these are the circled nodes in Figure 5).

This construction can be performed in polynomial time in |(Ā, T̄)|, because |T| = |T̄|, and the length of every item in T is O(|A|), as is the number of nodes and edges in A.
It remains to be shown that Ā performs T̄ if and only if A performs T. We denote the jth node in layer i by v_{ij} and v̄_{ij}, for A and Ā, respectively (see Figure 5), and we first assume that Ā performs T̄. To see that this implies that A performs T, we choose v_{ij} = v̄_{ij} for all nodes v̄_{ij} of Ā. All the other node functions of A are chosen arbitrarily, except those along the path from v_{21} to the output node v_{d1} = v, which are chosen such that the output of v_{21} equals the output of v for any input in the task T (the node v_{k1} just propagates the input received from node v_{(k−1)1} (k = 3, 4, ..., d), regardless of all other input bits). With this choice, A performs T.

Next, we assume that A performs T. If all outputs occurring in T are equal, then this is also true for all outputs occurring in T̄, by the definition of T, and hence Ā can trivially perform T̄. Otherwise, we note that for all items in T, the output of v is entirely determined by the output of v_{21}. Indeed, all other layer 2 nodes produce the same output for every input present in T. Hence, for inputs in T, the outputs of v and v_{21} are either always equal or always negations of each other (here, we use the fact that not all outputs occurring in T are equal). Therefore, should the two outputs always be negations of each other, we may replace the functions computed by v_{21} and v by their negations (thus negating every output in T twice) without influencing the performance of T (here, as in the proof of lemma 1, we use closure under negation of LSFns). With the resulting configuration, A still performs T, and the outputs of v_{21} and v are equal for every input appearing in T (if v has been replaced by its negation, the output of v_{21} no longer gets negated). We then choose v̄_{ij} = v_{ij} as the configuration of Ā, which shows that Ā performs T̄. Hence, Ā performs T̄ if and only if A performs T.

It remains to be shown that PerfLSFns is in NP.
It is known (Muroga, Toda, & Takasu, 1961) that any linearly separable Boolean function (i.e., any function in the set LSFns) receiving q inputs can be represented using no more than O(q² log q) bits. Since an algorithm checking whether a given configuration of A performs T needs to evaluate every node function once for every item in T, this algorithm therefore runs in polynomial time. Hence, PerfLSFns is in NP, and this completes the proof.

4.2 AOFns. Similarly, we now consider AOFns as the node function set. The first result is proved in the same way as theorem 1:

Lemma 2. PerfAOFns is NP-complete for the class of pyramidal architectures with fixed d = 3 and variable q ≥ 2.

Proof. The proof is similar to that of theorem 1. The reduction of SAT to PerfAOFns remains the same, except that instead of choosing a large enough d, we here choose a large enough q = r, where r is the number of variables in the instance of SAT. The rest of the proof is left unchanged.
The proof of the next theorem is similar to the proof of theorem 3:

Theorem 4. PerfAOFns is NP-complete for any class of pyramidal architectures with fixed d ≥ 3 and variable q ≥ 2.

Proof. The proof is by reduction of the result of lemma 2, proceeding as in the proof of theorem 3.

The following corollary gives a result for LSFns as well as AOFns and is a direct consequence of theorems 3 and 4:

Corollary 1. Perf (LS functions or AO functions) is NP-complete for the class of pyramidal architectures with variable q ≥ 2 and variable d ≥ 3 (AO functions) / d ≥ 2 (LS functions), where the parameters d and q are understood to be given as input to the loading algorithm.

Proof. The proof is by identity reduction of Perf for the class of pyramidal architectures with fixed d and variable q, which was proved to be NP-hard in theorems 3 and 4. The proof that the problem is in NP is the same as the one in the proof of theorem 1 (AOFns case) or theorem 3 (LSFns case), respectively.

5 Conclusion

It remains open whether the above results hold for more widely used classes of node functions, especially in the case of variable depth and fixed fan-in, in which NP-hardness was proved only for the node function set AOFns. An important extension would be to the sigmoidal activation functions used in the backpropagation learning algorithm (Rumelhart, Hinton, & Williams, 1986). A theorem for sigmoidal activation functions was proved by Šíma (1996): NP-hardness holds for the three-node network structure, with a restriction on the node functions that holds, for example, if the threshold of the function computed by the output node is 0. Due to Hammer (1998b), NP-hardness of loading the three-node structure also holds with a different restriction on weights and classification accuracy. Vu (1998) considered a certain class of two-layered sigmoidal networks and showed that it is NP-hard to decide whether there exist weights minimizing the training error up to a certain constant.
Similar hardness results (for approximate minimization of the quadratic error) on two-layered LSFns networks with a sigmoidal output node are due to Bartlett and Ben-David (2002). Šíma (2002) extended NP-hardness of approximate minimization of the squared error to the single sigmoidal node (Höffgen, Simon, & Van Horn, 1995, showed a similar single-node NP-hardness result for the LSFns case). DasGupta and Hammer (2000) considered two-layered sigmoidal (and other multilayered) networks, and
the problem, proved to be NP-hard, of whether there are weights maximizing the ratio of correctly classified items in a task up to a certain constant. Šíma (2002) gives a more detailed survey of these and other results. While they suggest that hardness results on pyramidal architectures also hold for sigmoidal activation functions, it is not obvious how a generalization to pyramidal architectures could be achieved. Still, the above contribution provides evidence that depth is a factor that can make loading hard by itself.

Acknowledgments

This research was supported by an Undergraduate Summer Research Bursary from Keele University, Department of Mathematics, Keele, Staffordshire, UK, ST5 5DT.

References

Bartlett, P., & Ben-David, S. (2002). Hardness results for neural network approximation problems. Theoretical Computer Science, 284, 53–66.
Blum, A., & Rivest, R. L. (1992). Training a three-node neural network is NP-complete. Neural Networks, 5, 117–127.
DasGupta, B., & Hammer, B. (2000). On approximate learning by multilayered feedforward circuits. In H. Arimura, S. Jain, & A. Sharma (Eds.), Algorithmic Learning Theory 2000 (pp. 264–278). Berlin: Springer.
de Souto, M. C. P., & de Oliveira, W. R. (1999). The loading problem for pyramidal neural networks. Electronic Journal on Mathematics of Computation. Available at: http://gmc.ucpel.tche.br/ejmc/.
Garey, M. R., & Johnson, D. S. (1979). Computers and intractability: A guide to the theory of NP-completeness. San Francisco: Freeman.
Hammer, B. (1998a). Some complexity results for perceptron networks. In L. Niklasson, M. Boden, & T. Ziemke (Eds.), International Conference on Artificial Neural Networks '98 (pp. 639–518). Berlin: Springer-Verlag.
Hammer, B. (1998b). Training a sigmoidal network is difficult. In M. Verleysen (Ed.), European Symposium on Artificial Neural Networks '98 (pp. 255–260). Brussels: D-Facto Publications.
Höffgen, K.-U., Simon, H.-U., & Van Horn, K. S. (1995). Robust trainability of single neurons. Journal of Computer and System Sciences, 50, 114–125.
Judd, J. S. (1990). Neural network design and the complexity of learning. Cambridge, MA: MIT Press.
Muroga, S., Toda, I., & Takasu, S. (1961). Theory of majority decision elements. Journal of the Franklin Institute, 271, 376–418.
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323(9), 533–536.
Šíma, J. (1994). Loading deep networks is hard. Neural Computation, 6(5), 842–850.
Šíma, J. (1996). Back-propagation is not efficient. Neural Networks, 9(6), 1017–1023.
Šíma, J. (2002). Training a single sigmoidal neuron is hard. Neural Computation, 14, 2709–2728.
Vu, V. H. (1998). On the infeasibility of training neural networks with small squared errors. In M. I. Jordan, M. J. Kearns, & S. A. Solla (Eds.), Advances in neural information processing systems, 10 (pp. 371–377). Cambridge, MA: MIT Press.

Received September 4, 2003; accepted July 1, 2004.
NOTE
Communicated by Klaus Obermayer
Maximum Likelihood Topographic Map Formation Marc M. Van Hulle
[email protected] K.U.Leuven, Laboratorium voor Neuro- en Psychofysiologie, B-3000 Leuven, Belgium
We introduce a new unsupervised learning algorithm for kernel-based topographic map formation of heteroscedastic gaussian mixtures that allows for a unified account of distortion error (vector quantization), log-likelihood, and Kullback-Leibler divergence.

1 Introduction

Several unsupervised learning algorithms have been devised in recent years that develop topographically organized maps of gaussian mixture densities (for references, see Van Hulle, 2002). Unifying accounts of the homoscedastic (equal-variance) case have been introduced by Graepel and coworkers (Graepel, Burger, & Obermayer, 1998) and by Heskes (2001). Graepel et al. adopted a statistical physics approach, called deterministic annealing, showing the connection between different classes of kernel-based topographic map formation algorithms and, eventually, as a limiting case, the batch map version of the self-organizing map (SOM) algorithm (Kohonen, 1995). Heskes showed the connection between minimum distortion topographic map formation and maximum likelihood homoscedastic gaussian mixture density modeling. An approach different from distortion-based learning is to optimize an information-theoretic criterion. Linsker (1989) was among the first to do so by applying his principle of maximum information preservation (infomax) to topographic map formation. Another approach is to minimize the Kullback-Leibler divergence (also termed cross-entropy) between the true and estimated input densities, an idea introduced in kernel-based topographic map formation by Benaim and Tomasini (1991), using homoscedastic gaussians, and extended more recently by Yin and Allinson (2001) to heteroscedastic (different-variance) gaussians.

In this article, we introduce an algorithm that develops topographically organized maps of heteroscedastic gaussian mixtures. We show the link between distortion error (vector quantization), log-likelihood, and Kullback-Leibler divergence.
We develop an alternative kernel definition that greatly simplifies the link and introduce a likelihood-based regularizer to avoid nonoptimal solutions. Furthermore, we suggest an alternative approach to log-likelihood, called differential log-likelihood, for monitoring, in an unbiased manner, the learning process. Neural Computation 17, 503–513 (2005)
© 2005 Massachusetts Institute of Technology
2 Likelihood Maximization

Let v = [v_1, ..., v_d] be a random vector in V ⊆ ℝ^d generated from the probability density p(v). Assume we have N formal neurons with gaussian activation functions K_i(v, w_i, σ_i), i = 1, ..., N, with center w_i = [w_{i1}, ..., w_{id}] ∈ ℝ^d and radius σ_i,

K_i(v, w_i, σ_i) = 1/(2πσ_i²)^{d/2} exp( −‖v − w_i‖² / (2σ_i²) ),   (2.1)

such that we obtain a homogeneous, heteroscedastic gaussian mixture estimate of the unknown input density:

p(v) ≈ p̃(v|W, σ) = (1/N) Σ_i K_i(v, w_i, σ_i).   (2.2)

A standard procedure to estimate the parameters W = (w_1, ..., w_N) and σ = (σ_1, ..., σ_N) is by maximizing the (average) likelihood, or by minimizing the (average) negative log-likelihood, for the sample S = {v^μ | μ = 1, ..., M} (Redner & Walker, 1984):

F = −log L = −(1/M) Σ_μ log p̃(v^μ|W, σ),   (2.3)
through an expectation-maximization (EM) approach (Dempster, Laird, & Rubin, 1977).

Suppose the gaussian mixture is to be developed in such a way that the N neurons form a lattice in which neighboring neurons code for neighboring positions in the input space (topology-preserving map). Define Λ_ij as the usual neighborhood function, a symmetric (Λ_ij = Λ_ji), monotonically decreasing function of the lattice distance between neurons i and j. Before we define our gaussian activation function for such a lattice, consider the following reasoning. We take the weighted and normalized error term shown next, on which we apply a bias-variance decomposition into a term depending on v^μ and another independent of v^μ:

Σ_r Λ_ir ‖v^μ − w_r‖² / (2σ_r²) = ‖v^μ − w̄_i‖² / (2σ̄_i²) + Σ_r Λ_ir ‖w̄_i − w_r‖² / (2σ_r²),   (2.4)

with

w̄_i = ( Σ_r Λ_ir w_r / σ_r² ) / ( Σ_r Λ_ir / σ_r² ),   (2.5)

1/σ̄_i² = Σ_r Λ_ir / σ_r².   (2.6)

The gaussian activation kernel can then be defined as

K_i(v, w̄_i, σ̄_i) = 1/(2πσ̄_i²)^{d/2} exp( −‖v − w̄_i‖² / (2σ̄_i²) ),   (2.7)

so that we obtain the following mixture density model:

p̃(v|W, σ, Q) = Σ_i q_i K_i(v, w̄_i, σ̄_i),   (2.8)
where Q = (q_1, ..., q_N), with

q_i = exp( −Σ_r Λ_ir ‖w̄_i − w_r‖² / (2σ_r²) ) / Z

the kernel's prior probability, which contains the topological information, with Z the usual normalization constant so that Σ_i q_i = 1. Taking the derivatives of the log-likelihood, equation 2.3, with respect to the w_i's and setting them equal to zero leads to the following set of fixed-point rules:

w_i = ( Σ_μ Σ_r Λ_ir p_r^μ v^μ ) / ( Σ_μ Σ_r Λ_ir p_r^μ ),   ∀i,   (2.9)

where we have substituted for the posterior probabilities P(i|v^μ) ≡ p_i^μ = q_i K_i^μ / Σ_j q_j K_j^μ.
We then take the derivatives of F with respect to the radii σ_i and set them equal to zero. After some straightforward algebraic manipulations, this yields the following set of linear equations in (σ̄_1)², …, (σ̄_N)²:

\[
\sum_{\mu}\sum_{j}p_{j}^{\mu}\,(\bar\sigma_{j})^{2}\,\Lambda_{ij}
=\frac{1}{d}\sum_{\mu}\sum_{j}p_{j}^{\mu}\,\Lambda_{ij}\,\|v^{\mu}-w_{i}\|^{2},
\quad\forall i.
\tag{2.10}
\]

Algorithmically, we adopt an EM approach. In the expectation step, we determine the new values for p_i^µ. In the maximization step, we determine the new w_i and σ_i, given p_i^µ. We determine w_i^new from equation 2.9, then (σ̄_i^new)² by solving equation 2.10, and then σ_i^new by solving the linear equations in the σ_i², equation 2.6.
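One EM iteration of this scheme can be sketched in NumPy. This is a minimal illustration of equations 2.9, 2.10, and 2.6, assuming a symmetric neighborhood matrix Λ; all function and variable names are ours, not the paper's.

```python
import numpy as np

def em_step(V, W, sigma, Lam):
    """One EM step for the heteroscedastic gaussian-mixture map.

    V : (M, d) data, W : (N, d) kernel centres, sigma : (N,) radii,
    Lam : (N, N) symmetric neighborhood matrix Lambda_ij.
    A sketch of equations 2.9, 2.10, and 2.6 of the text."""
    M, d = V.shape
    sq = ((V[:, None, :] - W[None, :, :]) ** 2).sum(-1)    # ||v^mu - w_i||^2, (M, N)

    # E-step: priors q_i (topological information) and posteriors p_i^mu.
    WW = ((W[:, None, :] - W[None, :, :]) ** 2).sum(-1)    # ||w_i - w_r||^2
    logq = -(Lam / (2 * sigma**2)[None, :] * WW).sum(1)    # sum_r Lam_ri ||w_i - w_r||^2 / 2 sig_r^2
    logK = -0.5 * d * np.log(2 * np.pi * sigma**2)[None, :] - sq / (2 * sigma**2)[None, :]
    logp = logq[None, :] + logK
    p = np.exp(logp - logp.max(1, keepdims=True))
    p /= p.sum(1, keepdims=True)                           # p_i^mu, (M, N)

    # M-step, eq. 2.9: weighted centroid update of the centres.
    resp = p @ Lam.T                                       # sum_r Lam_ir p_r^mu, (M, N)
    W_new = (resp.T @ V) / resp.sum(0)[:, None]

    # Eq. 2.10: linear system for the smoothed variances sbar_j^2.
    sq_new = ((V[:, None, :] - W_new[None, :, :]) ** 2).sum(-1)
    A = Lam * p.sum(0)[None, :]                            # A_ij = Lam_ij sum_mu p_j^mu
    b = (resp * sq_new).sum(0) / d                         # (1/d) sum_mu,j p_j^mu Lam_ij ||.||^2
    sbar2 = np.linalg.solve(A, b)

    # Eq. 2.6: 1/sbar_i^2 = sum_r Lam_ri / sig_r^2, solved for the sig_r^2.
    inv_s2 = np.linalg.solve(Lam, 1.0 / sbar2)
    sigma_new = 1.0 / np.sqrt(np.abs(inv_s2))              # abs() guards numerical noise
    return W_new, sigma_new, p
```

In practice, `em_step` would be applied repeatedly while the neighborhood range of Λ is cooled, as in the simulation of section 5.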
3 Link with Vector Quantization

Consider the following distortion or quantization error functional:

\[
E(W,\sigma)=\sum_{\mu}\sum_{i}\sum_{j}\Lambda_{ji}\,p_{j}^{\mu}\,
\frac{\|v^{\mu}-w_{i}\|^{2}}{2\sigma_{i}^{2}}
-\sum_{\mu}\sum_{i}p_{i}^{\mu}\,\frac{d}{2}\log\frac{1}{\bar\sigma_{i}^{2}}
+\sum_{\mu}\sum_{i}p_{i}^{\mu}\log\frac{p_{i}^{\mu}}{1/N}.
\tag{3.1}
\]
M. Van Hulle
The first term on the right-hand side resembles (apart from σ_i) the familiar distortion term in topographic map formation (see, e.g., Heskes, 2001), with p_i^µ the probability that kernel K_i generated input v^µ and Λ_ij the "confusion probability" that input v^µ is instead generated by kernel K_j. The second term contains the effect of having variable kernel radii (note that it is a constant when the radii are equal, the homoscedastic case, and is therefore absent in Heskes's 2001 format, equation 1). The last term is a cross-entropy, intended to minimize the discrepancy between the prior probability assignments P(i) = 1/N and the actual ones, p_i^µ, and thus acts as a regularization term. From the derivative with respect to w_i, we obtain equation 2.9. The derivative with respect to σ_i leads to

\[
\frac{\partial E}{\partial\sigma_{i}}
=-\sum_{\mu}\sum_{j}\Lambda_{ij}\,p_{j}^{\mu}\,
\frac{\|v^{\mu}-w_{i}\|^{2}}{\sigma_{i}^{3}}
+\sum_{\mu}\sum_{j}p_{j}^{\mu}\,
\frac{d\,\Lambda_{ij}}{\sigma_{i}^{3}\sum_{k}\Lambda_{jk}/\sigma_{k}^{2}}
=-\sum_{\mu}\sum_{j}\Lambda_{ij}\,p_{j}^{\mu}\,
\frac{\|v^{\mu}-w_{i}\|^{2}}{\sigma_{i}^{3}}
+\sum_{\mu}\sum_{j}p_{j}^{\mu}\,d\,\Lambda_{ij}\,
\frac{\bar\sigma_{j}^{2}}{\sigma_{i}^{3}},
\tag{3.2}
\]
which, when set equal to zero, yields equation 2.10. The derivative with respect to p_i^µ becomes, after substitution of σ̄_i,

\[
p_{i}^{\mu}=\frac{q_{i}\,K_{i}(v^{\mu},w_{i},\sigma_{i})}
{\sum_{j}q_{j}\,K_{j}(v^{\mu},w_{j},\sigma_{j})}.
\tag{3.3}
\]

Finally, when filling in the expression for p_i^µ in equation 3.1, and after some algebraic manipulations, we obtain that

\[
E=(-\log L+R)\,M\,(\sqrt{2\pi})^{d},
\tag{3.4}
\]

with

\[
R=-\frac{1}{M}\sum_{\mu}\log\sum_{i}q_{i}
\exp\!\left(-\sum_{j}\frac{\Lambda_{ij}}{2\sigma_{j}^{2}}\|w_{i}-w_{j}\|^{2}\right),
\tag{3.5}
\]

the part that collects the topological information. We observe that, up to a scale factor, distortion minimization is equivalent to the combined log-likelihood maximization of the mixture model and the minimization of the term R, which specifies the topological relations. Finally, one should note that it is straightforward to extend our format to the heterogeneous case by taking prior probabilities P(i) ≠ 1/N.
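The three terms of the distortion functional of equation 3.1 are easy to evaluate numerically; the following sketch does so with our own variable names (a vectorized NumPy illustration, not the paper's code).

```python
import numpy as np

def distortion(V, W, sigma, sbar, Lam, p):
    """Evaluate the distortion functional, eq. 3.1.

    V : (M, d) sample, W : (N, d) centres, sigma : (N,) kernel radii,
    sbar : (N,) smoothed radii of eq. 2.6, Lam : (N, N) neighborhood
    matrix, p[mu, i] : posterior p_i^mu. All names are our own."""
    d = V.shape[1]
    N = W.shape[0]
    sq = ((V[:, None, :] - W[None, :, :]) ** 2).sum(-1)       # ||v^mu - w_i||^2
    quant = ((p @ Lam) * sq / (2 * sigma**2)[None, :]).sum()   # quantization term
    radii = (p.sum(0) * (d / 2) * np.log(sbar**2)).sum()       # variable-radii term
    xent = (p * np.log(N * p)).sum()                           # cross-entropy vs. 1/N
    return quant + radii + xent
```

With equal radii the second term reduces to a constant, recovering the homoscedastic case discussed above.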
4 Alternative Kernel Definition

Let us adopt the following kernel function,

\[
K_{i}(v,W,\sigma)=\frac{1}{Z_{i}}
\exp\!\left(-\sum_{r}\Lambda_{ri}\frac{\|v-w_{r}\|^{2}}{2\sigma_{r}^{2}}\right),
\tag{4.1}
\]

with Z_i such that ∫_V K_i(v, W, σ) dv = 1 (normalization constant), namely,

\[
Z_{i}=(\sqrt{2\pi}\,\bar\sigma_{i})^{d}
\exp\!\left(-\sum_{j}\frac{\Lambda_{ij}}{2\sigma_{j}^{2}}\|\bar w_{i}-w_{j}\|^{2}\right),
\tag{4.2}
\]

and which can be used in the following kernel-based mixture estimate:

\[
p(v)\approx\tilde p(v\mid W,\sigma)=\frac{1}{N}\sum_{i}K_{i}(v,W,\sigma).
\tag{4.3}
\]

Consider the following error function,

\[
E(W,\sigma)=\sum_{\mu}\sum_{i}\sum_{r}\Lambda_{ri}\,p_{r}^{\mu}\,
\frac{\|v^{\mu}-w_{i}\|^{2}}{2\sigma_{i}^{2}}
+\sum_{\mu}\sum_{i}p_{i}^{\mu}
\left(\log\frac{p_{i}^{\mu}}{1/N}+\log Z_{i}\right),
\tag{4.4}
\]
which differs from equation 3.1 in the log Z_i term. Filling in the expression for p_i^µ, equation 3.3, into the expression for E(W, σ), and after some algebraic manipulations, we obtain that

\[
E=-M\log L,
\tag{4.5}
\]

which shows the immediate link between log-likelihood maximization and distortion minimization. Conceptually, the alternative kernel is also simpler, since it includes the topological information in its argument. If we take M → ∞ in the previous equation, then we can write the expectation of E as

\[
\mathcal{E}(E)=-\int_{V}p(v)\log\tilde p(v\mid W,\sigma)\,dv
\ \ge\ -\int_{V}p(v)\log p(v)\,dv=\mathcal{E}(E)_{\min}.
\]

Hence, \(\mathcal{E}(E)-\mathcal{E}(E)_{\min}=KL(p,\tilde p)\), with the latter the Kullback-Leibler divergence. Or, when put in the finite-sample format, E = −M log L = M(KL + entropy). Since the log likelihood is biased by the differential entropy, we cannot use it as a metric for monitoring the convergence process when the actual density p is not known. We suggest a way around this by replacing the differential entropy by the estimate −∫_V p̃ log p̃ dv:

\[
LL=-\int_{V}p(v)\log\tilde p(v\mid W,\sigma)\,dv
+\int_{V}\tilde p(v\mid W,\sigma)\log\tilde p(v\mid W,\sigma)\,dv,
\tag{4.6}
\]
which we call the differential log-likelihood, in analogy with the differential entropy (thus also with respect to a bias). When the density estimate equals the true density, this metric equals zero. The advantage is that we can generate as large a data set as desired from p̃ (Monte Carlo simulation of data) for obtaining an accurate kernel density estimate of the entropy (Ahmad & Lin, 1976), or a sample-spacing estimate in the one-dimensional case (Beirlant, Dudewicz, Györfi, & van der Meulen, 1997).

5 Simulation: Topographic Map Formation

Consider the standard case of a 10 × 10 lattice that is trained according to the alternative kernel formulation on a sample S of size M = 1000 drawn from the uniform distribution in the unit square [−1, 1]². The weights are initialized by sampling the same distribution; the radii are initialized by sampling the uniform distribution [0.25, 0.75]. We use a gaussian neighborhood function,

\[
\Lambda(i,j,\sigma)=\exp\!\left(-\frac{\|r_{i}-r_{j}\|^{2}}{2\sigma^{2}}\right),
\tag{5.1}
\]

with σ the neighborhood radius and r_i neuron i's lattice coordinate. We adopt the following neighborhood cooling scheme,

\[
\sigma(t)=\sigma_{0}\exp\!\left(-\sigma_{0}\frac{t}{t_{\max}}\right),
\tag{5.2}
\]

with t the current epoch, t_max the maximum number of epochs, and σ_0 the radius at t = 0. We take t_max = 50, σ_0 = 5. The results are shown in Figure 1. We observe that the lattice unfolds extremely rapidly; initially, the weights are at the sample's centroid, and the kernel radii grow rapidly in size and span a considerable part of the input distribution. Finally, we observe in the evolution of LL (see Figure 2) that from about 30 epochs onward, it hovers around zero, as desired.

6 Correspondence with Other Algorithms

We can verify that when taking σ_i = 1/√β, ∀i (i.e., the homoscedastic gaussian mixture case), equation 3.1 becomes equal to equation 1 in Heskes (2001), except for the second term, which is missing in Heskes's case (since it becomes a constant). Heskes established the link between the error function and the log likelihood in an approximate sense, since the second term was omitted, as well as the scale factor M(√(2π))^d. Hence, we can say that our first approach is the generalization of Heskes's to the heteroscedastic case. Our approach with the alternative kernel function is, however, different and
Figure 1: Evolution of a 10 × 10 lattice as a function of time (epochs). The circles demarcate the radii σ_i of the corresponding gaussian kernels. The boxes correspond to the range spanned by the uniform input density. The values given below the boxes represent the number of epochs elapsed.

Figure 2: Evolution of LL as a function of epochs.
becomes equal to Heskes's only when the neighborhood range vanishes (Λ_ij = δ_ij). The advantage of the heteroscedastic case is that no separate procedure is needed for optimizing the (common) kernel radius. A benefit is also expected when the input density consists of components with strongly varying spreads.

Yin and Allinson (2001) used the Kullback-Leibler divergence KL(p, p̃) as the goal function. Since for our alternative kernel function we could write that KL(p, p̃) = \(\mathcal{E}(E)-\mathcal{E}(E)_{\min}\), with the latter term a constant given the input density, performing gradient descent on KL(p, p̃) and on \(\mathcal{E}(E)\) is equivalent. The weight and radii update rules should correspond to ours (albeit they used incremental learning); however, they were developed differently, since Yin and Allinson considered the posterior p_i^µ to be an approximation of the neighborhood function (in input space coordinates) and further simplified the radii update rules by adopting a competitive stage (winner i*_µ = arg max_i p_i^µ). We have separate terms for the neighborhood function (in lattice space coordinates) and the posteriors, and no competitive stage.

In the soft topographic vector quantization (STVQ) algorithm for homoscedastic mixture modeling (Graepel, Burger, & Obermayer, 1997), the same fixed-point rule for the weight updates is obtained as ours, but with a common gaussian kernel radius σ_i = 1/√β, ∀i, with β the inverse temperature, which is decreased exponentially over time (deterministic annealing) and thus serves a different purpose. Graepel et al. (1998) also proposed the kernel-based soft topographic mapping (STMK) algorithm by introducing a nonlinear transformation that maps the data points to a high-dimensional "feature" space and, in addition, admits a kernel function, such as a gaussian with fixed radius, as in (kernel-based) support vector machines (SVMs) (Vapnik, 1995). The topographic map's parameters ("weights") are expressed as linear combinations of the transformed inputs, so that the map is in fact developed in feature space rather than in input space directly, as in our case. The idea of Graepel and coworkers was taken up again by András (2002), but with the purpose of optimizing the map's classification performance by individually adjusting the kernel radii using a supervised learning algorithm. This basically sets András's approach apart from ours. In the kernel-based maximum entropy learning rule (kMER) (Van Hulle, 1998), the kernel outputs are thresholded and, depending on the binary activations, the kernel centers and radii are adapted. In the current approach, both the activation states and the learning rules depend on the continuously graded kernel outputs.

Finally, when taking p_i^µ = δ_{i i*}, with i* = arg min_i ‖v^µ − w_i‖², that is, the "winning" neuron, we obtain the SOM algorithm in batch mode (Kohonen, 1995), which adapts only the w_i. When we take p_i^µ = δ_{i i*_µ}, with i*_µ = arg min_i Σ_j Λ_ij ‖v^µ − w_j‖², and σ_i = 1/√β, ∀i, in equation 3.1, then the error function becomes E = (β/2) Σ_µ Σ_i Λ_{i i*_µ} ‖v^µ − w_i‖², plus a nonrelevant constant. This is the format introduced by Heskes and Kappen (1993) and Luttrell (1991) in order to derive the SOM weight update rule from an error function.
7 Regularization

Interestingly, we can derive a likelihood-based regularization term for the likelihood functional, equation 2.3, that will help in finding relevant local minima (e.g., nonvanishing kernel radii). Since P(i) = ∫_V p(v) P(i|v) dv, with in our case P(i) = 1/N, we should have that P̂(i) = (1/M) Σ_µ p_i^µ ≈ 1/N (note that the
maximum likelihood estimate is asymptotically unbiased for large M; for a textbook account, see Rice, 1995). We suggest the following log-likelihood regularization term,

\[
R=-\frac{1}{NM}\sum_{\mu}\sum_{i}\log\!\left(M\,p_{i}^{\mu}\right)
=-\frac{1}{NM}\sum_{\mu}\sum_{i}\log p_{i}^{\mu}-\log M,
\tag{7.1}
\]

which is minimized when p_i^µ becomes similar to 1/N. We can further drop the constant −log M, since it is irrelevant to the minimization. The regularized log-likelihood functional then becomes F_reg = F + αR, with α a parameter (usually α ≪ 1). Note that with this term, −Σ_µ Σ_i (1/(NM)) log p_i^µ is minimized, or Σ_µ Σ_i (1/(NM)) log p_i^µ maximized, whereas in the error function E(·), equation 3.1, the regularization term Σ_µ Σ_i p_i^µ log(p_i^µ/(1/N)) is minimized. Hence, the two regularization terms are different.

8 Simulations: Regularization

Consider a one-dimensional lattice ("chain") consisting of N = 9 neurons with gaussian kernels, trained on a sample S of size M = 1000 drawn from the standard normal distribution G(0, 1). We take t_max = 50, σ_0 = 4.5, σ_i(0) = 0.2, ∀i. In order to explore the effect of the log-likelihood constraint, we consider α = 0.001 (with the constraint) and α = 0 (without it). We initialize the kernel centers by taking N = 9 data points from the uniform distribution [−2, 2], from the uniform distribution [−1, 1], or from the sample set itself; for the initial radii, we take 0.2. We run each simulation 19 times and determine, for an independent test set of size M = 1000, the median value of the mean squared error (MSE) between the density estimate and the true distribution, as well as the first and third quartiles (the mean value is dominated by unsuccessful runs). The results are summarized in Table 1. We repeat the simulations for a skewed distribution consisting of a mixture of two gaussians, G(0, 1) and G(1.5, 1/3), with mixing coefficients 3/4 and 1/4. The results are summarized in Table 2. Both sets of results show a clear advantage of likelihood-based regularization in the case of a uniform random initialization, as is usually done in topographic map formation, and thus also that the cross-entropy regularization term in equation 3.1 is not very effective.
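The regularizer used in these simulations, the (1/(NM)) Σ log p_i^µ term of equation 7.1 with the irrelevant constant dropped, can be sketched as follows (the function name is ours):

```python
import numpy as np

def loglik_regularizer(p):
    """Log-likelihood regularizer R of eq. 7.1, with the irrelevant
    constant -log M dropped; p[mu, i] holds the posterior p_i^mu.
    A sketch with our own naming."""
    M, N = p.shape
    return -np.log(p).sum() / (N * M)
```

As the text notes, R is smallest when the posteriors are uniform (p_i^µ = 1/N), which is what discourages degenerate solutions such as vanishing kernel radii.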
For the case where the data set is sampled, the advantage is less clear; however, it can still matter for sparse, distributed data sets.

9 Conclusion

We have introduced a log-likelihood maximization approach to variable kernel-based topographic map formation. We have shown that for a specific kernel definition, a simple relation exists between distortion minimization, log-likelihood maximization, and Kullback-Leibler divergence. We have
Table 1: MSE Results (in 10⁻⁴) in the Case of a Standard Normal Input Density.

Initialization | α = 0.001          | α = 0
[−2, 2]        | 1.46 [0.686, 1.91] | 27.1 [3.33, 433]
[−1, 1]        | 1.33 [0.789, 2.00] | 2.12 [0.984, 2.95]
S              | 1.20 [0.850, 2.44] | 1.29 [0.724, 2.18]

Notes: Results are listed in terms of medians and first and third quartiles (in brackets). The input density is estimated using 1000 data points and N = 9 gaussian kernels, with (α = 0.001) and without (α = 0) the log-likelihood constraint, and for three different kernel center initializations: by means of samples taken from the uniform distributions [−2, 2] and [−1, 1], or from the sample S.
Table 2: MSE Results (in 10⁻⁴) in the Case of a Skewed Bimodal Input Density.

Initialization | α = 0.001          | α = 0
[−2, 2]        | 2.15 [1.32, 3.36]  | 326 [4.03, 651]
[−1, 1]        | 1.92 [1.06, 2.69]  | 2.08 [0.875, 2.56]
S              | 2.18 [0.961, 2.99] | 2.28 [0.948, 3.41]

Note: Same conventions as in Table 1.
also introduced a regularizer and suggested a possible metric, the differential log-likelihood, for monitoring the convergence process. There are at least two ways to extend the practical value of this work, as exemplified in Heskes (2001): the extension to the missing-values case in an EM sense, and the use of the marginal probabilities of the neurons of the trained topographic map for data visualization and exploration purposes. All these are clear advantages over the classic SOM algorithm. Showing the practical value of the differential log-likelihood metric is the subject of a follow-up technical paper.

Acknowledgments

I was supported by research grants from the Belgian Fund for Scientific Research–Flanders (G.0248.03 and G.0234.04), the Flemish Regional Ministry of Education (Belgium) (GOA 2000/11), and the European Commission (IST-2001-32114 and IST-2002-001917).

References

Ahmad, I. A., & Lin, P. E. (1976). A nonparametric estimation of the entropy for absolutely continuous distributions. IEEE Trans. Information Theory, 22, 372–375.
András, P. (2002). Kernel-Kohonen networks. Int. J. Neural Systems, 12(2), 117–135.
Beirlant, J., Dudewicz, E. J., Györfi, L., & van der Meulen, E. C. (1997). Nonparametric entropy estimation: An overview. Int. J. Math. and Statistical Sciences, 6, 17–39.
Benaim, M., & Tomasini, L. (1991). Competitive and self-organizing algorithms based on the minimization of an information criterion. In Proc. ICANN'91 (pp. 391–396). Amsterdam: North-Holland.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood for incomplete data via the EM algorithm. J. Roy. Statist. Soc., B, 39, 1–38.
Graepel, T., Burger, M., & Obermayer, K. (1997). Phase transitions in stochastic self-organizing maps. Physical Rev. E, 56(4), 3876–3890.
Graepel, T., Burger, M., & Obermayer, K. (1998). Self-organizing maps: Generalizations and new optimization techniques. Neurocomputing, 21, 173–190.
Heskes, T. (2001). Self-organizing maps, vector quantization, and mixture modeling. IEEE Trans. Neural Networks, 12(6), 1299–1305.
Heskes, T. M., & Kappen, B. (1993). Error potentials for self-organization. In Proc. IEEE Int. Conf. on Neural Networks (pp. 1219–1223). New York: IEEE.
Kohonen, T. (1995). Self-organizing maps. Heidelberg: Springer.
Linsker, R. (1989). How to generate ordered maps by maximizing the mutual information between input and output signals. Neural Computat., 1, 402–411.
Luttrell, S. P. (1991). Code vector density in topographic mappings: Scalar case. IEEE Trans. Neural Networks, 2, 427–436.
Redner, R. A., & Walker, H. F. (1984). Mixture densities, maximum likelihood and the EM algorithm. SIAM Review, 26(2), 195–239.
Rice, J. A. (1995). Mathematical statistics and data analysis. Belmont, CA: Wadsworth.
Van Hulle, M. M. (1998). Kernel-based equiprobabilistic topographic map formation. Neural Computat., 10(7), 1847–1871.
Van Hulle, M. M. (2002). Joint entropy maximization in kernel-based topographic maps. Neural Computat., 14(8), 1887–1906.
Vapnik, V. N. (1995). The nature of statistical learning theory. New York: Springer-Verlag.
Yin, H., & Allinson, N. M. (2001). Self-organizing mixture networks for probability density estimation. IEEE Trans. Neural Networks, 12, 405–411.

Received April 28, 2004; accepted July 27, 2004.
NOTE
Communicated by Kiichi Urahama
On Convergence Conditions of an Extended Projection Neural Network Youshen Xia
[email protected] Department of Applied Mathematics, Nanjing University of Posts and Telecommunications, China
Gang Feng Department of Manufacturing Engg. and Engg. Management, The City University of Hong Kong, Hong Kong, China
Neural Computation 17, 515–525 (2005). © 2005 Massachusetts Institute of Technology

The output trajectory convergence of an extended projection neural network was previously established under the condition that the Jacobian matrix of the nonlinear mapping is positive definite. This note offers several new convergence results. The state trajectory convergence and the output trajectory convergence of the extended projection neural network are obtained under the positive semidefiniteness condition of the Jacobian matrix. Comparison and illustrative examples demonstrate the applied significance of these new results.

1 Introduction

We are concerned with the following extended projection neural network developed in Xia (2004):

State equation:

\[
\frac{du}{dt}=\lambda
\begin{pmatrix}
P_{X}\bigl(x-(F(x)+\nabla g(x)y-B^{T}z)\bigr)-x\\[2pt]
(y+g(x))^{+}-y\\[2pt]
-Bx+b
\end{pmatrix}
\tag{1.1}
\]

Output equation: w(t) = x(t), where λ > 0 is a scaling constant, u = (x, y, z) ∈ R^n × R^m × R^r is a state vector, w is an output vector, B ∈ R^{r×n}, b ∈ R^r, F(x) is a continuously differentiable vector-valued function from R^n into R^n, g(x) = [g_1(x), …, g_m(x)]^T, the g_i(x) (i = 1, …, m) are twice differentiable functions from R^n into R^1, ∇g(x) = (∇g_1(x), …, ∇g_m(x)), ∇g_i(x) is the gradient of g_i (i = 1, …, m), X = {x ∈ R^n | l ≤ x ≤ h}, l = [l_1, …, l_n]^T, h = [h_1, …, h_n]^T, (y)^+ = [(y_1)^+, …, (y_m)^+]^T, (y_i)^+ =
max{0, y_i}, and P_X : R^n → X is a projection operator defined componentwise by

\[
P_{X}(x_{i})=
\begin{cases}
l_{i}, & x_{i}<l_{i}\\
x_{i}, & l_{i}\le x_{i}\le h_{i}\\
h_{i}, & x_{i}>h_{i}.
\end{cases}
\]

As shown in Xia (2004), the extended projection neural network in equation 1.1 has a low computational complexity and a fast convergence rate. Moreover, its equilibrium point is related to the solution of the following constrained variational inequality,

\[
(x-x^{*})^{T}F(x^{*})\ge 0,\quad x\in\Omega,
\tag{1.2}
\]

where Ω = {x ∈ R^n | g(x) ≤ 0, Bx = b, l ≤ x ≤ h}. More exactly, if u* = (x*, y*, z*) is an equilibrium point of equation 1.1, then x* is a solution of equation 1.2 (Bertsekas & Tsitsiklis, 1989; Harker & Pang, 1990). Thus, the desired solution to equation 1.2 can be obtained by tracking the continuous trajectory of equation 1.1 in real time (Cichocki & Unbehauen, 1993; Golden, 1996; Urahama, 1996). The output trajectory convergence of the extended projection neural network in equation 1.1 has been proven under the condition that the Jacobian matrix ∇F(x) of F is positive definite on X and the constraint functions g_i(x) (i = 1, …, m) are convex on X (Xia, 2004). As a result, the extended neural network is theoretically guaranteed to solve strictly monotone variational inequality problems with general convex constraints. It is well known that studying the state trajectory convergence of the extended neural network is of research interest and applied significance. On the other hand, in much real-world engineering optimization, the nonlinear mapping F(x) is monotone rather than strictly monotone (De Farias & Van Roy, 2003). Moreover, there exist many cases in which F(x) and ∇g_i(x) are not monotone (Oja, 1992; Bell & Sejnowski, 1995; Welling & Webber, 2001; Ahn & Oh, 2003). Therefore, studying weak convergence conditions of the extended projection neural network in equation 1.1 is very desirable for wide optimization and engineering applications. This note proposes new conditions for the global convergence of the extended projection neural network in equation 1.1. The proposed new results not only remove the positive definiteness condition of the Jacobian matrix of F on X but also allow F(x) and ∇g_i(x) to be nonmonotone. Therefore, the extended projection neural network is further guaranteed to solve constrained monotone variational inequality problems and a class of constrained nonmonotonic variational inequality problems.

2 Main Results

Definition 1.
A mapping F is said to be monotone on X if, for each pair of points x, y ∈ X, we have (x − y, F(x) − F(y)) ≥ 0,
where (·, ·) denotes an inner product. F is said to be strictly monotone on X if the strict inequality holds whenever x ≠ y.

Throughout the convergence analysis, we assume that the extended projection neural network in equation 1.1 has at least one equilibrium point u* = (x*, y*, z*). We denote the Hessian matrix of g_i(x) by ∇²g_i(x). Our main results on the convergence of the extended projection neural network in equation 1.1 are the following.

Theorem 1. Assume that F(x) and ∇g_i(x) (i = 1, …, m) are monotone on X. If ∇F(x*) is positive definite, then the extended projection neural network in equation 1.1 converges globally to an equilibrium point of equation 1.1.

Proof. Let u(t) = (x(t), y(t), z(t)) be the state trajectory of equation 1.1 with the initial point u(t_0) = (x(t_0), y(t_0), z(t_0)). Since by Xia and Wang (2000) u(t) will exponentially approach the set Ω_0 = {u = (x, y, z) ∈ R^{n+m+r} | x ∈ X, y ∈ R^m_+} if x(t_0) ∉ X and y(t_0) < 0, without loss of generality we assume that x(t_0) ∈ X and y(t_0) ≥ 0. Consider the following Lyapunov function,

\[
V(u)=H(u)^{T}T(u)-\frac{1}{2}\|T(u)\|^{2}+\frac{1}{2}\|u-u^{*}\|^{2},
\quad u\in\Omega_{0},
\tag{2.1}
\]
where u* = (x*, y*, z*) is an equilibrium point of equation 1.1, and

\[
H(u)=\begin{pmatrix}F(x)+\nabla g(x)y-B^{T}z\\-g(x)\\Bx-b\end{pmatrix},
\qquad
T(u)=\begin{pmatrix}x-P_{X}\bigl(x-(F(x)+\nabla g(x)y-B^{T}z)\bigr)\\ y-(y+g(x))^{+}\\ Bx-b\end{pmatrix}.
\]

From the result obtained by Xia (2004), we know that V(u) ≥ (1/2)‖u − u*‖², u(t) ⊂ Ω_0, and

\[
\frac{dV(u)}{dt}\le
-\lambda\,(H(u^{*})-H(u))^{T}(u^{*}-u)-\lambda\,T(u)^{T}\nabla H(u)T(u),
\tag{2.2}
\]

where

\[
\nabla H(u)=\begin{pmatrix}
\nabla F(x)+\sum_{i=1}^{m}y_{i}\nabla^{2}g_{i}(x) & \nabla g(x) & -B^{T}\\
-\nabla g(x)^{T} & O_{1} & O_{2}\\
B & O_{3} & O_{4}
\end{pmatrix},
\]
and O_1 ∈ R^{m×m}, O_2 ∈ R^{m×r}, O_3 ∈ R^{r×m}, and O_4 ∈ R^{r×r} are zero matrices. Since F(x) and the ∇g_i(x) (i = 1, …, m) are monotone on X, ∇F(x) + Σ_{i=1}^m y_i ∇²g_i(x) is positive semidefinite on Ω_0. Thus, ∇H(u) is positive semidefinite, and H(u) is monotone on Ω_0. It follows that dV(u)/dt ≤ 0, and thus {u(t)} is bounded. By the invariant set theorem (Golden, 1996), all solution trajectories of the extended projection neural network in equation 1.1 converge to the largest invariant set χ, where dV(u)/dt = 0. We now prove that dV/dt = 0 if and only if du/dt = 0. Clearly, if du/dt = 0, then dV(u)/dt = (dV(u)/du)^T (du/dt) = 0. Let û = (x̂, ŷ, ẑ) ∈ χ. Then dV/dt = 0 implies (H(u*) − H(û))^T (u* − û) = 0. Since

\[
(H(u^{*})-H(\hat u))^{T}(u^{*}-\hat u)
=\int_{0}^{1}(x^{*}-\hat x)^{T}
\Bigl(\nabla F(x_{s})+\sum_{i=1}^{m}\nabla^{2}g_{i}(x_{s})\,y_{i}(s)\Bigr)
(x^{*}-\hat x)\,ds,
\]

where x_s = x̂ + s(x* − x̂) and y_i(s) = ŷ_i + s(y_i* − ŷ_i) (0 ≤ s ≤ 1), we have

\[
(\hat x-x^{*})^{T}
\Bigl(\nabla F(x_{s})+\sum_{i=1}^{m}\nabla^{2}g_{i}(x_{s})\,y_{i}(s)\Bigr)
(\hat x-x^{*})=0.
\]

Let s = 1. Then x_s = x*, y_i(s) = y_i*, and (x̂ − x*)^T (∇F(x*) + Σ_{i=1}^m ∇²g_i(x*) y_i*)(x̂ − x*) = 0. Since ∇F(x*) + Σ_{i=1}^m ∇²g_i(x*) y_i* is positive definite, x̂ = x*, and thus

\[
T(\hat u)^{T}\nabla H(\hat u)T(\hat u)
=r(\hat u)^{T}\Bigl(\nabla F(x^{*})+\sum_{i=1}^{m}\nabla^{2}g_{i}(x^{*})\,y_{i}^{*}\Bigr)r(\hat u)=0,
\]

where r(û) = P_X(x̂ − (F(x̂) + ∇g(x̂)ŷ − B^T ẑ)) − x̂. It follows that r(û) = 0. Thus, dx/dt = λ r(û) = 0 and dz/dt = λ(−Bx̂ + b) = 0. Finally, consider dy/dt. Note that dy/dt = λ((ŷ + g(x̂))^+ − ŷ) = −λα ŷ (0 ≤ α ≤ 1). When α > 0, we have ŷ = 0, since {y(t)} is bounded. Thus, dy/dt = 0 also. So dV/dt = 0 if and only if du/dt = 0, and the conclusion of theorem 1 holds.

As an important corollary of theorem 1, we have the following result.

Corollary 1. Assume that F(x) + ∇g(x)y(t) is monotone on X, where y(t) is the trajectory defined in equation 1.1. If ∇F(x*) is positive definite, then the extended projection neural network in equation 1.1 is globally convergent to an equilibrium point of equation 1.1.

Proof. Let u(t) = (x(t), y(t), z(t)) be the state trajectory of equation 1.1 with the initial point (x(t_0), y(t_0), z(t_0)), where x(t_0) ∈ X and y(t_0) ≥ 0. Consider the Lyapunov function V(u) defined in equation 2.1. Note that
∇F(x) + Σ_{i=1}^m ∇²g_i(x) y_i(t) is positive semidefinite on X when F(x) + ∇g(x)y(t) is monotone on X. Similar to the analysis in theorem 1, we can obtain that dV/dt ≤ 0 and that dV/dt = 0 if and only if du/dt = 0. Therefore, the conclusion of corollary 1 holds.

Theorem 2. Assume that F(x) and ∇g_i(x) (i = 1, …, m) are monotone on X. If there exists a nonnegative vector p = [p_1, …, p_m]^T ≤ y* such that ∇F(x) + Σ_{i=1}^m ∇²g_i(x) p_i is positive definite on X, then the output trajectory of the extended projection neural network in equation 1.1 with y(t_0) ≥ p converges to a solution of equation 1.2 at the rate

\[
\|w(t)-x^{*}\|^{2}\le\frac{\delta}{\lambda\,(t-t_{0})},\quad\forall t>t_{0},
\tag{2.3}
\]
where δ is a positive constant.

Proof. Let u(t) = (x(t), y(t), z(t)) be the state trajectory of equation 1.1 with the initial point (x(t_0), y(t_0), z(t_0)) satisfying x(t_0) ∈ X and y(t_0) ≥ p. Consider the Lyapunov function V(u) defined in equation 2.1. By the analysis of theorem 1, we have

\[
\lambda\int_{t_{0}}^{t}(H(u(\tau))-H(u^{*}))^{T}(u(\tau)-u^{*})\,d\tau
\le 2V(u(t_{0}))-\lambda\int_{t_{0}}^{t}T(u(\tau))^{T}\nabla H(u(\tau))T(u(\tau))\,d\tau.
\]

Since dy(t)/dt + y = (y + g(x))^+,

\[
y(t)=e^{-(t-t_{0})}y(t_{0})+e^{-t}\int_{t_{0}}^{t}e^{s}(y+g(x))^{+}\,ds.
\]

Then y(t) ≥ e^{−(t−t_0)} p for t ≥ t_0. Using r(u) = P_X(x − (F(x) + ∇g(x)y − B^T z)) − x, we have

\[
\begin{aligned}
T(u)^{T}\nabla H(u(t))T(u)
&=r(u)^{T}\Bigl(\nabla F(x(t))+\sum_{i=1}^{m}\nabla^{2}g_{i}(x(t))\,y_{i}(t)\Bigr)r(u)\\
&\ge r(u)^{T}\Bigl(\nabla F(x(t))+e^{-(t-t_{0})}\sum_{i=1}^{m}\nabla^{2}g_{i}(x(t))\,p_{i}\Bigr)r(u)\\
&=e^{-(t-t_{0})}\,r(u)^{T}\Bigl(e^{t-t_{0}}\,\nabla F(x(t))+\sum_{i=1}^{m}\nabla^{2}g_{i}(x(t))\,p_{i}\Bigr)r(u)\\
&\ge e^{-(t-t_{0})}\,r(u)^{T}\Bigl(\nabla F(x(t))+\sum_{i=1}^{m}\nabla^{2}g_{i}(x(t))\,p_{i}\Bigr)r(u)\ \ge\ 0.
\end{aligned}
\]
On the other hand, note that

\[
(H(u)-H(u^{*}))^{T}(u-u^{*})
=\int_{0}^{1}(x-x^{*})^{T}
\Bigl(\nabla F(x_{s})+\sum_{i=1}^{m}\nabla^{2}g_{i}(x_{s})\,y_{i}(s)\Bigr)
(x-x^{*})\,ds,
\]

where x_s = x* + s(x − x*) and y_i(s) = y_i* + s(y_i − y_i*) for 0 ≤ s ≤ 1. Then

\[
\begin{aligned}
&\int_{t_{0}}^{t}\int_{0}^{1}(x(\tau)-x^{*})^{T}
\Bigl(\nabla F(x_{s})+\sum_{i}\nabla^{2}g_{i}(x_{s})\,y_{i}(s)\Bigr)
(x(\tau)-x^{*})\,ds\,d\tau\\
&\quad=\int_{t_{0}}^{t}\int_{0}^{1}(u(\tau)-u^{*})^{T}\,
\nabla H\bigl(u^{*}+s(u(\tau)-u^{*})\bigr)\,(u(\tau)-u^{*})\,ds\,d\tau\\
&\quad=\int_{t_{0}}^{t}(H(u(\tau))-H(u^{*}))^{T}(u(\tau)-u^{*})\,d\tau
\ \le\ V(u(t_{0}))/\lambda.
\end{aligned}
\]
Since y* ≥ 0, y_i(s) = (1 − s) y_i* + s y_i ≥ s y_i(t) ≥ e^{−(t−t_0)} s p_i (i = 1, …, m). It follows that for 0 < s ≤ 1,

\[
\begin{aligned}
&(x-x^{*})^{T}\Bigl(\nabla F(x_{s})+\sum_{i=1}^{m}\nabla^{2}g_{i}(x_{s})\,y_{i}(s)\Bigr)(x-x^{*})\\
&\quad\ge(x-x^{*})^{T}\Bigl(\nabla F(x_{s})+s\,e^{-(t-t_{0})}\sum_{i=1}^{m}\nabla^{2}g_{i}(x_{s})\,p_{i}\Bigr)(x-x^{*})\\
&\quad\ge s\,e^{-(t-t_{0})}\,(x-x^{*})^{T}\Bigl(\nabla F(x_{s})+\sum_{i=1}^{m}\nabla^{2}g_{i}(x_{s})\,p_{i}\Bigr)(x-x^{*})\ \ge\ 0.
\end{aligned}
\]
Note also that for 0 ≤ s < 1,

\[
\begin{aligned}
\int_{0}^{1}(x-x^{*})^{T}\Bigl(\nabla F(x_{s})+\sum_{i}\nabla^{2}g_{i}(x_{s})\,y_{i}(s)\Bigr)(x-x^{*})\,ds
&\ge\int_{0}^{1}(x-x^{*})^{T}\Bigl(\nabla F(x_{s})+(1-s)\sum_{i}\nabla^{2}g_{i}(x_{s})\,y_{i}^{*}\Bigr)(x-x^{*})\,ds\\
&\ge\int_{0}^{1}(1-s)\,(x-x^{*})^{T}\Bigl(\nabla F(x_{s})+\sum_{i}\nabla^{2}g_{i}(x_{s})\,y_{i}^{*}\Bigr)(x-x^{*})\,ds.
\end{aligned}
\]
Using y* ≥ p, we have

\[
\int_{t_{0}}^{t}\int_{0}^{1}(1-s)\,(x(\tau)-x^{*})^{T}
\Bigl(\nabla F(x_{s})+\sum_{i}\nabla^{2}g_{i}(x_{s})\,p_{i}\Bigr)
(x(\tau)-x^{*})\,ds\,d\tau\ \le\ V(u(t_{0}))/\lambda.
\]

Since ∇F(x) + Σ_{i=1}^m ∇²g_i(x) p_i is positive definite on X, there exists µ > 0 such that

\[
(x-x^{*})^{T}\Bigl(\nabla F(x_{s})+\sum_{i=1}^{m}\nabla^{2}g_{i}(x_{s})\,p_{i}\Bigr)(x-x^{*})\ \ge\ \mu\|x-x^{*}\|^{2}.
\]

It follows that

\[
\mu\int_{t_{0}}^{t}\|x(\tau)-x^{*}\|^{2}\,d\tau\int_{0}^{1}(1-s)\,ds\ \le\ \frac{V(u(t_{0}))}{\lambda},
\]

and

\[
\int_{t_{0}}^{t}\|x(\tau)-x^{*}\|^{2}\,d\tau\ \le\ \frac{2V(u(t_{0}))}{\lambda\mu}.
\]

Thus, for any t > t_0,

\[
\|x(t)-x^{*}\|^{2}<\|x(\hat t\,)-x^{*}\|^{2}
=\frac{\int_{t_{0}}^{t}\|x(s)-x^{*}\|^{2}\,ds}{t-t_{0}}
\ \le\ \frac{\delta}{\lambda\,(t-t_{0})},
\]

where t_0 < t̂ < t and δ = 2V(u(t_0))/µ. Hence,

\[
\|w(t)-x^{*}\|^{2}\le\frac{\delta}{\lambda\,(t-t_{0})},\quad\forall t>t_{0}.
\]
As an important corollary of theorem 2, we have the following result.

Corollary 2. Assume that ∇F(x) + Σ_{i=1}^m ∇²g_i(x) y_i(t) is positive definite on X, where y(t) is the trajectory defined in equation 1.1. Then the output trajectory of the extended neural network in equation 1.1 is globally convergent to a solution of equation 1.2, with a convergence rate of the form 2.3.

Proof. Let u(t) = (x(t), y(t), z(t)) be the state trajectory of equation 1.1 with the initial point (x(t_0), y(t_0), z(t_0)), where x(t_0) ∈ X and y(t_0) ≥ 0. Consider the Lyapunov function V(u) defined in equation 2.1. According to the analysis in theorem 2, we can obtain that dV/dt ≤ 0 and

\[
\int_{t_{0}}^{t}\int_{0}^{1}(x(\tau)-x^{*})^{T}
\Bigl(\nabla F(x_{s})+\sum_{i=1}^{m}\nabla^{2}g_{i}(x_{s})\,y_{i}(s)\Bigr)
(x(\tau)-x^{*})\,ds\,d\tau\ \le\ V(u(t_{0}))/\lambda.
\]
Therefore, from the rest of the analysis of theorem 2, the conclusion of corollary 2 follows.
3 Comparison and Illustrative Examples

In Xia (2004), the existing convergence condition for the extended projection neural network in equation 1.1 is that ∇F(x) is positive definite on X and ∇g_i(x) (i = 1, …, m) are monotone. It is easy to see that theorems 1 and 2 have removed the positive definiteness condition on ∇F(x) on X. Moreover, theorem 1 shows the global convergence of the state trajectory of the extended neural network. Furthermore, from corollaries 1 and 2, we see that F(x) and ∇g_i(x) may be nonmonotone on X. Two examples follow.

Example 1. Consider the following quadratic program,

\[
\begin{aligned}
\text{minimize}\quad & f(x)=\tfrac{1}{2}x^{T}Qx+c^{T}x\\
\text{subject to}\quad & Dx\le b,
\end{aligned}
\tag{3.1}
\]

and its dual,

\[
\begin{aligned}
\text{maximize}\quad & -\tfrac{1}{2}y^{T}Hy+y^{T}d-\tfrac{1}{2}c^{T}Q^{-1}c\\
\text{subject to}\quad & y\ge 0,
\end{aligned}
\tag{3.2}
\]

where x ∈ R², y ∈ R⁴, H = DQ^{−1}D^T, d = −b − DQ^{−1}c, b = [35/12, 35/2, 5, 5]^T, and

\[
D=\begin{pmatrix}5/12 & 5/2 & -1 & 0\\ -1 & 1 & 0 & 1\end{pmatrix}^{T},
\qquad
Q=\begin{pmatrix}2 & 1\\ 1 & 2\end{pmatrix},
\qquad
c=\begin{pmatrix}-30\\ -30\end{pmatrix}.
\]
(3.3)
On the other side, by the dual theory of convex program (Bertsekas, 1989), we see that (x∗ , y∗ ) is an optimal solution to equations 3.1 and 3.2, respec-
On Convergence Conditions of an Extended Projection Neural Network
523
tively, if and only if (x∗ , y∗ ) satisfies the Karush-Kuhn-Tucker (KKT) condition,
c + QT x + DT y = 0, yT (Dx − b) = 0, Dx ≤ b, y ≥ 0.
The KKT condition can be represented by
c + QT x + DT y = 0, (y + Dx − b)+ = y.
Thus, (x∗ , y∗ ) is an optimal solution to equations 3.1 and 3.2, respectively, if and only if (x∗ , y∗ ) is an equilibrium point of equation 3.3. The existing convergence result (Xia, 2004) can show that the output trajectory w(t) of the extended projection neural network is convergent to the optimal solution of equation 3.1, but it cannot ascertain the convergence of the trajectory y(t) to the optimal solution of equation 3.2. However, theorem 1 guarantees that the state trajectory of the extended projection neural network can converge to optimal solutions of both equations 3.1 and 3.2. All simulation results show that the state trajectory u(t) = (x(t), y(t)) of (8) is always convergent to u∗ = (x∗ , y∗ ). For example,let λ = 10. Figure 1 displays the transient behavior of the norm error u(t)−u∗ based on equation 3.3 with 10 random initial points. Example 2. Consider nonlinearly constrained variational inequality 1.2, where = {x ∈ R2 | g(x) ≤ 0, x ∈ X}, X = {x ∈ R2 | x ≥ 0},g(x) = [g1 (x), g2 (x)]T , g1 (x) = −x21 − x2 , g2 (x) = −x22 − x1 , and F(x) = [2x1 − 1, 2x2 − 0.5]T . This problem has a solution x∗ = [0.5, 0.25]T . It is easy to see that ∇F(x) is positive definite, but ∇ g1 (x) and ∇ g2 (x) are not monotone on X. Thus, the existing convergence result Xia (2004) cannot ascertain the extended neural network for solving the above problem. However, note that y(t) always converges to zero. Then there exists T > t0 such that when t ≥ T,
∇F(x) + Σ_{i=1}^{2} ∇^2 g_i(x) y_i(t) = diag(2 − 2y1(t), 2 − 2y2(t))

is positive definite on X. Thus, corollary 2 guarantees that the extended neural network in equation 1.1 globally converges to x∗ and has a convergence rate in the form of equation 2.3. All simulation results show that the output trajectory of equation 1.1 is always convergent to x∗. For example, let λ = 10. Figure 2 displays the transient behavior of the norm error ‖x(t) − x∗‖ based on equation 2.1 with 10 random initial points.
524
Y. Xia and G. Feng
[Plot omitted: ‖u(t) − u∗‖ versus Time (sec) on [0, 1].]
Figure 1: Convergence behavior of ‖u(t) − u∗‖ with 10 random initial points in example 1.
[Plot omitted: ‖x(t) − x∗‖ versus Time (sec) on [0, 1].]
Figure 2: Convergence behavior of ‖x(t) − x∗‖ with 10 random initial points in example 2.
Conclusion

This note proposes new convergence conditions for the extended neural network. Both theoretical analysis and computational examples show that the extended neural network can solve constrained monotone variational inequality problems and a class of constrained nonmonotonic variational inequality problems.

References

Ahn, J. H., & Oh, J. H. (2003). A constrained EM algorithm for principal component analysis. Neural Computation, 15, 57–66.
Bell, A. J., & Sejnowski, T. J. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7, 1129–1159.
Bertsekas, D. P., & Tsitsiklis, J. N. (1989). Parallel and distributed computation: Numerical methods. Englewood Cliffs, NJ: Prentice-Hall.
Cichocki, A., & Unbehauen, R. (1993). Neural networks for optimization and signal processing. New York: Wiley.
De Farias, D. P., & Van Roy, B. (2003). The linear programming approach to approximate dynamic programming. Operations Research, 51, 850–865.
Golden, R. M. (1996). Mathematical methods for neural network analysis and design. Cambridge, MA: MIT Press.
Harker, P. T., & Pang, J. S. (1990). Finite-dimensional variational inequality and nonlinear complementarity problems: A survey of theory, algorithms, and applications. Mathematical Programming, 48, 161–220.
Oja, E. (1992). Principal components, minor components and linear neural networks. Neural Networks, 5, 927–935.
Urahama, K. (1996). Gradient projection network: Analog solver for linearly constrained nonlinear programming. Neural Computation, 8, 1061–1074.
Welling, M., & Weber, M. (2001). A constrained EM algorithm for independent component analysis. Neural Computation, 13, 677–689.
Xia, Y. S. (2004). An extended neural network for constrained optimization. Neural Computation, 16, 863–883.
Xia, Y. S., & Wang, J. (2000). On the stability of globally projected dynamic systems. Journal of Optimization Theory and Applications, 106, 129–150.
Received April 23, 2004; accepted July 28, 2004.
LETTER
Communicated by Jerome Feldman and Peter Thomas
Memorization and Association on a Realistic Neural Model Leslie G. Valiant
[email protected] Division of Engineering and Applied Sciences, Harvard University, Cambridge, MA 02138, U.S.A.
A central open question of computational neuroscience is to identify the data structures and algorithms that are used in mammalian cortex to support successive acts of the basic cognitive tasks of memorization and association. This letter addresses the simultaneous challenges of realizing these two distinct tasks with the same data structure and of doing so while respecting the following four basic quantitative parameters of cortex: the neuron number, the synapse number, the synapse strengths, and the switching times. Previous work has not succeeded in reconciling these opposing constraints; the low values of synapse strengths that are typically observed experimentally have posed a particular obstacle. In this article, we describe a computational scheme that supports both memory formation and association and is feasible on networks of model neurons that respect the widely observed values of the four quantitative parameters. Our scheme allows for both disjoint and shared representations. The algorithms are simple, and in one version both memorization and association require just one step of vicinal or neighborly influence. The issues of interference among the different circuits that are established, of robustness to noise, and of the stability of the hierarchical memorization process are addressed. A calculus is therefore implied for analyzing the capabilities of particular neural systems and subsystems in terms of their basic numerical parameters.

1 Introduction

We consider four quantitative parameters of a neural system that together constrain its computational capabilities: the number of neurons, the number of neurons with which each neuron synapses, the strength of synaptic connections, and the speed of response of a neuron. The typical values that these parameters are believed to have in mammalian cortex appear to impose extremely severe constraints.
We believe that it is for this reason that computationally explicit mechanisms for realizing multiple cognitive tasks simultaneously on models having these typical cortical parameters have not been previously offered. Neural Computation 17, 527–555 (2005)
© 2005 Massachusetts Institute of Technology
528
L. Valiant
Estimates of these four cortical parameters are known for several systems. The number of neurons in mouse cortex has been estimated to be 1.6 × 10^7, while the corresponding estimate is in the region of 10^10 for humans (Braitenberg & Schuz, 1998). There also exist estimates of the number of neurons in different parts of cortex and in related structures such as the hippocampus and olfactory bulb. The number of neurons with which each neuron synapses, which we shall call the degree, is a little harder to measure. However, it is considered that the effect of multiple synapsing between pairs of neurons is small and therefore that this degree is close to the total number of synapses per neuron, which has been estimated to be 7800 in mouse cortex and in the 24,000 to 80,000 range in humans (Abeles, 1991). The third parameter, the synapse strength, presents a still more complex set of issues. The most basic question here is how many of a neuron's neighbors need to be sending an action potential in order to create an action potential in the neuron. Equivalently, the contribution of each synapse, the excitatory postsynaptic potential, can be measured in millivolts, and the fraction that this constitutes of the threshold voltage that needs to be overcome can be evaluated. While some moderately strong synapses have been recorded (Thomson, Deuchars, & West, 1993; Markram & Tsodyks, 1996a; Ali, Deuchars, Pawelzik, & Thomson, 1998), the average value is believed to be weak. The effective fraction of the threshold that each neighbor contributes has been estimated (Abeles, 1991) to be in the range 0.003 to 0.2. In other words, it is physiologically quite possible that cognitive computations are characterized by a very small and hard-to-observe fraction of synapses that contribute above this range, perhaps even up to the threshold fraction of 1.0.
However, there is no experimental confirmation of this to date, and for that reason, this article addresses the possibility that at least some neural systems work entirely with synapses whose strengths are some orders of magnitude smaller. The time it takes for a neuron to complete a cycle of causing an action potential in response to action potentials in its presynaptic neurons has been estimated as being in the 1 to 10 millisecond range. Since mammals can perform significant cognitive tasks in 100 to 200 milliseconds, algorithms that take a few dozen parallel steps must suffice. The central technical contribution of this article is to show that two basic computational problems, memory formation and association, can be implemented consistently in model neural systems that respect all of the above-mentioned numerical parameters. The first basic function, JOIN, implements memory formation of a new item in terms of two established items: if two items A and B are already represented in the neural system, the task of JOIN is to modify the circuit so that at subsequent times, there is the representation of a new item C that will fire if and only if the representations of both A and B are firing. Further, the representation of C is an “equal citizen” with those of A and B for the purposes of subsequent calls of JOIN. JOIN
is intended as the basic form of memorization of an item and incorporates the idea that such memorization has to be indexed in terms of the internal representations of items already represented. The second function, LINK, implements association. If two “items” A and B are already represented in the neural system in the sense that certain inputs can cause either of these to fire, the task of LINK is to modify the circuit so that at subsequent times, whenever the representation of item A fires, the modified circuit will cause the representation of item B to fire also. Implicit in the definitions of both JOIN and LINK is the additional requirement that there be no deleterious interference or side effects. This means that the circuit modifications do not impair the functioning of previously established circuits; that when the newly created circuit executes, no unintended other items fire; and that the intended action of the new circuit cannot be realized in consequence of some unintended condition. We note that for some neural systems, such as the hippocampus and the olfactory bulb, the question of what items are being represented (whether locations or odors, for example) has been the subject of some experimental study already. Also for such systems, our four cortical parameters can be measured. We therefore expect that our analysis offers both explanatory and predictive value for understanding such systems. For the parts of cortex that process higher-level functions, the corresponding experimental evidence is more elusive. In order that our results apply to a wide range of neural systems, we describe computational results for systems within a broad range of realistic parameters. We show that for wide ranges of values of the neuron count between 10^5 and 10^9, and of values of the synapse count or degree between 16 and 10^6, there is a range of values of the synapse strength between .001 and .125 for which both JOIN and LINK can be implemented.
Furthermore, this latter range usually includes synaptic strengths that are at the small end of the range. Tables 1 through 4 summarize these data and show, given the values of the neuron count and degree of a neural system, the maximum synapse strength that is sufficient for JOIN and LINK in both disjoint and shared representations. The implied algorithms for LINK take just one step and for JOIN either two steps (see Tables 1 and 2) or also just one step (see Tables 3 and 4). The simplicity of these basic algorithms leaves room for more complex functions to be built on top of them. We also describe a general relationship among the parameters that holds under some stated assumptions for systems that use the mechanisms described. This relationship (∗) states that kn exceeds rd, but only by at most a fixed small constant factor, where 1/k is the maximum collective strength of the synapses to any one neuron from any one of its presynaptic neurons, n is the number of neurons, r is the number of neurons that represent a single item, and d is the number of neurons from which each neuron receives synapses.
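The relationship (∗) can be motivated by a heuristic edge count; the following sketch is our gloss on the statement above, not the article's own derivation:

```latex
% The r neurons representing an item emit about rd edges in total.
% Spread over the n neurons of the system, a typical target neuron
% therefore receives about rd/n of them.  For the threshold of k such
% edges to be met by roughly r target neurons (no more, no fewer),
% k must sit near this mean:
\frac{rd}{n} \approx k
\quad\Longrightarrow\quad
kn \approx rd,
% i.e., relationship (*): kn exceeds rd by at most a small constant factor.
```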
The essential novel contribution of this article is to show that random graphs have some unexpected powers. In particular, for parameters that have been observed in biology, they allow a method of assigning memory to a new item and also allow for paths, and algorithms for establishing the paths, for realizing associations between items. There is a long history of studies of random connectivity for neural network models, notably Beurle (1955), Griffith (1963, 1971), Braitenberg (1978), Feldman (1982), and Abeles (1991). In common with such previous studies, ours assumes random interconnections and does not apply to systems where, for example, the connections are strictly topographic. The other component of our approach that also has some history is the study of local representations in neural networks, including Barlow (1972), Feldman (1982), Feldman and Ballard (1982), Shastri and Ajjanagadde (1993), and Shastri (2001). The question of how multiple cognitive functions can be realized simultaneously using local representations and random connections has been pursued by Valiant (1988, 1994). Our central subject matter is the difficulty of computing flexibly on sparse networks where nodes are further frustrated in having influence on others by the weakness of the synapses. This difficulty has been recognized most explicitly in the work of Griffith (1963) and Abeles (1991). Griffith suggests communication via chains that consist of sets of k nodes chained together so that each member of each set of k nodes is connected to each member of the next set in the chain. If the synaptic strength of each synapse is 1/k, then a signal can be maintained along the chain. Abeles suggests a more general structure, which he calls a synfire chain, in which each set has h ≥ k nodes and each node is connected to k of the h nodes in the next set. He shows that for some small values of k, such chains can be found somewhere in suitably dense such networks. 
The goals of this article impose multiple constraints for which these previous proposals are not sufficient. For example, for realizing associations, we want that between any pair of items, there is a potential chain of communication. In other words, these chains have to exist from anywhere to anywhere in the network rather than just somewhere in the network. A second constraint is that we want explicit computational mechanisms for enabling the chain to be invoked to perform the association, and the passive existence of the chain in the network is not enough. A third requirement is that for memory formation, we need connectivity of an item to two others. Some readers may choose to view this article as one that solves a communication or wiring problem, and not a computational one. This view is partly justified since once it is established that the networks have sufficiently flexible interaction capabilities, the mechanisms required at the neurons are computationally very simple. For readers who wish to investigate more rigorously what simple means here, we have supplied a section that goes into more details. The model of computation used there is the neuroidal model (Valiant, 1994), which was designed to capture the communication capa-
bilities and limitations of cortex as simply as possible. It assumes only the simplest timing and state change mechanisms for neurons so that there can be no doubt that neurons are capable of doing at least that much. Demonstrating that some previously mysterious task can be implemented even on this simple model therefore has explanatory power for actual neural systems. The neuroidal model was designed to be more generally programmable than its predecessors and hence to offer the challenge of designing explicit computational mechanisms for explicitly defined and possibly multiple cognitive tasks. The contribution of this article may be viewed as that of exhibiting a wide range of new solutions to that model. The previous solutions given for the current tasks were under the direct action hypothesis, the hypothesis that synapses could become so strong that a single presynaptic neuron was enough to cause an action potential in the postsynaptic neuron. Whether this hypothesis holds for neural systems that perform the relevant cognitive tasks is currently unresolved. In contrast, the mechanisms described here are in line with synaptic strength values that have been widely observed and generally accepted. This letter pursues a computer science perspective. In that field, it is generally found, on the positive side, that once one algorithm has been discovered for solving a computational problem within specified resource bounds, many others often follow. On the other hand, on the negative side, it is found that the resource bounds on computation can be very severe. For example, for the NP-complete problem of satisfiability (Cook, 1971; Papadimitriou, 1994) of boolean formulas with n occurrences of literals, no algorithm is known that solves all instances in 2^{f(n)} steps for any function f(n) growing more slowly than linearly in n.
If a device were found that could solve this problem faster, then a considerable mystery would be created: the device would be using some mechanism that is not understood. Neuroscience has mysteries of the same computational nature and needs to resolve them. This letter aims at making one of these mysteries concrete and resolving it.

2 Graph Theory

We consider a random graph G with n vertices (Bollobas, 2001). From each vertex, there is a directed edge to each other vertex with probability p, so that the expected number of nodes to which a node is connected is d = p(n − 1). In this model, a vertex corresponds to a neuron, and a directed edge from one vertex to another models the synapse between the presynaptic neuron and the postsynaptic neuron. Such a model makes sense for neural circuits that are richly interconnected. A variant of the model is that of a bipartite graph where the vertex set can be partitioned into two subsets V1 and V2, such that every edge is directed from a V1 vertex to a V2 vertex. This would be appropriate for modeling the connections from one area of cortex to a
second, possibly distant, area. The analyses we give for JOIN and LINK apply equally to both variants. We shall use d = pn in the analysis, which is exact for the bipartite case and a good approximation for the other case for large n. In general, to obtain rigorous results about random graphs, we take the view that for the fixed nodes under consideration, the edges are present or not, each with probability p independent of each other. It is convenient for analysis to view the edges as being generated in that manner afresh rather than as fixed at some previous time. We assume that the maximum synaptic strength is 1/k of the threshold, for some integer k. In the graph-theoretic properties, we shall therefore always need to find at least k presynaptic neighbors to model the k presynaptic neurons that need to be active to make the neuron in question active. Finally, we shall model the representation of an item in cortex by a set of about r neurons, where r is the replication factor. In general, such an item will be considered to be recognized if essentially all the constituent neurons are active. In general, different items will be represented by different numbers of neurons, though of the same order of magnitude. We do not try to ensure that they are all represented by exactly r. However, once an item is represented by some r′ neurons, then it makes sense to assert that if no more than r′/2 of its members are firing, then the item has not been recognized. We call a representation disjoint or shared, respectively, depending on whether the sets that represent two distinct items need, or need not, be disjoint. In disjoint representations, clearly, no more than n/r items can be represented, while shared representations allow for many more, in principle.

2.1 Memory Formation for Disjoint Representations. The JOIN property is the following. Given values of n, d, and k, which are the empirical parameters of the neural system, we need to show that the following holds.
Given two subsets of nodes A and B of size r, the number of nodes to which there are at least k edges directed from A nodes and also at least k edges directed from B nodes has expected value r. The vertices that are so connected will represent the new item C. The property just stated ensures that the representation of C will be made to fire by causing the representations of both A and B to fire. The required network is illustrated in Figure 1. We want C to be an equal citizen as much as possible with A and B for the purposes of further calls of JOIN. We ensure this by requiring that the expected number of nodes that represent C is the same as the number of those that represent A and B. In general, we denote by B(r, p, k) the probability that in r tosses of a coin that comes up heads with probability p and tails with probability 1 − p, there will be k or more heads. This quantity is equal to the sum, for j ranging from k to r, of T(r, p, j) = (r!/(j!(r − j)!)) p^j (1 − p)^{r−j}. For constructing our tables, we compute such terms to double precision using an expansion
[Figure omitted: items A and B, each represented by r nodes, each sending at least k edges to the nodes representing C.]
Figure 1: Graph-theoretic structure needed for the two-step algorithm for disjoint representations for the memorization of the conjunction of items at A and B. For shared representations, the sets A and B may intersect. For the one-step algorithm, there is a bound km on the total number of edges coming from A and B rather than bounds on A and B separately.
for the logarithm of the factorial or gamma function (Abramowitz & Stegun, 1964). We now consider the JOIN property stated above. For each vertex u in the network, the probability that it has at least k edges directed toward it from the r nodes of A is B(r, p, k), since each vertex of A can be regarded as a coin toss with probability p of heads (i.e., of being connected to u) and we want at least k successes. The same holds for the nodes of B. Hence the probability of a node being connected in this way to both A and B is p′ = (B(r, p, k))^2, and hence the expected number of vertices so connected is n times this quantity. The stated requirement on the JOIN property therefore is that the following be satisfied: n(B(r, p, k))^2 = r.
(2.1)
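The binomial tail B(r, p, k) and the allocation condition in equation 2.1 can be evaluated directly with the log-gamma computation the text mentions. A minimal sketch (the parameter values n, r, k and the scanned range of p are illustrative assumptions, not entries from the article's tables):

```python
import math

def B(r, p, k):
    """Tail probability of k or more heads in r tosses: P(X >= k), X ~ Bin(r, p).

    Each point mass T(r, p, j) is computed through the log of the gamma
    function, as the text describes, so binomial coefficients never overflow.
    """
    total = 0.0
    for j in range(k, r + 1):
        log_t = (math.lgamma(r + 1) - math.lgamma(j + 1) - math.lgamma(r - j + 1)
                 + j * math.log(p) + (r - j) * math.log(1 - p))
        total += math.exp(log_t)
    return total

# Equation 2.1 asks for n * B(r, p, k)^2 = r; scan p for the closest fit.
n, r, k = 100_000, 1_000, 16
grid = [i / 10_000 for i in range(1, 500)]   # p from 0.0001 to 0.0499
best_p = min(grid, key=lambda p: abs(n * B(r, p, k) ** 2 - r))
```

For these assumed values the scan settles around p ≈ 0.011, that is, a degree d = pn on the order of a thousand; the point is only that, once n, r, and k are fixed, equation 2.1 pins down p and hence d.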
This raises the important issue of stability. Even if the numbers of nodes assigned to A and B are both exactly r, this process will assign r nodes to C only in expectation. How stable is this process if such memorization operations are performed in sequence, with previously memorized items forming the A and B items of the next memorization operation? Fortunately, it is easy to see that this process gets more and more stable as r increases. The argument for relationship 2.1 showed that the number of nodes assigned to C is a random variable with a binomial distribution, defined as the number of successes in n trials where the probability of success in each trial is p′.
534
L. Valiant
This distribution therefore has mean np′, as already observed, and variance np′(1 − p′). The point is that this variance is close to the mean r = np′ if p′ is small relative to 1, and then the standard deviation, which is the square root of the variance, is approximately √r. Hence, the standard deviation as a fraction of the mean decreases as 1/√r as the mean np′ = r increases. For the ranges of values that occur typically in this article, such as r equal to 10^3, 10^4, or even much larger, this √r standard deviation will be a small fraction of r, and hence one can expect the memorization process to be stable for many stages. Thus, for stability, the large-k, large-r cases considered in this article are much more favorable than the k = 1 case considered in Valiant (1994) with r = 50. For the latter situation, some analysis and suggested ways of coping with the more limited stability were offered in Valiant (1994) and Gerbessiotis (2003). If fewer than half of the representatives of an item are firing, we regard that item as not being recognized. As a side-effect condition, we therefore want that if no more than half of one of A or B is active, then the probability that more than half of C is active is negligible. Since we cannot control the size of C exactly, we ensure the condition that at most half of C be active by insisting that at most a much smaller fraction of r, such as r/10, be active. The intention is that r/10 will be smaller than half of C even allowing for the variations in the size of C after several stages of memory allocation. This gives
B(n, B(r/2, p, k)B(r, p, k), r/10) ∼ 0.
(2.2)
The second side-effect condition we impose is related to the notion of capacity, or the number of items that can be stored. To guarantee large capacity, we need an assurance that the A ∧ B nodes allocated will not be caused to fire if a different conjunction is activated. The bad case is if the second conjunction involves one of A or B, say A, and another item D different from B. If the set of nodes allocated to A ∧ B and not to A ∧ D is of size at least 2r/3, then we will consider there to be no interference, since if A ∧ B is of size at most 4r/3, then the firing of A ∧ D will cause fewer than half of the nodes of A ∧ B to fire. The probability that a node receives k inputs from B and k from A, but fewer than k from D, is p′ = (1 − B(r, p, k))(B(r, p, k))^2, and we want the number of nodes so allocated to A ∧ B but not to A ∧ D to be at least 2r/3. Hence, we want

B(n, p′, 2r/3) ∼ 1.
(2.3)
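The √r stability argument above can be checked by simulation: draw the size of C as a binomial count many times and compare its spread to √r. The values of n, r, and the trial count below are illustrative assumptions:

```python
import math
import random

random.seed(0)
n, r = 10_000, 100
p_alloc = r / n          # success probability tuned so the mean allocation is r
trials = 300

# Size of C in each trial = number of the n candidate nodes that get allocated.
sizes = []
for _ in range(trials):
    sizes.append(sum(1 for _ in range(n) if random.random() < p_alloc))

mean = sum(sizes) / trials
std = math.sqrt(sum((s - mean) ** 2 for s in sizes) / trials)
# std should sit near sqrt(r) = 10, a small fraction of the mean r = 100.
```

The sample standard deviation lands near √r ≈ 10 while the mean stays near r = 100, matching the claim that the relative fluctuation shrinks as 1/√r.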
2.2 Association for Disjoint Representations. We now turn to the LINK property, which ensures that B can be caused to fire by A via an intermediate set of “relay” neurons: Given two sets A and B of r nodes, for each B vertex u, with high probability the following occurs: the number of (relay) vertices from which
[Figure omitted: the r nodes of A send at least k edges each to relay nodes, which in turn connect to the r nodes of B.]
Figure 2: Graph-theoretic structure needed for the algorithm for establishing an association of the item at A to the item at B.
there is a directed edge to u and to which there are at least k edges directed from A nodes is at least k. We shall call this probability Y. This property ensures that each neuron u that represents B will be caused to fire with high probability if the A representation fires. This property is illustrated in Figure 2. For the LINK property, we note that the probability of a vertex having at least k connections from A and also having a connection to the B vertex u is pB(r, p, k). We need the number of such nodes to be at least k with high probability, or in other words that

Y = B(n, pB(r, p, k), k) ∼ 1.
(2.4)
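Equation 2.4 can likewise be evaluated numerically. In the sketch below, the parameter values are assumptions chosen only so that the degree d = pn and the threshold k fall in plausible ranges; the point is that Y comes out essentially 1 whenever the expected number of relay nodes, n·pB(r, p, k), far exceeds k:

```python
import math

def B(r, p, k):
    """P(X >= k) for X ~ Bin(r, p), via log-gamma to stay stable for large r."""
    total = 0.0
    for j in range(k, r + 1):
        log_t = (math.lgamma(r + 1) - math.lgamma(j + 1) - math.lgamma(r - j + 1)
                 + j * math.log(p) + (r - j) * math.log(1 - p))
        total += math.exp(log_t)
    return total

n, r, k = 100_000, 1_000, 8
p = 0.01                     # edge probability, i.e. degree d = pn = 1000
inner = B(r, p, k)           # P(a candidate relay gets >= k edges from A)
Y = B(n, p * inner, k)       # P(a fixed B-neuron has >= k relay nodes)
```

With these assumed values, the inner tail is about 0.8, so a fixed B-neuron expects roughly n·p·inner ≈ 780 relay nodes against a threshold of k = 8, and Y is indistinguishable from 1.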
As a side-effect condition, we need that if at most half of A fire, then with high probability, fewer than half of B should fire. We approximate this quantity by assuming independence for the various u:
B(r, B(n, pB(r/2, p, k), k), r/2) ∼ 0.
(2.5)
As a second side-effect condition, we consider the probability that a third item C, for which no association with B has been set up by a LINK operation, will cause B to fire because some relay nodes are shared with A. Further, we make this more challenging by allowing there to have been t, rather than just one, associations with B previously set up, say from A1, . . . , At. Now, the probability that a node will act as a relay node from A1 to a fixed node u in B is pB(r, p, k). For t such previous associations, the probability that a node acts as a relay for at least one of the Ai is p′ = p(1 − (1 − B(r, p, k))^t). If we require that these nodes be valid relay nodes from item C also, then this probability gets multiplied by another factor of B(r, p, k), since C is disjoint from A1, . . . , At.
536
L. Valiant
Then the side-effect requirement becomes that p′′ = B(n, p′B(r, p, k), k), the probability of there being at least k relay nodes for u, is so small that it is unlikely that a large fraction, say at least r/2, of the B nodes are caused to fire by C. We approximate this quantity also by making the assumption of independence for the various u:
B(r, p′′, r/2) ∼ 0.
(2.6)
2.3 Memory Formation for Shared Representations. By shared representation, we mean a representation in which each neuron can represent more than one item. There is no longer a distinction between nodes that have and those that have not been allocated. The items each node will be assigned by JOIN are already determined by the network connections, without any training process being necessary. The actual meaning of the items that will be memorized at a node will, of course, depend on the meanings assigned by a different process, such as the hard wiring of certain sensory functions to some input nodes. The model here is that an item is represented by r neurons, randomly chosen. The expected intersection of two such sets is then of size r^2/n. We can recompute the relations corresponding to equations 2.1 to 2.6 under this assumption. For simplicity, we shall consider only the case in which, for JOIN, the neurons of A and B are in one area and those of C in another, and, for LINK, the neurons of A, B, and the relay nodes are in three different areas, respectively. Then, if for simplicity we make the assumption that the intersection is of size exactly r′, the closest integer to r^2/n, relation 2.1 for JOIN becomes

n{B(r′, p, k) + Σ_{i=0}^{k−1} [T(r′, p, i)(B(r − r′, p, k − i))^2]} = c1 r,
(2.1’)
where i in the summation indexes the number of connections from the intersection of A and B. We need to show that c1 is close to 1. Equation 2.2′ we adapt from equation 2.2 to be the following:
B(n, p′, r/10) ∼ 0,
(2.2’)
where p′ = B(r′, p, k) + Σ_{i=0}^{k−1} [T(r′, p, i) B(r − r′, p, k − i) B(r/2 − r′, p, k − i)], where i in the summation indexes the number of connections from a neuron that is in both A and B, and r′ = r^2/n. We adapt equation 2.3′ from equation 2.3 by considering the case that the intersections A ∩ B, A ∩ D, and B ∩ D all have their expected sizes r′ = r^2/n, and that A ∩ B ∩ D has its expected size r′′ = r^3/n^2. For a fixed node in C, we shall denote the numbers of connections to these four intersections by i + m, j + m, l + m, and m, respectively, in the summation below. Then the probability of a node being allocated to A ∧ B and not to A ∧ D is lower
bounded by the following quantity p′′, where r∧ = r′ − r′′:

Σ_{i=0}^{k} T(r∧, p, i) Σ_{j=0}^{k−i} T(r∧, p, j) Σ_{l=0}^{k−i−j} T(r∧, p, l) Σ_{m=0}^{k−i−j−l} T(r′′, p, m) [B(r − 2r′ + r′′, p, k − i − j − m) B(r − 2r′ + r′′, p, k − j − l − m) (1 − B(r − 2r′ + r′′, p, k − i − l − m))].

(Note that to allow for terms in which the intersection of A and B has more than k connections to the C node, we need to interpret the first term T(r∧, p, i) in the above expression, for the particular value i = k, to mean B(r∧, p, k).) Then the relationship we need is
B(n, p′′, 2r/3) ∼ 1.
(2.3’)
2.4 Association for Shared Representations. Equations 2.4′ and 2.5′ in the shared coding are identical to equations 2.4 and 2.5, since we are assuming here, for simplicity, that in an implementation of LINK between A and B, the neurons representing A, B, and the relay nodes are from three disjoint sets. Equation 2.6′ will correspond to equation 2.6 in the special case of t = 1. For a fixed node in B, the probability of a node u being a relay node to it and having at least k edges coming from each of A and C is p′ = p(B(r′, p, k) + Σ_{i=0}^{k−1} [T(r′, p, i)(B(r − r′, p, k − i))^2]). The probability that there are at least k of these is p′′ = B(n, p′, k). We want the probability that at least half of the members of B have such sets of at least k relay nodes to be small. We approximate this quantity by assuming independence for the various u:
B(r, p′′, r/2) ∼ 0.
(2.6’)
2.5 One-step Memory Formation for Shared Representations. The algorithm for memorization implied above requires A and B to be active at different time instants and therefore requires the memorization process to take two steps. Here we shall discuss a process by which memorization can be achieved in one step. For brevity, we shall consider one-step memorization only for the shared representation case. The results are presented in Tables 3 and 4. The main advantage of these one-step algorithms is that the algorithms become even simpler and assume even less about the timing mechanism of the model of computation. But one small complication arises: the number of inputs needed to fire a node in the memorization algorithm is now different from that needed in the association algorithm. Instead of having a single parameter k, we shall have two parameters, km and ka, respectively. It turns
538
L. Valiant
out that km = 2ka works. This means that we can fix the parameter k of the neurons to be km and then use a weight in the memorization algorithm that is only half as strong as the maximum value allowed.

We now consider the JOIN property. For each vertex u that is a potential representative of C, the probability that it has at least km edges directed toward it from the nodes of A ∪ B is now p′ = B(2r − r′, p, km), provided the intersection of A and B is of size exactly r′, the integer closest to the expectation r²/n. This is because each vertex of A ∪ B may be regarded as a coin toss with probability p of coming up heads (i.e., of being connected to u), and we want at least km successes. Hence, the expected number of vertices so connected is n times this quantity. The stated requirement on the JOIN property therefore is that the following be satisfied:

nB(2r − r′, p, km) ∼ r.
(2.1”)
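Relation 2.1″ can be probed numerically; the sketch below (helper names ours; B(n, p, k) is the binomial upper tail, and the intersection r′ = r²/n is rounded to the nearest integer) computes the expected number of nodes with at least km edges from A ∪ B:

```python
from math import comb

def upper_tail(n_trials, p, k):
    """B(n, p, k): probability of at least k successes in n Bernoulli(p) trials."""
    return 1.0 - sum(comb(n_trials, i) * p**i * (1 - p)**(n_trials - i)
                     for i in range(k))

def expected_join_nodes(n, d, r, km):
    """n * B(2r - r', d/n, km): expected number of potential C representatives
    with at least km edges directed to them from A ∪ B (relation 2.1'')."""
    p = d / n
    r1 = round(r * r / n)           # assumed intersection size r' = r^2/n
    return n * upper_tail(2 * r - r1, p, km)
```

For relation 2.1″ to hold, this quantity should come out close to r itself; it grows monotonically with r, which is what makes a search for the solving r feasible.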
Alternatively, one could impose some specific probability distribution on the choice of A and B and compute an analog of equation 2.1 that is precise for that distribution and does not need the assumption that the intersections are of size exactly r′. If fewer than half of the representatives of an item are firing, we regard that item as not being recognized. As a side-effect condition, we therefore want that if no more than half of one of A or B is active, then the probability that more than half of C is active is negligible. In exact analogy with equations 2.2 and 2.2′, we have that
B (n, B (3r/2, p, km ), r/10) ∼ 0.
(2.2”)
Here, 3r/2 upper-bounds the number of firing nodes in A ∪ B if at most r/2 are firing in A and all are firing in B, say. As a second side-effect condition, we again need an assurance that the A ∧ B nodes allocated will not be caused to fire if a different conjunction is activated. Again, a bad case is if the second conjunction is A ∧ D, where D is different from B. If the node set allocated to A ∧ B and not to A ∧ D is at least of size 2r/3, we will consider there to be no interference, since if A ∧ B is of size less than 4r/3, then the firing of A ∧ D will cause fewer than half of the nodes of A ∧ B to fire. If A, B, and D were disjoint sets of r nodes, then the probability that a node receives km inputs from A ∪ B but fewer than km from A ∪ D would be

p′ = Σ_{s=0}^{km−1} T(r, p, s) B(r, p, km − s)(1 − B(r, p, km − s)),

where s denotes the number of nodes in A that are connected to that node. We want the number of nodes that are so allocated to A ∧ B but not to A ∧ D to be at least 2r/3. Hence we would want that
B(n, p′, 2r/3) ∼ 1.
(2.3”)
Memorization and Association on a Realistic Neural Model
539
In the event that A, B, and D are not disjoint but randomly chosen sets of r elements, we need equation 2.3″ but with a value of p′ computed as follows. We assume that the intersections A ∩ B, A ∩ D, and B ∩ D all have their expected sizes r′ = r²/n, and that A ∩ B ∩ D has its expected size r′′ = r³/n². For a fixed node in C, we shall denote the number of connections to these four intersections by i + m, j + m, l + m, and m, respectively, in the summation below. Then the probability of a node being allocated to A ∧ B and not to A ∧ D is lower-bounded by the following quantity p′, where r∧ = r′ − r′′ and r# = r − 2r′ + r′′:

p′ = Σ_{s=0}^{km−1} T(r#, p, s) Σ_{i=0}^{km−s−1} T(r∧, p, i) Σ_{j=0}^{km−i−s−1} T(r∧, p, j) Σ_{m=0}^{km−i−j−s−1} T(r′′, p, m) Σ_{l=0}^{km−i−j−m−s−1} T(r∧, p, l) [B(r#, p, km − s − i − j − l − m) (1 − B(r#, p, km − s − i − j − l − m))].

Here s indexes the number of connections from nodes that are in A but not in B or D.

2.6 One-Step Association with Shared Representations. For the LINK property, we simply have relations 2.4, 2.5, and 2.6 with k replaced by ka:

Y = B(n, pB(r, p, ka), ka) ∼ 1,
(2.4”)
B(r, B(n, pB(r/2, p, ka), ka), r/2) ∼ 0,
(2.5”)
and
B(r, p′′, r/2) ∼ 0,
(2.6”)
where p′′ = B(n, p′, ka), and p′ = p(B(r′, p, ka) + Σ_{i=0}^{ka−1} [T(r′, p, i)(B(r − r′, p, ka − i))²]). Since equation 2.6″ guarantees only that one previous association to item C will not lead to false associations with C, we also use it to compute an estimate of the maximum number of previous associations that still allow resistance to such false associations.

3 Graph-Theoretic Results

In Tables 1 and 2 we summarize the solutions we have found. For each combination of n, d, and k, the table entry gives a value of r that satisfies all six
conditions 2.1 to 2.6 to high precision, as well as their equivalents, conditions 2.1′, 2.2′, 2.3′, and 2.6′, for shared representations. (There are essentially only eight equations, since equations 2.2′ and 2.6′ subsume equations 2.2 and 2.6, respectively.) For example, consider a neural system with n = 1,000,000 neurons where each one is connected to 8192 others on the average and the maximum synaptic strengths are 1/64 of the threshold amount. The entry 8491 found in Table 1 gives the value of r that solves the 10 constraints. It means that if each item is represented by about 8491 neurons, then the graph has the capability of realizing JOIN and LINK using the algorithms to be outlined in later sections.

The central positive result of this article is the existence of entries in the tables for combinations of neuron numbers, synapse strengths, and synapse numbers that are widely observed in neural systems. We expect that the analysis that underlies the tables offers a basis for a calculus for understanding the algorithms and data structures used in specific systems, such as the hippocampus or the olfactory bulb.

In interpreting the tables, the following comments are in order. The solutions were found by solving equation 2.1 using binary search and discarding any solutions that failed to solve the remaining relations. As a further comment, we note that equation 2.1 and some of the others are defined only for integer values of r, and in our search we therefore imposed the constraint of allowing only such integer values. The values of r are therefore the integer values for which the value of the left-hand side of equation 2.1 is as close as possible to r. We detail the exact integer values as a reminder of this. For all the entries shown, the difference between the two sides of equation 2.1 is at most 1%, except for those labeled ∗, where a difference of 10% is allowed.
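The binary search mentioned here can be sketched as follows. Equation 2.1 itself is stated earlier in the article; we assume here, following the remark in section 7 that it has the form f(B(r, p, k)) = r/n with f the squaring function and p = d/n, so the search looks for the integer r at which nB(r, p, k)² crosses r. The helper names are ours:

```python
from math import comb

def upper_tail(n_trials, p, k):
    """B(n, p, k): probability of at least k successes in n Bernoulli(p) trials."""
    return 1.0 - sum(comb(n_trials, i) * p**i * (1 - p)**(n_trials - i)
                     for i in range(k))

def solve_r(n, d, k):
    """Binary search for the smallest integer r with n * B(r, d/n, k)^2 >= r
    (assumed form of equation 2.1; on [1, n/2] the left-hand side
    crosses r once from below)."""
    p, lo, hi = d / n, 1, n // 2
    while lo < hi:
        mid = (lo + hi) // 2
        if n * upper_tail(mid, p, k) ** 2 < mid:
            lo = mid + 1
        else:
            hi = mid
    return lo
```

Under this reading, solve_r(100000, 256, 8) and solve_r(100000, 512, 8) land close to the corresponding Table 1 entries 1981 and 899, which is some evidence for the assumed form.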
The following are some further observations. The case of k = 1 in equation 2.4 is known (Valiant, 1994) to give an asymptotic value of Y ∼ 1 − 1/e = .63 . . . . In general, values k = 1, 2, and 4 violate at least one of equations 2.2 and 2.5. Most of the entries support equation 2.6 only up to t = 1, except for some entries with k = 8 or 16 and with some of the higher values of d. Finally, corresponding entries for different values of the neuron count n are in the ratio of the values of n.

We note that for the k = 1 case, the analysis in Valiant (1994) relates to the analysis here in the following way. It is observed there that in general, for any r and n, the graph density that supports JOIN is too sparse to support LINK with a Y value close to 1. The suggested solution there is to use a graph that is dense enough to support LINK and to have it ignore a random fraction of the connections when implementing JOIN so as to effectively use a sparser graph regime for that purpose. On a separate issue, the earlier analysis did not have any equivalents of the relations 2.2 and 2.5 above. For disjoint representations, where the intention is that in each situation either all or none of the representatives of an item fires, these conditions might be argued to be too onerous.

Tables 3 and 4 summarize the solutions we have found that support the one-step algorithm. We note that in general, corresponding values of r are a
Table 1: Value of the Replication Factor r That Is the Closest Integer Solution to Equation 2.1 for the Given Values of the Neuron Count n, the Degree d, and the Inverse Synaptic Strength k. k=8 n = 100,000 neurons d = 128 d = 256 1981 d = 512 899 d = 1024 412 d = 2048 d = 4096 d = 8192 d = 16,384 d = 32,768 d = 65,536 n = 1,000,000 neurons d = 128 d = 256 19,803 d = 512 8979 d = 1024 4105 d = 2048 1888 ∧ 873 d = 4096 ∧ 406 d = 8192 ∧ 190 d = 16,384 d = 32,768 d = 65,536 d = 131,072 d = 262,144 d = 524,288 n = 10,000,000 neurons d = 128 d = 256 198,025 d = 512 89,777 d = 1024 41,033 d = 2048 18,868 ∧ 8717 d = 4096 ∧ 4043 d = 8192 ∧ 1882 d = 16,384 ∧ 879 d = 32,768 d = 65,536 d = 131,072 d = 262,144 d = 524,288 d = 1,048,576
k = 16
5025 2338 1098 519 *247 *119
50,235 23,365 10,957 5168 2449 1165 ∧ 557 ∧ 268 ∧ *130
k = 32
5420 2582 1238 597 *290 *143
54,181 25,796 12,353 5940 2866 1388 ∧ 675 ∧ 330 ∧ *163
k = 64 k = 128 k = 256 k = 512 k = 1024
2749 1337 654 *322 ∧ *162
2865 1407 *695 ∧ *347
27,460 13,330 6491 3169 1552 ∧ *763 ∧ *378 ∧ *190
28,607 14,015 6883 ∧ 3388 ∧ *1672 ∧ *830 ∧ *415
1458
∧ *727
14,497 7161 ∧ 3545 ∧ 1761
*1496
∧ *753
14,836 ∧ 7360 ∧ 3660 ∧ *1827
502,339 233,636 541,791 109,547 257,940 51,660 123,498 274,571 24,467 59,368 133,265 286,021 11,628 28,628 64,866 140,098 ∧ 5542 13,839 31,643 68,761 144,886 ∧ 2649 6704 15,464 33,802 71,516 148,248 ∧ 1269 ∧ 3255 ∧ 7570 ∧ 16,640 ∧ 35,342 ∧ 73,463 ∧ 610 ∧ 1584 ∧ 3712 ∧ 8202 ∧ 17,484 ∧ 36,437 ∧ 295 ∧ 773 ∧ 1824 ∧ 4049 ∧ 8660 ∧ 18,089
∧ 74,842
Notes: Equation 2.1 is accurate to ratio 10^−2 (but only 10^−1 if marked by ∗) and equation 2.4 to 10^−6. Equations 2.2, 2.3, 2.5, 2.6, 2.2′, 2.3′, and 2.6′ are accurate to 10^−6. For unmarked entries these accuracies are achieved even if the noise rates are 10^−4 for equations 2.2, 2.3, 2.5, and 2.2′; 10^−5 for 2.6 and 2.3′; and 10^−6 for 2.6′. For entries marked with a ∧, this accuracy is achieved with the lower noise rate 10^−6 for all seven equations. Equation 2.1 is satisfied with constant 1 < c1 < 1.1.
k = 16
5,023,377 2,336,342 1,095,455 516,578 244,648 116,254 ∧ 55,394 ∧ 26,455 ∧ 12,660 ∧ 6069 ∧ 2915
k=8
n = 100,000,000 neurons d = 128 d = 256 1,980,239 d = 512 897,763 d = 1024 410,318 d = 2048 188,664 ∧ 87,153 d = 4096 ∧ 40,412 d = 8192 ∧ 18,797 d = 16,384 ∧ 8766 d = 32,768 ∧ 4098 d = 65,536 d = 131,072 d = 262,144 d = 524,288 d = 1,048,576 5,417,894 2,579,374 1,234,952 593,653 286,244 138,351 67,001 ∧ 32,501 ∧ 15,789 ∧ 7681
k = 32
2,745,675 1,332,609 648,612 316,375 154,582 ∧ 75,636 ∧ 37,053 ∧ 18,172
k = 64
2,860,166 1,400,921 687,547 337,949 ∧ 166,314 ∧ 81,930 ∧ 40,397
k = 128
1,448,778 715,063 ∧ 353,310 ∧ 174,721 ∧ 86,468
k = 256
1,482,369 ∧ 734,499 ∧ 364,217 ∧ 180,718
k = 512
∧ 371,950
∧ 748,226
k = 1024
Table 2: The Value of the Replication Factor r That Is the Closest Integer Solution to Equation 2.1 for the Given Values of the Neuron Count n, the Degree d, and the Inverse Synaptic Strength k.
50,233,759 23,363,401 10,954,534 5,165,758 2,446,454 1,162,518 553,914 ∧ 264,523 ∧ 126,564 ∧ 60,656 ∧ 29,112
n = 1,000,000,000 neurons d = 128 d = 256 19,802,377 d = 512 8,977,613 d = 1024 4,103,166 d = 2048 1,886,626 ∧ 871,517 d = 4096 ∧ 404,099 d = 8192 ∧ 187,946 d = 16,384 ∧ 87,638 d = 32,768 ∧ 40,955 d = 65,536 d = 131,072 d = 262,144 d = 524,288 d = 1,048,576 54,178,923 25,793,723 12,349,496 5,936,498 2,862,401 1,383,469 669,965 ∧ 324,966 ∧ 154,842 ∧ 76,758
k = 32
27,456,714 13,326,048 6,486,077 3,163,694 1,545,764 ∧ 756,296 ∧ 370,461 ∧ 181,644
k = 64
28,601,613 14,009,157 6,875,400 3,379,417 ∧ 1,633,055 ∧ 819,214 ∧ 403,874
k = 128
14,487,695 7,150,540 ∧ 3,532,995 ∧ 1,747,094 ∧ 864,553
k = 256
∧ 1,807,014
∧ 3,642,013
14,823,572
∧ 7,344,851
k = 512
∧ 3,719,279
∧ 7,482,072
k = 1024
Notes: Equation 2.1 is accurate to ratio 10^−2 and equation 2.4 to 10^−6. Equations 2.2, 2.3, 2.5, 2.6, 2.2′, 2.3′, and 2.6′ are accurate to 10^−6. For unmarked entries, these accuracies are achieved even if the noise rates are 10^−4 for 2.2, 2.3, 2.5, and 2.2′; 10^−5 for 2.6 and 2.3′; and 10^−6 for 2.6′. For entries marked with a ∧, this accuracy is achieved with the lower noise rate 10^−6 for all seven equations. Equation 2.1 is satisfied with constant 1 < c1 < 1.1.
k = 16
k=8
Table 2: Continued.
Table 3: The Value of the Replication Factor r That Is the Closest Integer Solution to Equation 2.1 for the Given Values of the Neuron Count n, the Degree d, and the Inverse Synaptic Strength k, where km = 2k and ka = k. k = 8 k = 16
k = 32
k = 64 k = 128 k = 256 k = 512 k = 1024
n = 100,000 neurons d = 128 d = 256 d = 512 2134 5170 d = 1024 1000 2436 d = 2048 473* 1164 2653 d = 4096 562* 1284* d = 8192 628* d = 16,384 d = 32,768 d = 65,536 n = 1,000,000 neurons d = 128 d = 256 d = 512 d = 1024 3571 9974 24327 d = 2048 4707 11,603 26,487 d = 4096 2233 5576 12,792 d = 8192 1065 2692 6219 d = 16,384 1305 3037 d = 32,768 636* 1488* d = 65,536 733 d = 131,072 364* d = 262,144 d = 524,288 n = 10,000,000 neurons d = 128 d = 256 d = 512 d = 1024 35,693 99,248 d = 2048 16,451 46,807 115,310 d = 4096 22,307 55,487 126,970 d = 8192 10,619 26,877 61,909 d = 16,384 5069 13,005 30,302 d = 32,768 6307 14,815 d = 65,536 7258 d = 131,072 3562 d = 262,144 d = 524,288 d = 1,048,576
2809 1372* 678* 339*
2922 1438* 716*
1488*
28,027 13,650 6688 3290 1625* 807* 406*
29,122 14,265 7028 3477 1728* 865
29,899 14,706 7274 3614 1806
7454 3717*
66,577 32,814 16,155 7967 3936
69,939 34,635 17,131 8487
72,347 35,946 17,838 8867
36,887 18,350
Notes: Equation 2.1 is accurate to ratio 10^−2 (but only 10^−1 if marked by *) and equation 2.4 to 10^−6. Equations 2.2″, 2.3″, 2.5″, and 2.6″ are accurate to 10^−6. These accuracies are achieved even if the noise rates for equations 2.2″, 2.3″, 2.5″, and 2.6″ are 10^−4, 10^−4, 10^−4, and 10^−5, respectively.
n = 100,000,000 neurons d = 128 d = 256 d = 512 d = 1024 356,186 d = 2048 164,349 d = 4096 d = 8192 d = 16,384 d = 32,768 d = 65,536 d = 131,072 d = 262,144 d = 524,288 d = 1,048,576
k=8
991,770 469,174 222,765 106,086 50,640
k = 16
1,152,860 555,466 268,344 129,914 63,004
k = 32
1,270,184 619,272 302,491 147,976 72,484 35,546
k = 64
665,623 327,502 161,315 79,539 39,248
k = 128
698,966 345,606 171,018 84,678
k = 256
722,820 358,615 178,029 88,413
k = 512
367,903 183,035
k = 1024
Table 4: The Value of the Replication Factor r That Is the Closest Integer Solution to Equation 2.1 for the Given Values of the Neuron Count n, the Degree d, and the Inverse Synaptic Strength k, Where km = 2k and ka = k.
k=8
9,917,674 4,691,653 2,227,722 1,060,910 506,456
k = 16
11,528,462 5,554,651 2,683,448 1,299,108 630,010
k = 32
12,701,863 6,192,598 3,024,779 1,479,667 724,718 355,323
k = 64
6,656,114 3,274,935 1,613,041 795,184 392,293
k = 128
6,989,595 3,455,971 1,710,073 846,699
k = 256
7,228,103 3,585,972 1,780,004 883,950
k = 512
3,678,864 1,830,101
k = 1024
Notes: Equation 2.1 is accurate to ratio 10^−2, and equation 2.4 to 10^−6. Equations 2.2″, 2.3″, 2.5″, and 2.6″ are accurate to 10^−6. These accuracies are achieved even if the noise rates for equations 2.2″, 2.3″, 2.5″, and 2.6″ are 10^−4, 10^−4, 10^−4, and 10^−5, respectively.
n = 1,000,000,000 neurons d = 128 d = 256 d = 512 d = 1024 3,561,948 d = 2048 1,643,398 d = 4096 d = 8192 d = 16,384 d = 32,768 d = 65,536 d = 131,072 d = 262,144 d = 524,288 d = 1,048,576
Table 4: Continued.
little smaller in these tables than in Tables 1 and 2. With regard to equation 2.6, our findings, which are not detailed here, are as follows. While the parameters of Tables 1 and 2 usually support maximum values of t = 1, the smaller values of r in Tables 3 and 4 (where equation 2.6 is only an approximation) lead to rather larger values of t, scattered in the range 1 to 7. Further, it turns out that instead of using km = 2ka, as we do in these tables, we can find similar results for slightly smaller coefficients, such as km = 1.95ka or km = 1.9ka, and these give even smaller values of r and larger values of t.

4 The Computational Model and Algorithms

As explained earlier, our goal is not only to show that the connectivity of the networks we consider is sufficient to provide the minimum communication bandwidth needed for realizing memorization and association, but also to show that algorithms are possible that modify the network so as to be able to execute instances of these tasks. In particular, for each of these two tasks and each representation, one needs two algorithms: one for creating the circuit, say for associating A to B in the first place, and one for subsequently executing the task, namely, causing B to fire when A fires.

For describing such algorithms, we need a model of computation. We employ the neuroidal model because it is programmable and well suited to describing algorithms (Valiant, 1994). As mentioned earlier, the neuroidal model is designed to be so simple that there is no debate that real neurons have at least as much power. It is not designed to capture all the features of real neurons. Our algorithms are described for a variant of the neuroidal model that allows synapses to have memory in addition to weights.
This has some biological support (Markram & Tsodyks, 1996b) and allows for somewhat more natural programming, even though temporary values of synaptic weights may be used instead, in principle, to simulate such states (Valiant, 1994).

A brief summary of the model is as follows. A neuroidal net consists of a weighted directed graph G with a model neuron, or neuroid, at each node. A neuroid is a threshold element with some additional internal memory, which can be in one of a set of modes. The mode si of node i at an instant will specify the threshold Ti and may also have further components such as a member qi of a finite set of states Q. In particular, a mode is either firing or nonfiring, and the firing bit fi has value 1 or 0 accordingly. The weight of an edge from node j to node i is wji and models the strength of a synapse for which j is presynaptic and i is postsynaptic. Each synapse can also have a state qji, which with wji forms a component of the mode sji. The only way a neuroid i can be influenced by other neuroids is through the quantity wi, which equals the sum of the weights wji over all nodes j presynaptic to i that are in firing modes.

Each neuroid executes an algorithm that is local to itself and can be formally defined in terms of mode update functions δ and λ, for the neuroid
itself and each synapse, respectively:

δ(si, wi) = si′, and λ(si, wi, sji, fj) = sji′.

These relations express the values of the modes of the neuroid and synapses at one step in terms of the values at the previous step of the variables on which they are permitted to depend. Thus, the mode of a neuroid can depend only on its mode at the previous step and on the sum wi of weights of synapses incoming from firing nodes. The mode of one of its synapses can depend only on the mode of the same synapse, on the mode of the neuroid, on the firing status of the presynaptic neuroid, and on the sum wi of weights of synapses incoming from firing presynaptic neuroids.

The model assumes a timing mechanism that has two components. Each transition has a period. We assume here that all transitions have a period of 1, except for threshold transitions, those that are triggered by wi ≥ Ti, which work on a faster timescale. There is a global synchronization mechanism such that, for example, if some external input is to cause the representations of two items A and B to fire simultaneously, then the nodes partaking in these representations will be caused to fire synchronously enough that the algorithms that will be caused to execute can keep in lockstep for the duration of these local algorithms. These durations will typically be no more than 10 steps, and for the purposes of this article, just two steps.

By a disjoint representation, we mean a representation where each neuroid can represent at most one item, though one item may be represented by many neuroids. The specific disjoint representation that our algorithms support has been called a positive representation (Valiant, 1994). The generalization that allows a node to represent more than one item, as needed for the shared representations of the next section, we call a positive shared representation. The neuroidal model allows for negative weights, which may be needed, for example, for inductive learning.
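These update rules can be rendered schematically in code. The sketch below is our illustrative encoding (the mode is represented as a threshold, a state label, and a firing bit; δ and λ are supplied by the caller), not an implementation from the source:

```python
from dataclasses import dataclass

@dataclass
class Neuroid:
    """Mode of a neuroid: threshold T, state q, and firing bit f."""
    T: float
    q: str
    f: bool = False

def step(node, delta, lam, weights, states, firing):
    """One synchronous update of neuroid i. `weights[j]` and `states[j]` hold
    w_ji and q_ji for each presynaptic j; `firing[j]` is f_j."""
    w_i = sum(w for j, w in weights.items() if firing.get(j))  # summed firing input
    old_mode = (node.T, node.q, node.f)
    node.T, node.q, node.f = delta(old_mode, w_i)              # s_i' = delta(s_i, w_i)
    for j in weights:                                          # s_ji' = lambda(s_i, w_i, s_ji, f_j)
        weights[j], states[j] = lam(old_mode, w_i,
                                    (weights[j], states[j]),
                                    bool(firing.get(j)))
    return node
```

A plain threshold transition is then delta = lambda s, w: (s[0], s[1], w >= s[0]), with lam the identity on the synapse mode.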
However, the algorithms we describe here for JOIN and LINK do not use negative weights. We shall start with the algorithms needed for the disjoint two-step scheme implied by relations 2.1 to 2.6. The algorithms for implementing JOIN and LINK are very similar to those that were described for the same tasks for unit weights (Valiant, 1994, algorithms 7.2 and 8.1). It is clear, however, that once graph-theoretic properties such as equations 2.1 to 2.6 are guaranteed, then a rich variety of variants of these algorithms also suffices. We shall describe these algorithms informally here. The following algorithm for creating JOIN needs the nodes of A and B to be caused to fire at distinct time steps. The nodes that are candidates for C = A∧B (1) are initially in “unallocated” state q1, (2) have a fixed threshold T, (3) have each synapse in initial state qq1, and (4) have all the presynaptic weights initially at the value T/k.
The algorithm acting locally on each candidate C node will act over two steps. The first step is prompted by the firing of A and the second by the firing of B one time unit later. Following these two prompts, each candidate C node initially in state q1 that has at least k connections from A and also at least k connections from B will be in state q2, indicating that it has become a C node and been assigned to store something. Incoming weights from nodes other than A or B will be made zero. An incoming weight from A will equal T/x if there are x ≥ k of them, and those from B will equal T/y if there are y ≥ k of them. A candidate node that does not receive two successive prompts will return to the initial unassigned condition.

The algorithmic mechanism that realizes this outcome is the following. First, note that a node that does become a C node needs to make each incoming synapse have one of the three weight values depending on whether it comes from A or B or neither. The trick is that after the A prompt, the synapses from A will memorize the value T/x for an x ≥ k as their weight and memorize the fact that they are in this transitory condition by having the synapse state take the temporary value qq2. Also, the node will memorize the fact that it is in a transitory state in which k connections from A have been found by going to state q3. At the B prompt, if at a candidate node in state q3 some y ≥ k synapses come from firing nodes, then these synapses can be updated to have value T/y. At the same time, the A synapses, in state qq2, can go on to take on the values T/x, and the remaining synapses the value 0. However, if no such k connections from B nodes are found at this second step, then the whole neuroid returns to its initial condition.

The reader can verify that the circuit constructed as described can execute the created conjunction using a very similar two-step process if at any later time A and B are presented at successive time instants.
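The two-step allocation just described can be simulated in miniature. The sketch below is our toy rendering (the state names q1, q2, q3, qq1, and qq2 follow the text; qq3 and all other details are illustrative assumptions of ours):

```python
T = 1.0  # firing threshold of every candidate node

class CandidateNode:
    """Toy candidate C node for the two-step JOIN creation algorithm."""

    def __init__(self, k, presynaptic):
        self.k = k
        self.state = "q1"                      # unallocated
        self.syn = {j: {"w": T / k, "state": "qq1"} for j in presynaptic}

    def prompt(self, firing):
        """One step; `firing` is the set of presynaptic nodes firing now."""
        live = [j for j in self.syn if j in firing]
        if self.state == "q1" and len(live) >= self.k:
            for j in live:                     # memorize the A synapses
                self.syn[j].update(w=T / len(live), state="qq2")
            self.state = "q3"                  # transitory: A prompt seen
        elif self.state == "q3":
            b_live = [j for j in live if self.syn[j]["state"] == "qq1"]
            if len(b_live) >= self.k:          # B prompt: allocation succeeds
                for j in b_live:
                    self.syn[j].update(w=T / len(b_live), state="qq3")
                for s in self.syn.values():
                    if s["state"] == "qq1":    # from neither A nor B
                        s["w"] = 0.0
                self.state = "q2"              # allocated to C = A ∧ B
            else:                              # no B prompt: full reset
                for s in self.syn.values():
                    s.update(w=T / self.k, state="qq1")
                self.state = "q1"

    def fires(self, firing):
        """Threshold execution: total incoming weight from firing nodes >= T."""
        return sum(s["w"] for j, s in self.syn.items() if j in firing) >= T
```

Prompting with A = {1, 2, 3, 4} and then B = {10, 11, 12} (with k = 3) drives such a node from q1 through q3 to q2, with weights T/4 on the A synapses and T/3 on the B synapses.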
However, many variations are possible. For example, if the weights are set to T/2x and T/2y instead of T/x and T/y, then simultaneous presentation of A and B will work for recognition. Thus, one-step execution is possible even with two-step creation.

We observe here that in general, there is a chance that the set of neuroids identified to represent a conjunction is mostly previously taken, and the new ones that can be assigned form only a small fraction of r. Condition 2.2 ensures that this effect is initially limited. The situation here is akin to that of a hashing scheme in which, as the memory fills up, fewer and fewer new places are available (Valiant, 1994).

We now go on to discuss an algorithm for creating LINK. We consider A to be represented in one area from which there are directed edges to an area of relay neuroids, from which in turn there are directed edges to a third area containing B. Initially, the relay neuroids have threshold T, all incoming edges have weight T/k, and all outgoing edges have weight 0. The relay neuroids never change, except for firing. The neuroids in B are in state q1 initially.
The algorithm for creating LINK has one step, in which the representations of A and B are caused to fire at the first prompt. For the nodes in B that are in state q1 and are caused to fire, each incoming edge from a firing node is given weight T/k. The algorithm for executing LINK is even simpler. It requires simple threshold firing at both the relay level and in the B nodes.

We now go on to discuss the shared representation expressed by relations 2.1′ to 2.6′. The algorithm given for creating and executing LINK in the disjoint case described above applies unchanged to the shared case also. For JOIN, the creation algorithm given has to be modified so that no distinction is made any longer between allocated and unallocated nodes. Now no creation process is necessary. Execution can be realized by the following modification of the creation algorithm for the disjoint case: (1) the final state is made the same as the initial state q1, rather than a new state q2, and (2) no synaptic weights are changed at all. The evaluation algorithm is unchanged.

The descriptions of the two algorithms above assume bipartite graphs, in which A and B will be in different areas for the case of LINK, and C in a different area from A and B in the case of JOIN. To adapt the algorithms to general graphs, small modifications are needed to allow for the node sets having nonzero intersections. For the shared representation one-step algorithm, there is again no creation process. Evaluation again requires only threshold firing. For LINK, there is no difference between the one-step and two-step cases.

5 Capacity and Interference

The intention of our representations is that when all or most of the nodes representing an item fire, the item is considered recognized. For example, the activation of sufficiently many neurons in a representation in a motor area of cortex would cause a certain muscle movement.
This style of representation gives rise to a pair of related concerns: How many items can be represented in the system? What exactly does it mean for an item to be represented if unforeseen interference from other activities in the circuit can occur? First, we note that these concerns occur in a fundamentally novel way in our approach as compared with some previous theories. In a traditional associative memory, for example, there is just one kind of execution, retrieval, and we can assume that nothing else is going on simultaneously with an instance of it. Hence, the notion of capacity, the number of items that can be stored, is analyzable in a clean manner (Graham & Willshaw, 1997). In this letter, we have two kinds of tasks, memorization and association. We also have diverse environments arising from different histories of past circuit creations. Further, we may want some robustness to other simultaneous activities in the circuit, and our longer-term aim may be to support further tasks. These factors make possible a large number of potential sources of interference, which we define to be the effect on the execution
of an algorithm of network conditions that arise from sources not specific to that execution. Our guarantee that the circuit acts correctly is only with respect to some specified set of noninterference conditions, such as relations 2.2, 2.3, 2.5, or 2.6, and a robustness condition that, as described in the section to follow, upper-bounds the total number of nodes that are active in the whole circuit. No guarantees are implied for situations that are outside these constraints. For example, if the circuit has a “seizure” so that half of the nodes extraneous to the task at hand fire, then this is a pathological condition for which no guarantees are offered.

A second observation is that our guarantees are only probabilistic and, at least in this article, only as computed by some numerical calculations of limited precision. In particular, in order to bound the probability of error in the various relations, we have computed the tails of the Bernoulli distribution B(n, p, k) by adding the individual nonnegligible terms T(n, p, i), using double-precision calculations. Since the number of terms grows with n, our accuracy was limited to about 10^−6 for the largest values of n that we considered. In principle, the calculations may be performed to arbitrary accuracy in the following sense. One could compute for what minimum integer x the probability of error is less than 10^−x for each relation. Doing such detailed calculations is beyond the scope of this article. It will suffice here to observe that certainly in some cases within the tables of parameters that we consider, the errors are much smaller than the claimed 10^−6, in fact, less than 10^−1000 in one extremity. As an example, we give a simple analytic upper bound on the error for relation 2.2″:
B(n, B(3r/2, p, km), r/10) ∼ 0

under the assumption 2.1″, nB(2r − r′, p, km) ∼ r, where, further, km = 2k. Now, for generic variables n, k, p, and b, if k = (1 + b)np and 0 ≤ b ≤ 1, then
B(n, p, k) ≤ exp(−b²np/3)

can be derived from Chernoff’s bound (Angluin & Valiant, 1979), where “exp” denotes exponentiation to the base e = 2.71 . . . . From this bound, it follows that
B(3r/2, p, 2k) ≤ exp(−(2k/(3rp/2) − 1)²(3rp/2)/3).

Now in all cases in the tables, rp < k (a fact that also follows from equation 2.1″ if we assume r′ to be negligible and r/n to be small enough to ensure that km = 2k exceeds the mean). It then follows by substitution that B(3r/2, p, 2k) ≤ exp(−rp/18). So if we choose values from the tables with n = 10^9, rp ≥ 180, and r ≥ 10^7, say, then equation 2.2″ becomes
B(10^9, exp(−10), 10^6) ∼ 0.

Applying the general bound on B given above a second time gives that the error in equation 2.2″ is at most B(10^9, 10^−4, 10^6) ≤ B(10^9, 10^−4, 2 · 10^5) ≤ exp(−10^5/3) ≤ 10^−1000. Thus, at one extremity of our range of parameters, the errors for the one relation 2.2″ are indeed extremely small.

Our point is that while for conceptually simpler models the notion of capacity, a single value for the number of items that can be represented, makes sense, for more complex models a more appropriate way of expressing the same notion is that of upper-bounding the probability of various kinds of interference in any execution of the task. For example, relation 2.2 does not give an absolute guarantee of the relevant interference’s not happening. It says that if A and B have total size 3r/2 and the graph is regarded as randomly generated with respect to those sets, then the probability of the unwanted interference is very small (e.g., 10^−6 or 10^−1000). This we interpret to say, roughly, that in the fixed network, if A is fixed and of size r and B is a random set of r/2 nodes, then the interference effect will occur with such small probability.

6 Robustness to Noise

The discussion on capacity referred to specific interactions in the network. A more generic source of interference that can be analyzed is that due to some fixed fraction of neurons being active in the network that are extraneous to the task being executed. The main question is the fraction of extraneous neurons that can be active without interfering with the intended effects of the task at hand. We shall call that fraction the noise rate σ. In general, we can refine each of our noninterference constraints to allow for the expected number s = σn of extraneous nodes being additionally active.
We have restricted the entries in Tables 1 and 2 to those where noninterference relations 2.2, 2.3, 2.5, 2.6, 2.2′, 2.3′, and 2.6′ held even with perturbations corresponding to noise rates σ in the range from 10⁻⁴ to 10⁻⁶. In all seven relations, we replaced the relevant quantities r or r/2, when they referred to the input neurons of the task, by r + σn or r/2 + σn, as appropriate. In particular, for the respective relations, the replacements were done for the following neuron sets: equations 2.2 and 2.2′: A, B; equations 2.3 and 2.3′: A, B, D; equation 2.5: A; and equations 2.6 and 2.6′: A₁, B. In Tables 3 and 4, the entries are restricted in an exactly analogous way. We note that with the exception of equations 2.3, 2.3′, and 2.3′′, it is clear that with a lower noise rate, it is easier to satisfy these equations. Estimates of the noise rates that can be tolerated can be made in a number of other senses also. For example, we could assume that all situations, including both circuit creations and executions, are subject to some noise rate, and solve equation 2.1 under that assumption. We also note that we have sought noise rates that can be supported by a very wide range of the parameters. Higher rates can be tolerated for individual parameter combinations.
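The effect of the noise substitution r → r + σn on a bound of the quoted form can be illustrated with a few lines of arithmetic. The values n = 10⁹, r = 10⁷, rp = 180, and σ = 10⁻⁴ come from the discussion above; the threshold k = 200 (chosen so that rp < k) is a hypothetical value, not an entry from the tables:

```python
import math

def bound(rp, k):
    """The quoted bound B(3r/2, p, 2k) <= exp(-(2k/(3rp/2) - 1)^2 * (3rp/2) / 3)."""
    mean = 1.5 * rp
    return math.exp(-((2 * k / mean - 1.0) ** 2) * mean / 3.0)

n, r, rp, k, sigma = 10**9, 10**7, 180.0, 200, 1e-4
p = rp / r
clean = bound(rp, k)                    # noiseless sets of size r
noisy = bound((r + sigma * n) * p, k)   # A and B inflated to r + sigma*n
```

Even at the top of the noise range considered (σ = 10⁻⁴), the inflation of r is only 1 percent, so the bound degrades by a modest constant factor while remaining tiny.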
Memorization and Association on a Realistic Neural Model
553
7 Predictions

The entries in our tables are all solutions to equations 2.1, 2.1′, or 2.1′′. Remarkably, there is the following simple interdependence among r, d, k, and n:

rd < kn < c₂ rd,   (∗)
where for each entry in Tables 1 to 4, c₂ is a modest constant. In fact, for all entries in the tables with k ≥ 64, it is the case that c₂ is smaller than 1.35. For entries with k = 32, k = 16, and k = 8, it is smaller than 1.6, 2.1, and 3.0, respectively. This relationship can be explained as follows. Equation 2.1 is of the form f(B(r, p, k)) = r/n, where f is a fixed function, here the squaring function. For constant k, the expectation rp of the associated binomial distribution will stay constant as r goes up by a factor of 10 and p goes down by a factor of 10, or, equivalently, as n goes up by a factor of 10 for constant d. (Note that this explains why the corresponding entries in the tables for the various values of n are approximately in the ratio of the magnitudes of n.) Since, in general, r/n is small, solutions of equation 2.1 will correspond to combinations of r, p, k in which k lies somewhat above the expectation. In other words, k will be somewhat above rp = rd/n. Since the binomial distribution falls off exponentially above the mean, k will not be much larger than rd/n, from which we deduce that kn is a little larger than rd. It is easy to see that the same argument also holds for relation 2.1′ if km = 2k. This simple relation can be taken as a prediction for systems that allocate memory in the style of our memorization mechanism, provided the number of representatives for a concept at the lower level, that is, A and B, is the same as at the next level, C. This is an attractive assumption for a memory system that treats all memorized concepts as "equal citizens." It may not be true for all systems. For example, in various levels of a vision or other sensory system, there may be amplification or reduction in the number of neurons that represent an item between the various levels, and in that case appropriate modifications of equations 2.1 or 2.1′ need to be solved instead.
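The argument above can be reproduced numerically. The sketch assumes, as the text suggests, that equation 2.1 has the form B(r, p, k)² = r/n with p = d/n and that B(r, p, k) is the upper tail P[Bin(r, p) ≥ k], which for small p is well approximated by a Poisson(rp) tail; the parameter point n, d, r is a hypothetical choice, not one of the paper's table entries:

```python
import math

def poisson_tail(lam, k):
    """P[Poisson(lam) >= k] via the complement of a log-space partial sum."""
    acc = 0.0
    for j in range(k):
        acc += math.exp(-lam + j * math.log(lam) - math.lgamma(j + 1))
    return 1.0 - acc

# Hypothetical parameter point chosen so that the mean rd/n is moderate.
n, d, r = 10**8, 10**4, 10**5
lam = r * d / n            # expectation rp = rd/n = 10
target = math.sqrt(r / n)  # B(r, p, k) must equal sqrt(r/n)
k = 1
while poisson_tail(lam, k) > target:
    k += 1
# k is now the smallest count whose tail probability drops below sqrt(r/n);
# it sits a little above the mean rd/n, so kn is a little larger than rd.
```

For this point the solver lands at k = 17 against a mean of 10, giving kn/rd = 1.7, comfortably inside the predicted window for small k.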
Finally, we note that a node in our formalism may be simulating a unit that consists of more than one biological neuron. For example, the suggestion that local connections between different layers in cortex may have the effect of increasing the effective degree of a node in the long-range connection network is analyzed in detail in Valiant (1994).

8 Discussion

We have shown that networks of model neurons, having the four parameters of neuron count, synapse count, strength of synapses, and switching time all within ranges widely observed in biology, can realize the two basic tasks of memory formation and association. We have given tables of values for
these parameters that are consistent with the quantitative constraints that we have identified as being sufficient for the realization of these two tasks. Our positive result is that entries exist for realistic combinations of these numerical parameters. Further, the algorithms needed for creating and executing the circuits for these tasks are of the simplest kind, requiring as little as one step of vicinal or neighborly interaction. The two basic tasks that we have considered here have been the basis for implementing a broader variety of cognitive tasks, including memorization of conjunctions and disjunctions, handling relations, and inductive learning, under the direct-action hypothesis of strong synapses (Valiant, 1994). For the less restrictive setting of the current article, any such broader implications that may follow have yet to be worked out. In particular, if items have a shared representation and these are to be the targets of inductive learning, then special challenges arise. If multiple concepts are being learned and the examples for them are intermingled in time, then having a single synapse take part in the learning of more than one concept would appear to be problematic. This work offers apparently the first explanation of how the basic cognitive tasks that we consider here can be performed at all by neural systems that have synaptic strengths that are as weak as those that are typically observed experimentally. It is probable that different neural systems exploit different combinations of the numerical parameters, and do so in different ways. It is possible, and even probable, for example, that higher-order cognitive tasks require disjoint representations and stronger synapses. Our methodology offers a calculus for investigating such phenomena.

Acknowledgments

This work was supported in part by grants from the National Science Foundation: NSF-CCR-98-77049, NSF-CCR-03-10882, NSF-CCF-0432037, and NSF-CCF-04-0427129.
I am also grateful to two anonymous referees for their helpful comments.

References

Abeles, M. (1991). Corticonics: Neural circuits of the cerebral cortex. Cambridge: Cambridge University Press.
Abramowitz, M., & Stegun, I. (1964). Handbook of mathematical functions. Cambridge, MA: National Bureau of Standards.
Ali, A. B., Deuchars, J., Pawelzik, H., & Thomson, A. M. (1998). CA1 pyramidal to basket and bistratified cell EPSPs: Dual intracellular recordings in rat hippocampal slices. J. Physiol., 507, 201–217.
Angluin, D., & Valiant, L. G. (1979). Fast probabilistic algorithms for Hamiltonian circuits and matchings. J. Comput. and Syst. Sciences, 8, 155–193.
Barlow, H. B. (1972). Single units and sensation: A neuron doctrine for perceptual psychology. Perception, 1, 371–394.
Beurle, R. L. (1955). Properties of a mass of cells capable of regenerating impulses. Philos. Trans. R. Soc. Lond. [Biol.], 240, 55–87.
Bollobás, B. (2001). Random graphs. Cambridge: Cambridge University Press.
Braitenberg, V. (1978). Cell assemblies in the cerebral cortex. In R. Heim & G. Palm (Eds.), Theoretical approaches to complex systems (pp. 171–188).
Braitenberg, V., & Schüz, A. (1998). Cortex: Statistics and geometry of neuronal connectivity. Berlin: Springer-Verlag.
Cook, S. A. (1971). The complexity of theorem proving procedures. In Proc. 3rd ACM Symp. on Theory of Computing (pp. 151–158). New York: ACM Press.
Feldman, J. A. (1982). Dynamic connections in neural networks. Biol. Cybern., 46, 27–39.
Feldman, J. A., & Ballard, D. H. (1982). Connectionist models and their properties. Cog. Sci., 6, 205–254.
Gerbessiotis, A. V. (2003). Random graphs in a neural computation model. International Journal of Computer Mathematics, 80, 689–707.
Graham, B., & Willshaw, D. (1997). Capacity and information efficiency of the associative net. Network: Comput. Neural Syst., 8, 35–54.
Griffith, J. S. (1963). On the stability of brain-like structures. Biophys. J., 3, 299–308.
Griffith, J. S. (1971). Mathematical neurobiology: An introduction to the mathematics of the nervous system. New York: Academic Press.
Markram, H., & Tsodyks, M. (1996a). Redistribution of synaptic efficacy: A mechanism to generate infinite synaptic input diversity from a homogeneous population of neurons without changing the absolute synaptic efficacies. J. Physiol. (Paris), 90, 229–232.
Markram, H., & Tsodyks, M. (1996b). Redistribution of synaptic efficacy between pyramidal neurons. Nature, 382, 807–810.
Papadimitriou, C. H. (1994). Computational complexity. Reading, MA: Addison-Wesley.
Shastri, L., & Ajjanagadde, V. (1993). From simple associations to systematic reasoning: A connectionist representation of rules, variables and dynamic bindings using temporal synchrony. Behavioral and Brain Sciences, 16(3), 417–494.
Shastri, L. (2001). A computational model of episodic memory formation in the hippocampal system. Neurocomputing, 38–40, 889–897.
Thomson, A. M., Deuchars, J., & West, D. C. (1993). Large, deep layer pyramid-pyramid single axon EPSPs in slices of rat motor cortex displayed paired pulse and frequency-dependent depression, mediated presynaptically and self-facilitation, mediated postsynaptically. J. Neurophysiol., 70, 2354–2369.
Valiant, L. G. (1988). Functionality in neural nets. In Proc. AAAI-88 (Vol. 2, pp. 629–634). San Mateo, CA: Morgan Kaufmann.
Valiant, L. G. (1994). Circuits of the mind. New York: Oxford University Press.

Received March 9, 2004; accepted July 6, 2004.
LETTER
Communicated by Bard Ermentrout
Effects of Noisy Drive on Rhythms in Networks of Excitatory and Inhibitory Neurons

Christoph Börgers
[email protected] Department of Mathematics, Tufts University, Medford, MA 02155, U.S.A.
Nancy Kopell
[email protected] Department of Mathematics and Center for BioDynamics, Boston University, Boston, MA 02215, U.S.A.
Synchronous rhythmic spiking in neuronal networks can be brought about by the interaction between E-cells and I-cells (excitatory and inhibitory cells). The I-cells gate and synchronize the E-cells, and the E-cells drive and synchronize the I-cells. We refer to rhythms generated in this way as PING (pyramidal-interneuronal gamma) rhythms. The PING mechanism requires that the drive II to the I-cells be sufficiently low; the rhythm is lost when II gets too large. This can happen in at least two ways. In the first mechanism, the I-cells spike in synchrony but get ahead of the E-cells, spiking without being prompted by the E-cells. We call this phase walkthrough of the I-cells. In the second mechanism, the I-cells fail to synchronize, and their activity leads to complete suppression of the E-cells. Noisy spiking in the E-cells, generated by noisy external drive, adds excitatory drive to the I-cells and may lead to phase walkthrough. Noisy spiking in the I-cells adds inhibition to the E-cells and may lead to suppression of the E-cells. An analysis of the conditions under which noise leads to phase walkthrough of the I-cells or suppression of the E-cells shows that PING rhythms at frequencies far below the gamma range are robust to noise only if network parameter values are tuned very carefully. Together with an argument explaining why the PING mechanism does not work far above the gamma range in the presence of heterogeneity, this justifies the "G" in "PING."

Neural Computation 17, 557–608 (2005)
© 2005 Massachusetts Institute of Technology

1 Introduction

The gamma rhythm, a 30 to 80 Hz rhythm in the nervous system, has been associated with early sensory processing (Singer & Gray, 1995; Fries, Neuenschwander, Engel, Goebel, & Singer, 2001; Fries, Roelfsema, Engel, König, & Singer, 1997), attention (Tiitinen et al., 1993; Pulvermüller, Birbaumer, Lutzenberger, & Mohr, 1997), and memory (Tallon-Baudry, Bertrand, Péronnet, & Pernier, 1998; Slotnick, Moo, Kraut, Lesser, & Hart, 2002). It has also been associated with the formation of cell assemblies, that is, temporarily synchronous sets of cells that work together in some aspect (Engel, König, Kreiter, Schillen, & Singer, 1992; Singer & Gray, 1995; Singer, 1999; Olufsen, Whittington, Camperi, & Kopell, 2003). However, it is still not completely understood which mechanisms underlie gamma rhythms, which biophysical parameters are important in these mechanisms, and how gamma rhythms participate in the formation of cell assemblies. Several different kinds of gamma rhythms have been observed in vivo, in vitro, and in computational studies. Synchronization may result from chemical synapses between inhibitory cells (Whittington, Traub, & Jefferys, 1995; Traub, Whittington, Colling, Buzsáki, & Jefferys, 1996; Wang & Buzsáki, 1996; Whittington, Traub, Kopell, Ermentrout, & Buhl, 2000; Tiesinga, Fellous, José, & Sejnowski, 2001; Hansel & Mato, 2003). Gamma rhythms generated in this way have been called interneuronal gamma (ING) rhythms (Whittington et al., 2000) or γ-I rhythms (Tiesinga et al., 2001). The mechanism has also been called the mutual inhibition mechanism by Hansel and Mato (2003). In this letter, we study gamma rhythms resulting from chemical synapses between excitatory (pyramidal) cells and inhibitory cells. The I-cells (inhibitory cells) gate and synchronize the E-cells (excitatory cells), and the E-cells drive and synchronize the I-cells (Whittington et al., 2000; Tiesinga et al., 2001; Hansel & Mato, 2003). Gamma rhythms of this kind have been called pyramidal-interneuronal gamma (PING) rhythms (Whittington et al., 2000) or γ-II rhythms (Tiesinga et al., 2001). The mechanism has also been called the cross-talk mechanism by Hansel and Mato (2003).¹ In vitro, PING is produced by tetanic stimulation (Whittington et al., 2000).
PING is believed to be associated with the creation of cell assemblies (Whittington et al., 2000; Olufsen et al., 2003). Other mechanisms, not modeled here, are believed to play a role in the generation of at least some gamma rhythms. Electrical coupling between dendrites of interneurons can contribute to enhancing the coherence of rhythms (Tamás, Buhl, Lőrincz, & Somogyi, 2000; Traub et al., 2001). Electrical coupling between axons of pyramidal cells is believed to play a central role in driving persistent gamma oscillations in the CA3 region of the hippocampus (Traub et al., 2000). Chattering cells, which intrinsically generate bursts of action potentials at gamma frequency, play a role in generating some gamma rhythms (Gray & McCormick, 1996; Traub, Buhl, Gloveli, & Whittington, 2003). Unlike these forms of gamma, PING can be captured by models that abstract from much of the biophysical detail. In general, pyramidal cells are capable of producing many ionic currents. However, some currents that are not involved in producing action potentials, such as Ih and IT, are negligible during PING activity, since the voltage does not become sufficiently low during any part of the cycle. Other currents, such as slow outward potassium currents, are weakened by the large activation of metabotropic glutamate receptors needed to produce PING (Storm, 1989; Charpak, Gähwiler, Do, & Knöpfel, 1990; Whittington et al., 2000). Consequently, during PING, the electrophysiological behavior is well described by standard Hodgkin-Huxley equations, which can often be well approximated by reduced equations such as the quadratic integrate-and-fire model (Latham, Richmond, Nelson, & Nirenberg, 2000) and the theta model (Ermentrout & Kopell, 1986; Hoppensteadt & Izhikevich, 1997; Gutkin & Ermentrout, 1998). These reductions and relevant parameter regimes are described in section 2.

¹ PING is one of two "cross-talk mechanisms" discussed by Hansel and Mato (2003).

It is easy to describe and understand the PING mechanism in a network consisting of just one E-cell and one I-cell. The E-cell fires and excites the I-cell enough to fire. The I-cell fires and temporarily inhibits the E-cell. This simple mechanism, discussed in more detail in section 3, requires that external excitatory drive to the I-cell be sufficiently weak that the I-cell fires only in response to the E-cell, not on its own. When this condition is violated, phase walkthrough of the I-cell occurs, that is, the I-cell does not wait for input from the E-cell to fire, destroying the regular rhythm.² The region in parameter space in which phase walkthrough occurs is bounded by a hypersurface that we call the phase walkthrough boundary. It is discussed in detail in section 4. Clearly, phase walkthrough of the I-cell occurs more easily when the E-cell is driven less strongly. The dynamical properties of the rhythm become more subtle when there are large populations of E- and I-cells. The key to the PING rhythm in large networks is the mechanism by which the I-cells synchronize the E-cells. Section 5 gives an explanation of this mechanism.
If approximate synchronization is to occur in the presence of heterogeneity, the ratio

r = (strength of inhibitory synaptic currents into E-cells) / (strength of excitatory input drive to E-cells)   (1.1)
must be sufficiently large (see section 5). (A refined definition of r is given later.) In a two-cell network, phase walkthrough of the I-cells is the only mechanism by which the PING rhythm can be lost as external drives are varied. In larger networks, there is a second important mechanism: suppression of the E-cells by asynchronous activity of the I-cells. (The I-cells are most effective at suppressing the E-cells when they are asynchronous. See appendix A.) In a simulation in which the I-cells are asynchronous initially and receive strong external drive, suppression of the E-cells may occur immediately, preventing the PING mechanism from synchronizing the network. The region in parameter space in which asynchronous activity of the I-cells is capable of suppressing the E-cells is bounded by a hypersurface that we call the suppression boundary. It is discussed in detail in section 6. It turns out that the quantity r plays the central role here: suppression occurs more easily for larger r. Bistability between suppression and synchrony is analyzed in section 7. Our study of the phase walkthrough and suppression boundaries casts light on the behavior of PING rhythms in the presence of noisy external drive. The rhythm may not be disrupted by occasional out-of-order spiking of E-cells, caused by noisy external drive. However, too much noisy spiking of the E-cells generates too much tonic excitatory drive to the I-cells, resulting in phase walkthrough. This is analyzed in section 8. Since phase walkthrough of the I-cells occurs more easily for weaker external drive to the E-cells, PING rhythms are more easily abolished by noisy activity in the E-cells when external drive to the E-cells is weaker; the analysis in section 8 makes this quantitative. Similarly, the rhythm may not be disrupted by occasional out-of-order spiking of I-cells. However, too much noisy spiking of the I-cells generates too much tonic inhibitory drive to the E-cells, resulting in their suppression. This is analyzed in section 9. Since suppression of the E-cells occurs easily for large values of r, PING rhythms are easily abolished by noisy activity in the I-cells when r is large. The analysis and most of the simulations presented in this article assume homogeneity in network parameter values and all-to-all connectivity. However, we believe that moderate amounts of heterogeneity and significant sparseness in connectivity do not affect the conclusions. Some simulations, including heterogeneity and sparseness, are presented in section 10.

² Throughout this letter, we take the word rhythm to denote a regular rhythm. For instance, the firing pattern in the right panel of Figure 4 is not considered a "rhythm."
A decrease in IE makes phase walkthrough of the I-cells caused by noisy spiking of the E-cells more likely; an increase in r makes the rhythm more susceptible to noise in the I-cells. Our results thus imply that slower PING rhythms are less robust to noise. A PING rhythm will be disrupted by very low levels of noise if its period is many times larger than the decay time constant τI of inhibition, that is, if its frequency is far below the gamma range. If one attempts to raise the frequency above the gamma range (i.e., if one tries to decrease its period below τI ) by raising IE , leaving other parameter values unchanged, then r drops, causing the synchronization mechanism to break down, particularly in the presence of heterogeneity. To restore synchronization, one must raise the strength of the I→E synapses. This brings the frequency back into the gamma range. These arguments are presented in section 11. In summary, our results justify the “G” in “PING”: to be robust in the presence of both heterogeneity and noise, a PING rhythm must have a period that is greater than τI , but not many times greater.
2 Networks of Theta Neurons

2.1 The Theta Model. In the theta model, a neuron is represented by a point P = (cos θ, sin θ) moving on the unit circle S¹. This is analogous to the Hodgkin-Huxley model, which represents a periodically spiking space-clamped neuron by a point moving on a limit cycle in a four-dimensional phase space. In the absence of synaptic coupling, the differential equation describing the motion on S¹ is

dθ/dt = (1 − cos θ)/τ + I (1 + cos θ).   (2.1)
Here, I should be thought of as an input "current," measured in radians per unit time. The time constant τ > 0 is needed to make equation 2.1 dimensionally correct. Its meaning will be clarified shortly. For negative I, equation 2.1 has two fixed points, one stable and the other unstable. As I increases, the fixed points approach each other. When I = 0, a saddle node bifurcation occurs: the fixed points collide, and cease to exist for positive I. For a theta neuron, to "spike" means to reach θ = π (modulo 2π), by definition. The transition from I < 0 to I > 0 is the analog of the transition from excitability to spiking in a neuron. If I > 0 and −π ≤ θ₁ ≤ θ₂ ≤ π, the time it takes for θ to rise from θ₁ to θ₂ equals

∫_{θ₁}^{θ₂} dθ / ((1 − cos θ)/τ + I(1 + cos θ)) = √(τ/I) [arctan(tan(θ/2)/√(τI))]_{θ₁}^{θ₂}.   (2.2)

Setting θ₁ = −π and θ₂ = π in this formula, we find that the period is

T = π √(τ/I).   (2.3)

We denote the time it takes for θ to rise from π/2 to 3π/2 by δ. This should roughly be thought of as the spike duration. Applying formula 2.2 with (θ₁, θ₂) = (π/2, π) and (θ₁, θ₂) = (−π, −π/2) and adding the results, we find

δ = √(τ/I) (π − 2 arctan(1/√(τI))).   (2.4)

For physiological realism, we wish to ensure δ/T ≪ 1. By equations 2.3 and 2.4,

δ/T = 1 − (2/π) arctan(1/√(τI)).
Therefore, δ/T ≪ 1 means the same as τI ≪ 1. Since arctan(1/ε) = π/2 − ε + O(ε²) as ε → 0, equation 2.4 implies δ ≈ 2τ when τI ≪ 1. This reveals the meaning of τ: in the parameter regime of interest to us, τ is approximately half the spike duration. Motivated by this discussion and by the fact that spike durations in real neurons are on the order of milliseconds, we set τ = 1 for the remainder of this article, think of time as measured in milliseconds, and always consider input currents I ≪ 1. The frequency of a theta neuron is defined to be

ν = 1000/T.   (2.5)
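The period and spike-duration formulas 2.3 and 2.4 can be checked by direct numerical integration of equation 2.1 (a sketch; τ and I are arbitrary illustrative values with τI ≪ 1):

```python
import math

tau, I = 1.0, 0.04   # illustrative values; T should be pi*sqrt(tau/I) = 5*pi

def rhs(theta):
    """Right-hand side of equation 2.1."""
    return (1 - math.cos(theta)) / tau + I * (1 + math.cos(theta))

# Integrate dtheta/dt = rhs(theta) from -pi to pi with classical RK4.
theta, t, dt = -math.pi, 0.0, 1e-3
t_half = None                      # time at which theta first reaches pi/2
while theta < math.pi:
    k1 = rhs(theta)
    k2 = rhs(theta + 0.5 * dt * k1)
    k3 = rhs(theta + 0.5 * dt * k2)
    k4 = rhs(theta + dt * k3)
    theta += dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6
    t += dt
    if t_half is None and theta >= math.pi / 2:
        t_half = t

T_pred = math.pi * math.sqrt(tau / I)                  # equation 2.3
delta_pred = math.sqrt(tau / I) * (math.pi - 2 * math.atan(1 / math.sqrt(tau * I)))  # eq. 2.4
# By the symmetry rhs(theta) = rhs(-theta), time(-pi -> -pi/2) = time(pi/2 -> pi),
# so the spike duration is twice the time spent above pi/2.
delta_sim = 2 * (t - t_half)
```

With τI = 0.04 the measured δ is also close to 2τ, in line with the approximation above.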
Since we think of t as time measured in milliseconds, ν should be thought of as frequency measured in Hz. Neuronal models are called of type I if the transition from excitability to spiking involves a saddle node bifurcation on an invariant circle, and of type II if it involves a subcritical Hopf bifurcation (Hodgkin, 1948; Ermentrout, 1996; Gutkin & Ermentrout, 1998; Rinzel & Ermentrout, 1998; Izhikevich, 2000). Thus, the theta model is a type I neuronal model. It is canonical, in the sense that other type I models can be reduced to it by coordinate transformations (Ermentrout & Kopell, 1986; Hoppensteadt & Izhikevich, 1997). In this letter, the E-cells are always modeled as theta neurons. The neuronal model used for the I-cells is irrelevant for many of our arguments; where it matters, we model the I-cells as theta neurons as well. The theta model can also be derived from the quadratic integrate-and-fire model (Latham et al., 2000).³ The general equation of this model is

dV/dt = a(V − V₀)(V − V₁) + Q,   (2.6)

with a > 0 and V₀ < V₁. After scaling and shifting V and scaling t and Q, equation 2.6 becomes

dV/dt = 2V(V − 1) + Q.   (2.7)
For Q < 1/2, equation 2.7 has two fixed points, one stable and the other unstable. As Q increases, the fixed points approach each other. When Q = 1/2, a saddle node bifurcation occurs: the fixed points collide and cease

³ This derivation was shown to us by Rob Clewley, who learned it from Bard Ermentrout.
Figure 1: Voltage traces for quadratic integrate-and-fire neuron with (A) threshold at 1 and reset at 0 and (B) threshold at ∞ and reset at −∞.
to exist for Q > 1/2. We supplement equation 2.7 with the reset condition

V(t + 0) = 0 if V(t − 0) = 1.   (2.8)
Figure 1A shows V as a function of t for Q = 0.51. The change of variables

V = 1/2 + (1/2) tan(θ/2)   (2.9)

transforms equation 2.7 into equation 2.1 with

I = (Q − 1/2) / (1/2).   (2.10)
Note that I is the relative deviation of Q from the threshold value 1/2. The only difference between the quadratic integrate-and-fire model and the theta model lies in the reset condition. The two models are precisely equivalent if equation 2.8 is replaced by

V(t + 0) = −∞ if V(t − 0) = ∞.   (2.11)
(Note that V rises from −∞ to ∞ in finite time when Q > 1/2.) Figure 1B illustrates the effect of replacing equation 2.8 by 2.11.

2.2 Synapses. The derivation of the theta neuron from the quadratic integrate-and-fire neuron offers a way of modeling conductance-based synapses in the framework of the theta model.⁴ One begins by adding a synaptic current to the right-hand side of equation 2.7,

dV/dt = −2V(1 − V) + Q + gs(Vrev − V),   (2.12)
where g denotes the maximum conductance, s = s(t) ∈ [0, 1] is a synaptic gating variable (the quantity that the synapse directly acts on), and Vrev is the reversal potential of the synapse. For excitatory synapses, Vrev should be substantially above the threshold voltage. For inhibitory synapses, it should be somewhat below the reset voltage. We use Vrev = 6.5 for excitatory synapses and Vrev = −0.25 for inhibitory synapses throughout this article. Numerical experiments indicate that changes in these values have no qualitative and little quantitative effect on our conclusions. With the change of variables (see equation 2.9), equation 2.12 becomes

dθ/dt = 1 − cos θ + [I + (2Vrev − 1)gs(t)](1 + cos θ) − gs(t) sin θ.

Thus, our equation for a theta neuron receiving an excitatory synapse (Vrev = 6.5) is

dθ/dt = 1 − cos θ + [I + 12gs(t)](1 + cos θ) − gs(t) sin θ,

and our equation for a theta neuron receiving an inhibitory synapse (Vrev = −0.25) is

dθ/dt = 1 − cos θ + [I − (3/2)gs(t)](1 + cos θ) − gs(t) sin θ.   (2.13)

For theoretical arguments in this article, we let

s(t) = H(t − t₀) e^{−(t−t₀)/τD},   (2.14)

where t₀ denotes the time of the spike of the presynaptic neuron, H is the Heaviside function,

H(t) = 1 if t > 0, 0 if t ≤ 0,   (2.15)

⁴ This too was shown to us by Rob Clewley, with attribution to Bard Ermentrout.
and τD > 0 is the synaptic decay time constant. In numerical simulations of networks of theta neurons, we use a smooth approximation to equation 2.14, defining s to be a solution of the differential equation

ds/dt = −s/τR... (corrected below)

ds/dt = −s/τD + e^{−η(1+cos θ)} (1 − s)/τR,

where θ denotes the dependent variable associated with the presynaptic neuron, η = 5, and τR = 0.1. Thus, s rises rapidly toward 1 when θ ≈ π modulo 2π and decays exponentially with time constant τD otherwise.

2.3 Networks of Theta Neurons. The numerical simulations presented in this article are for networks of excitatory and inhibitory theta neurons. Unless stated otherwise, the connectivity is all-to-all, and there is no heterogeneity in network properties. The following notation will be used throughout.

NE = number of E-cells in the network
NI = number of I-cells in the network
τE = synaptic decay time constant for excitatory synapses
τI = synaptic decay time constant for inhibitory synapses
IE = external drive to E-cells
II = external drive to I-cells
gIE = sum of all conductances associated with inhibitory synapses acting on a given E-cell (so an individual I→E synapse has strength gIE/NI)
gII = sum of all conductances associated with inhibitory synapses acting on a given I-cell (so an individual I→I synapse has strength gII/NI)
gEI = sum of all conductances associated with excitatory synapses acting on a given I-cell (so an individual E→I synapse has strength gEI/NE)
gEE = sum of all conductances associated with excitatory synapses acting on a given E-cell (so an individual E→E synapse has strength gEE/NE)
TE = intrinsic period of E-cells = π/√IE (see equation 2.3)
TI = intrinsic period of I-cells (see section 2.4)
TP = period of population rhythm
νE = intrinsic frequency of E-cells = 1000/TE (see equation 2.5 and the comment following it)
νI = intrinsic frequency of I-cells = 1000/TI
νP = frequency of population rhythm = 1000/TP
r = (3/2) gIE/IE
The definition of r given here will, for the remainder of the article, replace the informal definition in equation 1.1. The reason for including the factor 3/2 in the definition of r will become apparent in section 3.

2.4 Parameter Choices. Unless otherwise stated, we use τE = 2 and τI = 10 in this letter. This is motivated by the decay time constants of excitatory synapses involving AMPA receptors, approximately 2 ms, and inhibitory synapses involving GABAA receptors, approximately 10 ms. (Recall that we think of t as time in milliseconds.) As discussed in section 2.1, the external drive I to a theta neuron should be ≪ 1 for the theta model to be biologically reasonable. We therefore assume that the external drives IE and II are ≪ 1. An external drive I corresponds to the period T = π/√I, and therefore to the frequency ν = 1000√I/π. For example, I = 0.1 corresponds to ν ≈ 100, and I = 0.4 corresponds to ν ≈ 200. For the PING rhythms studied in this article, it is important that gIE be at least comparable in size to IE (see section 5). We will often choose gII = gIE, but we will also discuss the effects of choosing much smaller values of gII, or even gII = 0. We choose gEI in such a way that a population spike of the E-cells promptly triggers a population spike of the I-cells but does not trigger multiple spikes. If the I-cells are modeled as quadratic integrate-and-fire neurons, the following argument gives a rough indication of how large gEI should be. Consider a network consisting of a single E-cell and a single I-cell. (This is equivalent to a network in which each group of cells, E and I, is fully synchronized.) Recall that Vrev = 6.5 for excitatory synapses. Since V is near 0 except during spikes, we approximate the term gs(Vrev − V) in equation 2.12 by gsVrev. We use the idealized form of s given by equation 2.14. Thus, a spike of the E-cell at time t₀ gives rise to injection of the current gEI H(t − t₀) e^{−(t−t₀)/τE} Vrev into the I-cell. Since τE = 2 is small in comparison with the periods of the rhythms of interest to us, we introduce the additional approximation that the charge injection resulting from an excitatory synapse is instantaneous, causing a rise in the membrane potential of the I-cell by ΔV = gEI τE Vrev. The excitatory synapse is sure to cause a nearly instantaneous spike of the I-cell if ΔV = 1, that is, if gEI = 1/(τE Vrev). Since we use τE = 2 and Vrev = 6.5 throughout, this means gEI = 1/13 ≈ 0.077. However, even ΔV = 0.2 typically leads to a spike in the I-cell after a brief delay, so even gEI = 0.2/(τE Vrev) = 0.2/13 ≈ 0.015 will suffice for PING rhythms. In the CA1 region of the hippocampus, E→E connections are known to have low density (Knowles & Schwartzkroin, 1981). Also, some types
Rhythms in the Presence of Noise
567
of gamma oscillations are induced by acetylcholine (Fisahn, Pike, Buhl, & Paulsen, 1998), which in turn suppresses recurrent excitatory connections (Hasselmo, 1999). Motivated by these experimental findings and following many previous modeling studies of gamma rhythms (e.g., Whittington, Stanford, Colling, Jefferys, & Traub, 1997), we assume gEE = 0 throughout this article. It would be interesting to study the effects of E→E synapses in the presence of noise, particularly in the presence of noisy spiking of the E-cells; however, this will be deferred to future work.

2.5 The Intrinsic Frequency of the I-Cells. The intrinsic frequency νI of the I-cells is the frequency at which the I-cells would spike if the E→I synapses were removed. In the presence of I→I synapses, νI differs from the frequency of an isolated I-cell and depends on whether the I-cells synchronize. In many of our arguments, we will make no specific assumption about the neuronal model used for the I-cells, and simply consider νI a network parameter. However, we will also apply our general results to the case when the I-cells are modeled as theta neurons. For this, we will need to compute νI. The intrinsic frequency νI is related to the intrinsic period TI by νI = 1000/TI. We will discuss how to compute TI.

2.5.1 Intrinsic Period in the Absence of I→I Synapses. If gII = 0, then

TI = π/√II  (2.16)

by equation 2.4.

2.5.2 Intrinsic Period in the Presence of I→I Synapses, Assuming Synchronous Network Activity. Assume now that gII > 0. We model synapses in the idealized way described by equations 2.13 and 2.14. If the I-cells are in perfect synchrony, the period TI is the time that it takes for θ to reach π if θ is governed by

dθ/dt = 1 − cos θ + [II − (3/2) gII e−t/τI](1 + cos θ) − gII e−t/τI sin θ,  θ(0) = −π.  (2.17)

This cannot be computed analytically, but it is easy to compute numerically.

2.5.3 Intrinsic Period in the Presence of I→I Synapses, Assuming Asynchronous Network Activity. If the I-cells are in complete asynchrony, the function
568
C. Börgers and N. Kopell
e−t/τI in equation 2.17 is replaced by its time average. (We neglect fluctuations resulting from the finiteness of the network.) Thus, each I-cell is governed by

dθ/dt = 1 − cos θ + [II − (3/2) gII s](1 + cos θ) − gII s sin θ  (2.18)

with

s = (1/TI) ∫₀^TI e−t/τI dt = (1 − e−TI/τI)/(TI/τI).  (2.19)
If θ obeys equation 2.18, and θ(0) = −π, then θ reaches π at time

∫_{−π}^{π} dθ / [1 − cos θ + (II − (3/2) gII s)(1 + cos θ) − gII s sin θ]
= [arctan((tan(θ/2) − gII s/2) / √(II − (3/2) gII s − (gII s)²/4)) / √(II − (3/2) gII s − (gII s)²/4)]_{−π}^{π}
= π / √(II − (3/2) gII s − (gII s)²/4).  (2.20)

This formula holds if II − (3/2) gII s − (gII s)²/4 > 0. For II − (3/2) gII s − (gII s)²/4 ≤ 0, θ = π is never reached. The period TI is therefore determined by

TI = π / √(II − (3/2) gII s − (gII s)²/4).  (2.21)
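Equation 2.21 determines TI only implicitly, since s on the right-hand side depends on TI through equation 2.19. A minimal Python sketch of the numerical solution by bisection (function names and default parameter values are ours, for illustration only):

```python
import math

def s_avg(T_I, tau_I):
    # Time average of exp(-t/tau_I) over one period (equation 2.19).
    return (1.0 - math.exp(-T_I / tau_I)) / (T_I / tau_I)

def rhs(T_I, I_I, g_II, tau_I):
    # Right-hand side of equation 2.21; infinite if theta never reaches pi.
    s = s_avg(T_I, tau_I)
    disc = I_I - 1.5 * g_II * s - (g_II * s) ** 2 / 4.0
    return math.pi / math.sqrt(disc) if disc > 0.0 else float("inf")

def intrinsic_period(I_I, g_II, tau_I=10.0, tol=1e-10):
    # T - rhs(T) is strictly increasing (rhs is strictly decreasing in T),
    # so the unique root can be bracketed and then bisected.
    lo, hi = 1e-6, 1.0
    while hi - rhs(hi, I_I, g_II, tau_I) < 0.0:
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mid - rhs(mid, I_I, g_II, tau_I) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

With gII = 0 the result collapses to equation 2.16, TI = π/√II, which provides a convenient consistency check; with gII > 0 the computed period is longer, reflecting the I→I inhibition.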
Note that the right-hand side depends on TI , since s does. We cannot solve this equation analytically. However, it has a unique solution because the right-hand side is a strictly decreasing function of TI . This solution can easily be found numerically, for instance, using the bisection method. 2.5.4 An Observation on the Effects of I→I Synapses for Weakly Driven I-Cells. A major focus of this letter is the behavior of PING rhythms for weak external drives. We will show later that PING rhythms tend to become noise sensitive when the external drives become weak, but that the noise sensitivity can be counteracted, to some extent, by introducing I→I synapses. This motivates our interest in the effects of I→I synapses in the limit as II → 0. The main result of the following discussion is equation 2.23, which will play a role in section 7. Under the assumption of perfect synchrony of the I-cells, I→I synapses become irrelevant as II → 0. In this limit, TI → ∞, so the decay time τI
Rhythms in the Presence of Noise
569
of the inhibitory synapses becomes negligible in comparison with TI, and therefore

TI ∼ π/√II as II → 0.

(The symbol ∼ expresses that the ratio of the two quantities converges to 1.) Under the assumption of complete asynchrony, the I→I synapses remain relevant in the limit as II → 0. As II → 0, TI → ∞ and therefore

s ∼ τI/TI  (2.22)

by equation 2.19. Using equation 2.22 in equation 2.21, we find that as II → 0, the fixed-point equation for TI becomes

TI = π / √(II − (3/2) gII τI/TI − (gII τI/TI)²/4).

This is equivalent to

II TI² = (3/2) gII τI TI + C  with  C = π² + (gII τI)²/4,

so

TI ∼ (3/2) gII τI/II + C/(II TI) ∼ (3/2) gII τI/II.  (2.23)
Comparing equation 2.23 with 2.16, we see that the presence of I→I synapses makes an important difference in the limit as II → 0.

2.6 The Asynchronous State. In our simulations, we frequently initialize theta neurons in a state of complete asynchrony. We define here what we mean by "complete asynchrony." Consider a theta neuron with an external drive I > 0. Suppose that θ(0) = θ0 ∈ (−π, π). The time to the next spike is then

R = ∫_{θ0}^{π} (dt/dθ) dθ = ∫_{θ0}^{π} dθ / [1 − cos θ + I(1 + cos θ)]
= (1/√I) [arctan(tan(θ/2)/√I)]_{θ0}^{π}
= (1/√I) (π/2 − arctan(tan(θ0/2)/√I)).  (2.24)
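Equation 2.24 can be inverted to place a neuron at the phase corresponding to a prescribed time-to-spike; this inversion is carried out just below as equation 2.25. A small Python sketch (our own illustrative names, not the authors' code):

```python
import math, random

def time_to_spike(theta0, I):
    # Equation 2.24: time for a theta neuron with drive I > 0 to reach pi.
    return (math.pi / 2.0 - math.atan(math.tan(theta0 / 2.0) / math.sqrt(I))) / math.sqrt(I)

def asynchronous_theta0(I, rng):
    # Equation 2.25: draw theta0 so that the time to the next spike is
    # R = U * pi / sqrt(I) with U uniform in (0, 1), i.e., first-spike
    # times are spread uniformly over one intrinsic period.
    V = 1.0 - 2.0 * rng.random()   # uniform in (-1, 1]
    return 2.0 * math.atan(math.sqrt(I) * math.tan(V * math.pi / 2.0))

rng = random.Random(0)
phases = [asynchronous_theta0(0.1, rng) for _ in range(1000)]
```

Taking U = 1/4 (so V = 1/2) gives θ0 = 2 arctan(√I), and time_to_spike then returns π/(4√I), confirming the inversion.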
To initialize a population of uncoupled neurons in complete asynchrony, we choose R = Uπ/√I with U ∈ (0, 1) random, uniformly distributed. We then compute θ0 from equation 2.24:

θ0 = 2 arctan(√I tan(Vπ/2)),  (2.25)

with V = 1 − 2U uniformly distributed in (−1, 1).

3 PING in a Simple Two-Cell Network

In this section, we consider a network of a single E-cell and a single I-cell. The E-cell is modeled as a theta neuron; it is assumed to receive external drive IE > 0. We make no explicit assumption about the neuronal model used for the I-cell, but denote by νI its intrinsic frequency. (If the I-cell is driven below threshold, νI = 0.) We make the following idealizing assumptions:

1. A spike of the E-cell instantaneously triggers a spike of the I-cell but has no effect lasting beyond the spike time.
2. The spike of the I-cell gives inhibitory input to the E-cell in the idealized form described by equation 2.14.
3. The I-cell does not spike again until prompted by the next spike of the E-cell.

Assumptions 1 and 3 imply that the I-cells spike exactly once per oscillation cycle. By strengthening the E→I synapses, one can generate PING-like rhythms in which each I-cell fires multiple times on each oscillation cycle. These rhythms differ from those considered in this letter only in some details of minor interest; for instance, the oscillation frequency is reduced if the I-cell population fires several population spikes on each oscillation cycle. We note that spike doublets of the I-cells play a much subtler and more important role when conduction delays are substantial (Ermentrout & Kopell, 1998). However, in this article, we neglect conduction delays.

If the two cells spike at t = 0, the E-cell is governed by the initial value problem

dθ/dt = 1 − cos θ + [IE − (3/2) gIE e−t/τI](1 + cos θ) − gIE e−t/τI sin θ for t > 0,  (3.1)
θ(0) = −π,  (3.2)
until it spikes again. We denote by TP the time at which θ, governed by equations 3.1 and 3.2, reaches π, and write νP = 1000/TP. Assumption 3 then becomes

νI ≤ νP.  (3.3)
Figure 2: The PING frequency νP as a function of gIE and IE , for τI = 10.
We will now discuss the dependence of the frequency νP on IE, gIE, and τI. Our main conclusion will be that νP depends weakly on IE and gIE but strongly on τI. For τI = 10, Figure 2 shows the graph of νP as a function of IE and gIE. The graph is quite flat in the region in which neither gIE/IE nor IE is small. We will argue in later sections that in those regions in which gIE/IE or IE are small, PING rhythms are not robust. Specifically, one needs gIE/IE > 1 for rapid and robust synchronization in large networks (see section 5), and PING rhythms with small IE are highly susceptible to noise in the E-cells (see section 8). Thus, Figure 2 shows that in the most relevant parameter regime, the dependence of νP on IE and gIE is fairly weak. We will next derive an approximate formula for TP, valid for sufficiently large gIE/IE. This formula will confirm that the dependence of TP on IE and gIE is weak and also show the importance of τI. Defining

J = IE − (3/2) gIE e−t/τI,

equations 3.1 and 3.2 can be written as follows:

dθ/dt = 1 − cos θ + J(1 + cos θ) − (2/3)(IE − J) sin θ for t > 0,  (3.4)
dJ/dt = (IE − J)/τI for t > 0,  (3.5)
θ(0) = −π,  (3.6)
Figure 3: (A) Phase portrait for equations 3.4 and 3.5 with IE = 0.1 and τI = 10, with the stable river indicated in bold. (B) q = (IE − J∗ )/IE as a function of IE and τI .
J(0) = IE − (3/2) gIE.  (3.7)
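The period TP, and hence νP = 1000/TP, can be computed by direct numerical integration of equations 3.1 and 3.2 (equivalently, equations 3.4 to 3.7). A Python sketch, not the authors' code, using a fixed-step RK4 scheme with linear interpolation to locate the crossing of θ = π:

```python
import math

def ping_period(I_E, g_IE, tau_I=10.0, dt=1e-3):
    # Integrate equation 3.1 from theta(0) = -pi until theta reaches pi;
    # returns T_P in ms.
    def f(t, th):
        g = g_IE * math.exp(-t / tau_I)   # decaying inhibitory conductance
        return (1.0 - math.cos(th)
                + (I_E - 1.5 * g) * (1.0 + math.cos(th))
                - g * math.sin(th))
    t, th = 0.0, -math.pi
    while True:
        k1 = f(t, th)
        k2 = f(t + dt / 2.0, th + dt * k1 / 2.0)
        k3 = f(t + dt / 2.0, th + dt * k2 / 2.0)
        k4 = f(t + dt, th + dt * k3)
        th_new = th + dt * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
        if th_new >= math.pi:
            # linear interpolation of the crossing time
            return t + dt * (math.pi - th) / (th_new - th)
        t, th = t + dt, th_new

T_P = ping_period(I_E=0.1, g_IE=0.2)
nu_P = 1000.0 / T_P   # section 8.2.1 reports nu_P of roughly 40 for these values
```

Reducing dt refines the estimate; the text notes that νP can be computed to ten-digit accuracy this way (section 4.1).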
We will study the phase portrait for the two-dimensional dynamical system, equations 3.4 and 3.5. This is very similar to a discussion for inhibitory current pulses (not inhibitory synaptic inputs) that we gave earlier (Börgers & Kopell, 2003). The phase portrait for equations 3.4 and 3.5 is shown in Figure 3A for IE = 0.1 and τI = 10. The figure should be extended periodically in θ with period 2π. The flow is upward, in the direction of increasing J. The most striking feature of the phase portrait is the existence of strongly attracting and strongly repelling trajectories. Trajectories of this kind are found in many systems of ordinary differential equations and are called rivers
(Diener, 1985a, 1985b). The figure reveals a stable river, that is, a trajectory (θs, Js) with

Js(t) = IE − (3/2) gIE e−t/τI,  (3.8)

that is attracting in forward time. The stable river is indicated as a bold line in Figure 3A. As t → −∞, Js → −∞, and θs → θs0, where θs0 is the unique solution in (−π, π) of

1 + cos θ + (2/3) sin θ = 0.

(It is easy to see that θs0 = −2 arctan(3/2) ≈ −1.966.) We denote by T∗ the time when θs(T∗) = π, and define

J∗ = Js(T∗) ∈ (0, IE).  (3.9)
We define

r = (3/2) gIE / IE,  (3.10)

and assume

r > 1,  (3.11)

so J(0) < 0. If J(0) is sufficiently negative, the trajectory (θ(t), J(t)) is rapidly attracted to (θs(t), Js(t)). At the time when θ = π, we therefore have J ≈ J∗, or equivalently, t ≈ T∗. Thus, TP is approximately T∗. Equations 3.8 and 3.9 imply that

T∗ = ln((3/2) gIE / (IE − J∗)) τI.  (3.12)

We write

q = q(τI, IE) = (IE − J∗)/IE ∈ (0, 1).  (3.13)

Figure 3B shows the graph of q. From equations 3.10, 3.12, and 3.13,

T∗ = ln(r/q) τI.  (3.14)
Of course, this is not an explicit formula for T∗ , since q depends on IE and τI and is not given by an explicit formula. Equation 3.14 does, however,
Figure 4: From PING (left panel) to phase walkthrough (right panel) as a result of raising drive to the I-cell.
explicitly describe the dependence of T∗ on gIE, since the only quantity on the right-hand side of equation 3.14 depending on gIE is r = (3/2)gIE/IE. We have concluded that

TP ≈ ln(r/q) τI,  (3.15)
if r is sufficiently large. Numerical experiments show that r = 3 is sufficient for this formula to be quite accurate. Figure 3B shows that the graph of q is fairly flat unless IE is small. In a parameter regime in which q is approximately constant, formula 3.15 shows that TP depends strongly (namely, approximately linearly) on τI , but only weakly (namely, logarithmically) on gIE and IE . 4 The Phase Walkthrough Boundary A necessary condition for the PING mechanism to work is that the I-cells spike only when prompted by the E-cells, not independently. This is trivially true if the drive to the I-cells is subthreshold, that is, νI = 0. If νI > 0, then the PING mechanism works only if the frequency of the rhythm is greater than the intrinsic frequency of the I-cells. Figure 4 illustrates, using a network of a single E-cell and a single I-cell, what happens when this condition is not met. (The figure indicates spike times.) We refer to the phenomenon shown in Figure 4 as phase walkthrough of the I-cells. Figure 5 shows, in various different ways, that phase walkthrough occurs more easily when IE is smaller or gIE is larger, in other words, when the PING frequency is lower. The figure also shows that I→I synapses protect against phase walkthrough. We will now discuss the details of Figure 5.
Figure 5: Phase walkthrough boundary for τI = 10 (A) in (gIE , νE , νI )-space, (B) in (gIE , IE , II )-space for gII = 0, (C) in the (IE , II )-plane, for gII = 0 and gIE = 0.2, (D) in (gIE , IE , II )-space, for gII = gIE .
4.1 The Phase Walkthrough Condition. In the simple two-cell network of section 3, a necessary and sufficient condition for phase walkthrough of the I-cells to be avoided is

νI ≤ νP.  (4.1)

The equation

νI = νP  (4.2)
(or equivalently TI = TP) defines a hypersurface in parameter space that we call the phase walkthrough boundary. As discussed in section 3, νP is a function of IE, gIE, and τI. Since νE = (1000/π)√IE, we can also think of νP as a function of νE, gIE, and τI. For given νE, gIE, and τI, it is easy to
576
C. Borgers ¨ and N. Kopell
calculate νP numerically with great (ten-digit) accuracy. We first determine the time TP at which θ, governed by equations 3.1 and 3.2, reaches π, and then compute νP = 1000/TP. For τI = 10, Figure 5A shows the phase walkthrough boundary in (gIE, νE, νI)-space. If (gIE, νE, νI) lies below the surface in Figure 5A, a PING rhythm is possible. Above the surface, there is no PING rhythm because of phase walkthrough. Note that slowing the rhythm without reducing drive to the I-cell eventually results in νI > νP, that is, in phase walkthrough.

4.2 I→I Synapses Protect Against Phase Walkthrough. Since νI is a decreasing function of gII, I→I synapses make phase walkthrough of the I-cells less likely to occur. (Recall that by the notational conventions of section 2, the definition of νI takes I→I synapses into account, but νE denotes the frequency of the E-cells in the absence of any synapses.) To make this quantitative, we must assume a specific model for the I-cell. For the remainder of this section, we therefore assume that the I-cell is a theta neuron. We still make assumptions 1 and 2 of section 3, and we also assume that I→I synapses take the idealized form given by equation 2.14. For fixed values of τI and gII, the phase walkthrough boundary can then be thought of as a surface in (gIE, IE, II)-space. For τI = 10 and gII = 0, this surface is plotted in Figure 5B. The surface in Figure 5B is thus identical to that of Figure 5A, except for the choice of coordinates. The parameter νI in Figure 5A is replaced by II in Figure 5B. Equation 2.16 yields the relation between νI and II. (Note that νI = 1000/TI.) Figure 5B shows that in the parameter regimes of interest to us (r > 1, that is, gIE > (2/3)IE), II must be much smaller than IE for PING to be possible. In Figure 5C, we show a section through the surface of Figure 5B. In addition to fixing τI = 10 and gII = 0 as in Figure 5B, gIE = 0.2 is fixed as well.
The curve in Figure 5C is the graph of a function of IE. For later reference, we denote this function by F = F(IE), suppressing in the notation the dependence on τI, gII, and gIE. Thus, the curve shown in Figure 5C is given by

II = F(IE).  (4.3)
Phase walkthrough occurs above the curve, but not below it. The surface in Figure 5B changes dramatically when gII is taken to be equal to gIE. It is easy to see that in this case, the walkthrough boundary is given by

II = IE  (4.4)
(see Figure 5D). Thus, I→I synapses have a stabilizing effect, greatly enlarging the parameter regime in which PING is possible.
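For gII = 0, the function F of equation 4.3 can be traced numerically: compute νP as in section 3, then convert the condition νI ≤ νP into a bound on II using νI = (1000/π)√II (equation 2.16 and the drive-frequency relation of section 2.4). A rough Python sketch (forward Euler, illustrative names, not the authors' code):

```python
import math

def nu_P(I_E, g_IE, tau_I=10.0, dt=1e-3):
    # PING frequency of the two-cell model: integrate equation 3.1 until pi.
    t, th = 0.0, -math.pi
    while th < math.pi:
        g = g_IE * math.exp(-t / tau_I)
        th += dt * (1.0 - math.cos(th)
                    + (I_E - 1.5 * g) * (1.0 + math.cos(th))
                    - g * math.sin(th))
        t += dt
    return 1000.0 / t

def F(I_E, g_IE, tau_I=10.0):
    # Walkthrough boundary (4.3) for g_II = 0: nu_I = (1000/pi)*sqrt(I_I)
    # equals nu_P exactly when I_I = pi**2 * (nu_P / 1000)**2.
    return math.pi ** 2 * (nu_P(I_E, g_IE, tau_I) / 1000.0) ** 2
```

For IE = 0.1 and gIE = 0.2, νP is roughly 40, so F(IE) is roughly π² · 0.04² ≈ 0.016: II must indeed stay far below IE for PING, as Figure 5B shows.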
5 PING in Networks of More Than Two Cells

The key to PING rhythms in networks of more than two cells lies in the fact that a population of uncoupled neurons receiving a single common strong inhibitory input pulse synchronizes. In this section, we give an explanation of this synchronization mechanism and illustrate it with numerical examples. The main conclusion of this section is that rapid and robust synchronization in large networks requires r = (3/2)gIE/IE > 1. In earlier work (Börgers & Kopell, 2003), we discussed the synchronization of a population of theta neurons by an inhibitory current pulse. We present here a very similar discussion referring to an inhibitory synaptic pulse. A single theta neuron receiving an inhibitory synaptic input at t = 0 is described by equations 3.4 and 3.5, with initial conditions θ(0) = θ0, and equation 3.7. Synchronization of a population of uncoupled theta neurons by a single inhibitory synaptic pulse can be understood from the phase portrait in Figure 3A. Recall that the trajectory (θs, Js) indicated as a bold line in Figure 3A, the "stable river," attracts nearby trajectories. Also recall that T∗ is defined by θs(T∗) = π and J∗ by J∗ = Js(T∗). Assume, as in section 3, r > 1, so J(0) < 0. If J(0) is sufficiently negative and θ0 is sufficiently far from π, (θ(t), J(t)) is rapidly attracted to the stable river. At the time when θ = π, we therefore have J ≈ J∗, or equivalently, t ≈ T∗. Thus, the first spike after time zero occurs approximately at time T∗. If θ0 is close to π, then θ(t) quickly passes through π and is then rapidly attracted to (θs(t) + 2π, Js(t)). (Recall that Figure 3A should be thought of as extended periodically in θ with period 2π.) When θ(t) reaches 3π, then J ≈ J∗ and therefore t ≈ T∗. Thus, in that case, a spike occurs soon after time zero, followed by a spike approximately at time T∗.
Only for values of θ0 in a narrow transition band is (θ(t), J(t)) attracted to neither (θs(t), Js(t)) nor (θs(t) + 2π, Js(t)). A population of E-cells is approximately synchronized by an inhibitory pulse with sufficiently large r because T∗ is independent of θ0. Figure 6A shows a simulation for a network as described in section 2, with gEI = 0.05, gIE = 0.20, gII = 0, IE = 0.1 (νE ≈ 101), II = 0.002. (Other parameter values are as specified in section 2.) The common inhibitory input received by all E-cells leads to their rapid synchronization as a result of the mechanism described earlier. This in turn leads to synchronization of the I-cells, which are driven by the E-cells. Note that here,

r = (3/2)gIE/IE = 0.3/0.1 = 3.
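The synchronizing effect of a single common inhibitory pulse is easy to reproduce numerically: integrate equations 3.4 to 3.7 from several different initial phases and compare the first spike times. A Python sketch under the same idealized assumptions (illustrative, not the authors' code):

```python
import math

def first_spike_time(theta0, I_E, g_IE, tau_I=10.0, dt=1e-3, t_max=200.0):
    # Theta neuron with initial phase theta0 receiving the common inhibitory
    # pulse g_IE * exp(-t/tau_I) at t = 0 (equations 3.4-3.7); forward Euler.
    t, th = 0.0, theta0
    while th < math.pi and t < t_max:
        g = g_IE * math.exp(-t / tau_I)
        th += dt * (1.0 - math.cos(th)
                    + (I_E - 1.5 * g) * (1.0 + math.cos(th))
                    - g * math.sin(th))
        t += dt
    return t

# Cells starting well away from theta = pi are all captured by the stable
# river and spike at nearly the same time, approximately T*.
times = [first_spike_time(th0, 0.1, 0.2) for th0 in (-3.0, -2.0, -1.0, 0.0)]
```

With r = 3, these spike times agree to within a fraction of a millisecond; without the pulse (gIE = 0), the same initial phases would spike spread out over a full intrinsic period.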
PING rhythms are possible even for r < 1. For instance, Figure 6B shows results similar to those of Figure 6A, but with gIE = 0.05, corresponding to r = 0.75. A rhythm appears, but only after a number of periods, not
Figure 6: (A) PING with strong inhibitory synapses (gEI = 0.05, gIE = 0.2, gII = 0, IE = 0.1, II = 0.002). (B) PING with weak inhibitory synapses (gIE = 0.05; all other parameter values as in A). (C) As in A, with 5% heterogeneity in IE . (D) As in B, with 5% heterogeneity in IE .
immediately. Thus, when r ≤ 1, PING may still be possible, but it is less attracting and more difficult to analyze. Numerical experiments suggest that PING with r ≤ 1 is also highly sensitive to heterogeneity. As an example, we introduce 5% heterogeneity in IE . That is, we take IE to be a normally distributed random number with mean 0.1 and standard deviation 0.005. Figure 6A turns into Figure 6C—the rhythm is barely affected. However, Figure 6B turns into Figure 6D—the rhythm is destroyed. The effect of heterogeneity in IE is to spread out the spiking of the E-cells. If r is sufficiently large, each population spike of the I-cells approximately erases the memory of the past, bringing the E-cells back together and preventing the effects of heterogeneity from accumulating over time. In our numerical experience, at least r ≈ 3 is required if the rhythm is to survive heterogeneity in network parameter values on the order of 20%.
6 The Suppression Boundary

As seen in section 4, in a two-cell network (or, equivalently, in a larger network in which the E-cells and the I-cells are perfectly synchronized), PING is possible only if the intrinsic frequency νI of the I-cells is sufficiently small. If the I-cells intrinsically spike too rapidly, phase walkthrough occurs. In a network with many cells, there is another way in which rapid spiking of the I-cells can destroy PING: the I-cells can suppress the E-cells altogether. The I-cells are most effective at suppressing the E-cells when they are asynchronous. For linear integrate-and-fire neurons, this is proved in appendix A. Although we have not proved it for theta neurons, numerical evidence suggests that it is true in that case as well. We therefore assume in this section that the I-cells spike in complete asynchrony and ask under which circumstances the I-cells suppress the E-cells. Figure 7 shows, in various different ways, that suppression of the E-cells occurs more easily when IE is smaller or gIE is larger, in other words, when the PING frequency is lower. In this section, we will discuss the four panels of Figure 7 in detail. We will show that the quantity determining how easily suppression of the E-cells occurs is the ratio gIE/IE, which, up to the constant factor 3/2, is r.

6.1 The Suppression Condition. Consider a large population of I-cells, spiking in asynchrony, acting on a single E-cell driven above threshold. We ask under which conditions spiking in the E-cell will be prevented altogether by the inhibition. We model the synaptic gating variable in the idealized way described by equation 2.14. Because the I-cells are assumed to spike in asynchrony, the term s(t) in equation 2.13 is replaced by its time average. (We neglect fluctuations resulting from the finiteness of the network.) Thus, the equation governing the target neuron is

dθ/dt = 1 − cos θ + [IE − (3/2) gIE s](1 + cos θ) − gIE s sin θ,  (6.1)

with s defined in equation 2.19.
The E-cell escapes suppression if and only if

(3/2) gIE s + (gIE s)²/4 < IE.  (6.2)

To see this, replace gII by gIE and II by IE in equation 2.20. Since we assume IE ≪ 1, this is approximately equivalent to

(3/2) gIE s < IE.

Using the definition of r, equation 3.10, this condition becomes

s < 1/r.  (6.4)
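Whether asynchronous I-cells suppress the E-cell can be checked directly from equations 2.16, 2.19, and 6.2. A small Python sketch for the case gII = 0 (illustrative names; the two parameter sets are those of Figures 6A and 8):

```python
import math

def s_avg(T_I, tau_I=10.0):
    # Time-averaged synaptic gate over one I-cell period (equation 2.19).
    return (1.0 - math.exp(-T_I / tau_I)) / (T_I / tau_I)

def e_cell_escapes(I_E, g_IE, I_I, tau_I=10.0):
    # Suppression condition, equation 6.2, with g_II = 0, so that the
    # intrinsic period is T_I = pi / sqrt(I_I) (equation 2.16).
    s = s_avg(math.pi / math.sqrt(I_I), tau_I)
    return 1.5 * g_IE * s + (g_IE * s) ** 2 / 4.0 < I_E

print(e_cell_escapes(0.1, 0.2, 0.002))    # Figure 6A parameters: True
print(e_cell_escapes(0.03, 0.2, 0.0025))  # Figure 8 parameters: False
```

The second parameter set lies in the suppression regime, consistent with the suppressed state of Figure 8A.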
0 if it can be shown for gII = 0. To prove equation 7.2 for gII = 0, we note that in the limit IE → ∞, the I→E synapses become negligible as well, but for a different reason: the terms including the factor IE in the equation of the E-cell dominate all others, in particular, the terms modeling the synapse. As a result, TP ∼ TE as IE → ∞. The phase walkthrough boundary is generally given by TP = TI. In the limit as IE → ∞, this means TE ∼ TI. For gII = 0, this means π/√IE ∼ π/√II, or IE ∼ II. This implies equation 7.2. The dashed curve in Figure 7C, the suppression boundary, is the graph of a function as well. Recall from section 6 that we denote this function by G, so the dashed curve in Figure 7C is given by II = G(IE). We next show

G′(0) = gII/gIE  (7.3)
and

lim_{IE → (3/2)gIE} G(IE) = ∞.  (7.4)
To show equation 7.3, note that as IE → 0, II = G(IE) → 0; therefore, TI → ∞, and

s ∼ τI/TI

by equation 2.19. Thus, the suppression boundary becomes

TI/τI ∼ r  (7.5)

by equation 6.4. Assuming gII > 0, using equation 2.23, this becomes

(3/2)gII/II ∼ (3/2)gIE/IE,

or

II ∼ (gII/gIE) IE,

that is, equation 7.3. If gII = 0, we use equation 2.16 to turn equation 7.5 into

π/(τI √II) ∼ (3/2)gIE/IE
or

II ∼ π² IE² / ((9/4) gIE² τI²).
Thus, equation 7.3 holds for gII = 0 as well. As IE → (3/2)gIE, r → 1. The suppression boundary is given by

(1 − e−TI/τI)/(TI/τI) = 1/r  (7.6)
(see equations 6.4 and 2.19). The left-hand side of equation 7.6 can easily be shown to be a strictly decreasing function of TI /τI that tends to 1 as TI /τI → 0. Thus, as r → 1, TI /τI → 0, so II → ∞. This proves equation 7.4. Equations 7.1 to 7.4, taken together, imply that the suppression and phase walkthrough boundaries in the (IE , II )-plane must intersect (in a point other than the origin) as long as gII < gIE . We present a numerical example illustrating the bistability that our theoretical arguments predict. We use gIE = 0.2, gEI = 0.05, gII = 0 and (II , IE ) = (0.0025, 0.03) , a point inside the region of bistability. We initialize the I-cells asynchronously (see section 2.6). If we initialize all E-cells at θ = −π/2, complete suppression of the E-cells results, as shown in Figure 8A. On the other hand, if we initialize the E-cells at θ = π/2, they start out so close to spiking that they are able to spike before enough inhibition has built up to suppress them. This moves the I-cells away from asynchrony, removing their ability to suppress the E-cells. The result is the PING rhythm shown in Figure 8B. The two simulations producing the two panels of Figure 8 are identical, except for initial conditions. 8 Phase Walkthrough of the I-Cells Resulting from Stochastic Spiking of the E-Cells In this section, we consider the effects of noisy spiking of the E-cells, for instance, as a result of stochastic external input. We show that noisy spiking of the E-cells disrupts PING rhythms by causing phase walkthrough of the I-cells. We give an approximate analysis of the conditions under which noise in the E-cells results in phase walkthrough of the I-cells. Our analysis is not very accurate quantitatively (we will explain why); however, it does reveal that slower PING rhythms are much more vulnerable to noise in the E-cells than faster ones.
Figure 8: Example illustrating bistability. Suppression of the E-cells (A) or a rhythm (B) is possible with the same parameter values (gEI = 0.05, gIE = 0.2, gII = 0, IE = 0.03, II = 0.0025) but different initial conditions.
In our simulations, we introduce random spiking of the E-cells as follows. At randomly selected times ti,1 < ti,2 < · · ·, 1 ≤ i ≤ NE, the ith E-cell is forced to spike; that is, the value of θ associated with it is instantaneously reset to −π, and the synaptic gating variable s associated with it is set to 1. We assume that the time intervals between forced spikes, ti,k+1 − ti,k, 1 ≤ i ≤ NE, k = 1, 2, . . ., are independent of each other and exponentially distributed, with a common expected value TSE. (The subscript S stands for "stochastic.") We define

νSE = 1000/TSE.  (8.1)
This is the average frequency of the random spiking of the E-cells. Figures 9C, 9D, and 9F (discussed in detail shortly) show examples of rhythms that persist in spite of random spiking in the E-cells. When an E-cell spikes out of order, it is temporarily out of synchrony with the bulk of the E-cells. It may or may not participate in the next population spike of the E-cells. However, it is brought back into synchrony with the bulk of the E-cells by the next population spike of the I-cells. If a large fraction of the E-cells spikes out of order, the number of E-cells participating in the population spikes may be reduced significantly. The population spikes of the E-cells may then no longer suffice to trigger population spikes of the I-cells. This effect is neglected in the analysis given
below, as is justified if gEI is sufficiently large and/or if the ratio νSE/νP is sufficiently small. For example, in Figure 9D, a large fraction of all E-cells spike out of order, but gEI is so large that those E-cells that participate in the population spikes still suffice to prompt the I-cells.

8.1 Analysis. The low-frequency random spiking of some of the E-cells between population spikes generates extra excitatory drive to the I-cells. If each I-cell receives input from sufficiently many E-cells, this drive is nearly constant. For two special cases, we will now analyze when the extra drive to the I-cells results in phase walkthrough. We assume here that the I-cells are modeled as theta neurons. However, we find it convenient to use the dependent variable V instead of θ (see section 2.1). In the absence of synaptic input, the equation of an I-cell is then

dV/dt = 2V(V − 1) + QI

(see equation 2.7). The relation between QI and II is II = 2QI − 1 (see equation 2.10). The random spiking of the E-cells approximately adds the term

QSI = gEI ((1/TSE) ∫₀^TSE e−t/τE dt) Vrev = gEI Vrev (1 − e−TSE/τE)/(TSE/τE)  (8.2)

to QI (compare equation 2.12). The average of e−t/τE over [0, TSE] appears in this formula because the stochastic spiking of the E-cells is assumed to be asynchronous. We have also used the approximation Vrev − V ≈ Vrev in equation 8.2; this is reasonable because V is near 0 except during spikes. We are interested in low-frequency random spiking in the E-cells, so TSE ≫ τE = 2, and therefore the term e−TSE/τE in equation 8.2 is negligible:

QSI ≈ gEI τE Vrev/TSE.  (8.3)
QSI is added to QI, so II = 2QI − 1 turns into II + 2QSI. Denoting by fI(I) the frequency of an I-cell receiving drive I, the condition under which phase walkthrough of the I-cells is avoided becomes approximately

fI(II + 2gEI τE Vrev/TSE) ≤ νP  (8.4)

(see equation 4.1). If gII = 0, then fI(I) = (1000/π)√I by equations 2.3 and 2.5. Therefore, inequality 8.4 becomes

(1000/π) √(II + 2gEI τE Vrev/TSE) ≤ νP
or, equivalently,

νSE/1000 ≤ (π²(νP/1000)² − II)/(2gEI τE Vrev).  (8.5)
Inequality 8.5 was derived assuming gII = 0. For gII > 0, the rhythm can withstand more drive to the I-cells, and therefore more random spiking of the E-cells. In the special case gII = gIE, the condition under which phase walkthrough of the I-cells is avoided is

II + 2QSI ≤ IE  (8.6)

(compare equation 4.4). With the approximation 8.3, inequality 8.6 is equivalent to

νSE/1000 ≤ (IE − II)/(2gEI τE Vrev).  (8.7)
To highlight the similarity between this inequality and equation 8.5, we note that IE = π²(νE/1000)² by equations 2.3 and 2.5, so inequality 8.7 can be written as

νSE/1000 ≤ (π²(νE/1000)² − II)/(2gEI τE Vrev).  (8.8)
Note that equation 8.8 is obtained from equation 8.5 if νP is replaced by νE. Both formulas show that slower rhythms (smaller νP or smaller νE) are more vulnerable to noisy activity in the E-cells than faster ones.

8.2 Numerical Examples.

8.2.1 Too Much Noise in the E-Cells Leads to Phase Walkthrough of the I-Cells. Figure 9A shows results of a simulation with

gEI = 0.05, gIE = 0.20, gII = 0, IE = 0.1 (νE ≈ 101), II = 0.  (8.9)
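Inequalities 8.5 and 8.8 are straightforward to evaluate; a short Python check, using τE = 2 and Vrev = 6.5 from section 2.4 (function name is ours):

```python
import math

def max_noise_rate(nu, I_I, g_EI, tau_E=2.0, V_rev=6.5):
    # Largest tolerated forced-spike rate nu_SE: inequality 8.5 with
    # nu = nu_P (for g_II = 0), or inequality 8.8 with nu = nu_E
    # (for g_II = g_IE).
    return 1000.0 * (math.pi ** 2 * (nu / 1000.0) ** 2 - I_I) / (2.0 * g_EI * tau_E * V_rev)

print(max_noise_rate(40.0, 0.0, 0.05))    # ≈ 12.15 (Figure 9A parameters, nu_P ≈ 40)
print(max_noise_rate(101.0, 0.0, 0.05))   # ≈ 77 (g_II = g_IE case of section 8.2.2)
```

These two numbers reproduce the bounds νSE ≤ 12.15 and νSE ≤ 77 quoted in sections 8.2.1 and 8.2.2.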
(Other parameter values are as specified in section 2.) A rhythm at frequency νP ≈ 40 is seen. We remark that r = (3/2)gIE /IE = 3 here. Inserting the parameter values 8.9 in inequality 8.5, we find that the rhythm should hold up for νSE ≤ 12.15. This prediction is not very accurate. With νSE = 7, phase walkthrough occurs already, as shown in Figure 9B. (Note that Figure 9B is indeed a noisy analog of the sort of phase walkthrough depicted in Figure 4.) Thus, the breakdown occurs earlier in our simulation than predicted
Figure 9: (A) Gamma frequency PING rhythm (gEI = 0.05, gIE = 0.2, gII = 0, IE = 0.1, II = 0). (B) Low-frequency stochastic spiking in the E-cells leads to phase walkthrough in the I-cells. (C) Phase walkthrough is counteracted by reducing gEI to 0.02. (D) I→I synapses (gII = gIE) make the rhythm astonishingly robust to noise in the E-cells. (E) A slow PING rhythm (IE = 0.003; all other parameter values as in A) is highly sensitive to noise in the E-cells. (F) Noise sensitivity can be greatly reduced, even for the slow rhythm, by setting gII = gIE and reducing gEI to 0.01.
Rhythms in the Presence of Noise
by our theory. To understand this discrepancy, we took a detailed look at the simulation underlying Figure 9B, focusing on the three population spikes of the I-cells that occur out of order, not prompted by the E-cells. Immediately before each of these population spikes, the actual drive to the I-cells rises significantly above that predicted by formula 8.2, as a result of random fluctuations. Our simple theory neglects those fluctuations, assuming instead that QSI is constant. The fluctuations would be less important in significantly larger networks; therefore, we would expect the predictions of inequality 8.5 to be more accurate in larger networks.

For the parameter values of Figure 9A, Figure 11A illustrates the transition from low, inconsequential levels of noise in the E-cells to disruptive levels. The figure shows quantities ρE ∈ [0, 1] and ρI ∈ [0, 1], measuring the regularity of the spiking of the E- and I-cells, plotted as functions of the noise frequency νSE. The precise definitions of ρE and ρI are given in appendix B. The closer ρE and ρI are to 1, the more regular is the rhythm. The figure shows that the regularity of the rhythm is lost abruptly as νSE is raised above 6.

If in fact it is correct that the rhythm in Figure 9B is disrupted essentially as a result of too much drive to the I-cells, then it ought to be possible to restore the rhythm by lowering gEI. Indeed, this is the case: when gEI is lowered to 0.02, the rhythm shown in Figure 9C is obtained.

8.2.2 At Gamma Frequency, I→I Synapses Greatly Reduce Sensitivity to Noise in the E-Cells. Figures 5B and 5D show that in large portions of parameter space, I→I synapses greatly enlarge the amount of drive to the I-cells that PING rhythms can withstand before breaking down as a result of phase walkthrough. To confirm this by simulation, we use the parameter values of Figure 9A, but replace gII = 0 by gII = gIE.
The resulting rhythm is astonishingly insensitive to noisy spiking in the E-cells. Formula 8.8 predicts that the rhythm should survive for νSE ≤ 77. Figure 9D shows a simulation with νSE = 70. The spike time rastergram for the E-cells looks quite noisy, of course, but the rhythm is still visible. (Because of stochastic fluctuations, the rhythm would not, in reality, remain intact if νSE were equal to 77.)

8.2.3 Sensitivity to Noise in the E-Cells Increases as the Rhythm Slows Down. We now present numerical simulations illustrating that slower PING rhythms are more sensitive to noisy spiking of the E-cells, as predicted by Figure 5A and formulas 8.5 and 8.8. Figure 9E shows a slow PING rhythm disrupted by a small amount of stochastic spiking in the E-cells. (Notice that the time window shown in Figures 9E and 9F is four times longer than that shown in Figures 9A–9D.) The parameter values of Figure 9E are those of Figure 9A, except that IE has been reduced from 0.1 to 0.003. The frequency of the resulting PING rhythm is about 10. Formula 8.5 predicts that this rhythm should be abolished by random spiking in the E-cells at an average frequency of about νSE = 0.76. Indeed, phase walkthrough occurs
for νSE = 0.5 already. This is shown in Figure 9E. Thus, phase walkthrough again occurs at a (somewhat) lower value of νSE than predicted by our theory. As before, the discrepancy is due primarily to the stochastic fluctuations in the drive to the I-cells generated by the random spiking of the E-cells. For this case, Figure 11B illustrates the transition from nearly inconsequential levels of noise in the E-cells to disruptive levels. The regularity of the rhythm is lost abruptly as νSE is raised above 0.4.

It is not surprising that the maximum allowable value of νSE decreases as νP decreases. For instance, if the maximum allowable value of νSE is 6 for νP = 40 (see Figure 11A), one might expect it to be four times smaller for νP = 10. What is surprising is that it is not 4 but 16 times smaller (see Figure 11B). This is in agreement with inequality 8.5, as the right-hand side of that inequality, for II = 0, is proportional to (νP/1000)², not to νP/1000.

8.2.4 At Subgamma Frequencies, Careful Parameter Tuning Can Reduce Sensitivity to Noise in the E-Cells. At low frequencies, I→I synapses do not help much. Inequality 8.8 predicts that even with gII = gIE, phase walkthrough of the I-cells will occur as soon as νSE > 2.3. However, the robustness of the rhythm can be enhanced by reducing gEI. Figure 9F shows a simulation in which the parameter values are as in Figure 9E, except gII = gIE, gEI = 0.01, and νSE = 7.

9 Suppression of the E-Cells Resulting from Stochastic Spiking of the I-Cells

In this section, we consider the effects of random spiking of the I-cells, for instance, as a result of stochastic external input. We show that noisy spiking of the I-cells disrupts PING rhythms by causing suppression of the E-cells. We give an approximate analysis of the conditions under which noise in the I-cells results in suppression of the E-cells.
As in section 8, our analysis is not very accurate quantitatively, again because it neglects statistical fluctuations that are significant at least in small networks. It does reveal that the crucial quantity here is the product rτI : the larger rτI , the less noisy spiking in the I-cells can be tolerated. In our simulations, we enforce random spiking of the I-cells in the same way in which we enforced random spiking of the E-cells in section 8. In analogy with section 8, TSI denotes the expected time between two random spikes of a given I-cell, and νSI = 1000/TSI .
(9.1)
Figures 10B and 10F (discussed in detail shortly) show examples of rhythms that persist in spite of random spiking in the I-cells. When an I-cell spikes out of order, it is temporarily out of synchrony with the bulk of the I-cells. However, it is brought back into synchrony when the next
population spike of the E-cells prompts a population spike of the I-cells.

9.1 Analysis. The low-frequency random spiking of some of the I-cells between population spikes generates inhibitory synaptic drive to the E-cells. If each E-cell receives input from sufficiently many I-cells, this drive is nearly constant. If it is strong enough, it leads to suppression of the E-cells, and thereby abolishes the rhythm (see Figure 10D). A necessary and sufficient condition for the E-cells to escape suppression is s
33, approximately. In our simulations, even for νSI = 40, the
Figure 10: (A) Gamma frequency PING rhythm with r = 3 (gEI = 0.05, gIE = 0.2, gII = 0, IE = 0.1, II = 0). (B) Considerable noise in the I-cells can be withstood by this rhythm. (C) Low-frequency PING rhythm with r = 60 (IE = 0.005; all other parameters as in A). (D) This rhythm is highly sensitive to noise in the I-cells. (E) Low-frequency PING rhythm with r = 3 (IE = 0.001, gIE = 0.002; all other parameter values as in A). (F) Considerable noise in the I-cells can be withstood by this rhythm.
E-cells are not suppressed, but the intervals between population spikes of the E-cells are long and quite irregular. Thus, the regularity of the rhythm is disrupted earlier (for smaller νSI) than predicted by our theory, but complete suppression of the E-cells occurs later (for larger νSI) than predicted by our theory.

This discrepancy between our theory and the numerical simulations is not hard to understand. In our theory, statistical fluctuations in the strength of inhibition received by the E-cells are neglected. These fluctuations make the rhythm irregular earlier (for smaller νSI) than predicted: when the inhibition happens to be stronger than average, the E-cells are delayed, and when it happens to be weaker than average, the E-cells spike earlier. However, statistical fluctuations also cause complete suppression of the E-cells to occur later (for larger νSI) than predicted. Even when inhibition is strong enough, on average, to suppress the E-cells, there are time windows when it happens to fall below the threshold required for suppression. Population spikes of the E-cells may occur during those time windows.

For the parameter values of Figure 10A, Figure 11C illustrates the transition from low, inconsequential levels of noise in the I-cells to disruptive levels. The figure shows the regularity measures ρE and ρI (see appendix B) plotted as functions of νSI.

9.2.2 Sensitivity to Noise in the I-Cells Increases as Drive to the E-Cells Weakens. We lower IE to 0.005. The frequency of the PING rhythm decreases to about 12, and r rises to 60. Figure 10C shows the rhythm without any stochastic spiking. (Notice that the time window shown in parts C–F of Figure 10 is four times longer than that shown in parts A and B.) When stochastic spiking of the I-cells at average frequency νSI = 5 is added, the E-cells are suppressed, and the rhythm is thereby abolished (see Figure 10D).
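The ratio r used here is stated earlier in the text as r = (3/2)gIE/IE. A short sketch (Python; nothing assumed beyond that formula) reproduces the values quoted for the simulations of Figures 10A, 10C, and 10E:

```python
def r_ratio(g_IE, I_E):
    """r = (3/2) * g_IE / I_E, as stated in section 8."""
    return 1.5 * g_IE / I_E

print(r_ratio(0.20, 0.1))     # Figure 10A: r = 3
print(r_ratio(0.20, 0.005))   # Figure 10C: r = 60
print(r_ratio(0.002, 0.001))  # Figure 10E: r = 3
```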
For the parameter values of Figure 10C, Figure 11D illustrates the transition from low, inconsequential levels of noise in the I-cells to disruptive levels. Note that the value of IE is 20 times smaller in Figure 11D than in Figure 11C, and the value of r is therefore 20 times larger. According to inequality 9.4, the maximum frequency of noise in the I-cells that the rhythm can withstand should therefore be 20 times smaller in Figure 11D than in Figure 11C. Indeed, this is approximately the case.

9.2.3 For Weakly Driven E-Cells, Careful Parameter Tuning Can Reduce Sensitivity to Noise in the I-Cells. We have mentioned that PING rhythms with low values of IE can be made fairly noise insensitive by lowering gIE proportionally, ensuring that r does not become too large. To illustrate this, we lower IE even further, to 0.001, but also lower gIE to 0.002, bringing the value of r back to 3, as in Figure 10A. For reasons discussed in the next paragraph, we also use a smaller value of gEI here: gEI = 0.005. The frequency of the resulting rhythm is approximately 8 (see Figure 10E). Remarkably, the rhythm in Figure 10E survives stochastic spiking of the I-cells at frequency
Figure 11: Regularity measures ρE and ρI (defined in appendix B) as functions of noise frequency. (A) At gamma frequency (40 Hz), the rhythm can withstand 6 Hz noise in the E-cells. (B) At four times lower frequency (10 Hz), it can withstand no more than 0.4 Hz noise in the E-cells. (C) At gamma frequency (40 Hz), the regularity of the rhythm is lost if the frequency of the noise in the I-cells is greater than 40 Hz. (D) When IE is reduced twenty-fold, the rhythm can withstand no more than 1.5 Hz noise in the I-cells.
5 (see Figure 10F). This is a result of the moderate (not very large) value of r.

In the preceding simulation, we used a reduced value of gEI. In general, for small gIE, the PING rhythm is more rapidly established if gEI is small as well. (Of course, gEI must still be large enough for a population spike of the E-cells to trigger a population spike of the I-cells.) Section 8 suggests the following heuristic explanation of this numerical observation. In the initial phases of the simulation, before the rhythm is established, the population spikes are fuzzy (see, for instance, the beginning of the simulation in Figure 10E). If gIE is small, there is a considerable amount of nearly asynchronous activity of the E-cells between activity peaks. This activity results in nearly tonic drive to the I-cells, which may cause premature spiking of some of the I-cells and may therefore make it more difficult for the rhythm to be established. Lowering gEI reduces this effect.

9.2.4 Effects of I→I Synapses. We have used gII = 0 throughout this section. As pointed out in earlier sections, I→I synapses generally stabilize PING rhythms by slowing disruptive spiking in the I-cells. However, in the experiments presented here, the disruptive spiking in the I-cells is forced, regardless of the value of gII. As a result, if gII were set to gIE, the results presented in this section would remain virtually unchanged. If the random spiking were instead generated by weaker random input pulses, not necessarily always inducing immediate spiking, then I→I synapses would indeed counteract the suppression of the E-cells by the I-cells.

10 Simulations Including Sparse Connectivity and Heterogeneity

In this section, we present some numerical simulations including sparse connectivity and heterogeneity. We sparsen the connectivity as follows. The possible synaptic connections are considered one at a time. Each possible connection is removed with probability 0.5.
If it is retained, its strength is doubled. For instance, the strength of an I→E connection is 0 with probability 0.5, and 2gIE/NI with probability 0.5. This is a moderate degree of sparseness. However, in earlier work (Börgers & Kopell, 2003), we showed that the effect of sparseness on the coherence of a PING rhythm is determined not by the fraction p of connections retained, but by pNE/(1 − p) and pNI/(1 − p). For instance, in a 10 times larger network (i.e., in a network of 1600 E-cells and 400 I-cells), we would get approximately the same effect if we removed connections with probability 10/11 and strengthened the retained connections eleven-fold. (Note that p/(1 − p) = 1/10 when p = 1/11.)

In addition to sparseness, we introduce 20% heterogeneity in synaptic strengths. For instance, the strength of a given retained I→E synapse is not precisely 2gIE/NI, but rather a gaussian random number with mean 2gIE/NI and standard deviation 0.4gIE/NI. (This random number is negative with
very small but positive probability. If our program draws a negative strength for one of the synapses, the strength of that synapse is reset to zero.) Thus, gIE, gEI, and gII are no longer the total synaptic strengths, but they are (very close to) the expected total synaptic strengths.

We also introduce 20% heterogeneity in the external drive to the E-cells. That is, the external drive received by a given E-cell is not precisely IE, but rather a gaussian random number with mean IE and standard deviation 0.20IE. Thus, IE is no longer the external drive to each E-cell, but the expected external drive to each E-cell. Similarly, we introduce 20% heterogeneity in the external drive to the I-cells. Finally, all cells are forced to spike at random times at average frequencies νSE (for the E-cells) and νSI (for the I-cells).

At gamma frequency, PING is robust to heterogeneity, random connectivity, and noisy external drives. Figure 12A shows the results of a simulation with gIE = 0.2, gEI = 0.05, gII = 0.2, IE = 0.1 (νE ≈ 101), II = 0.05, νSE = 5, νSI = 5. Figure 12B shows, for the same simulation, quantities sE = sE(t) and sI = sI(t) measuring the average strength of excitatory and inhibitory synapses (see appendix B for precise definitions). The rhythm is clearly detectable in Figures 12A and 12B. Its frequency is approximately 40.

Even with heterogeneity and random connectivity, it is possible to obtain PING rhythms at subgamma frequencies, but careful parameter tuning is required. As discussed in section 8.2, one should lower gEI to avoid phase walkthrough caused by the drive to the I-cells resulting from noisy activity of the E-cells. However, one cannot lower gEI too much, since the I-cells must respond promptly to population spikes of the E-cells. As discussed in section 9, one should lower both IE and gIE proportionally, keeping r constant. This avoids making the rhythm vulnerable to suppression of the E-cells by noisy spiking of the I-cells.
The parameter values gIE = 0.004, gEI = 0.004, gII = 0.004, IE = 0.001 (νE ≈ 10), II = 0.0005, νSE = 2, νSI = 2 yield a PING rhythm at frequency νP ≈ 8 in the presence of sparse connectivity, heterogeneity, and noise. Sparseness and heterogeneity are introduced as described earlier. Figure 12C shows the spike times, and Figure 12D shows sE and sI for the same simulation. (Notice that the time window shown in Figures 12C and 12D is five times longer than that shown in Figures 12A and 12B.)
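The sparsening and heterogeneity procedure described above can be sketched as follows (Python with NumPy; the removal probability, the doubling of retained strengths, the 20% gaussian heterogeneity, and the reset of negative draws all follow the text, while the vectorized implementation and the network sizes NE = 160, NI = 40 are our own assumptions):

```python
import numpy as np

def sparse_heterogeneous_weights(g_total, n_pre, n_post, p_keep=0.5,
                                 het=0.2, seed=0):
    """Weight matrix built as described in the text: each possible connection
    is removed with probability 1 - p_keep; retained strengths are scaled by
    1/p_keep (so the expected total input per cell stays close to g_total),
    carry gaussian heterogeneity with standard deviation het * mean, and
    negative draws are reset to zero."""
    rng = np.random.default_rng(seed)
    mean = g_total / (p_keep * n_pre)    # e.g., 2 g_IE / N_I for p_keep = 0.5
    W = rng.normal(mean, het * mean, size=(n_post, n_pre))
    W[rng.random((n_post, n_pre)) >= p_keep] = 0.0   # sparsening
    np.maximum(W, 0.0, out=W)                        # no negative strengths
    return W

# I -> E connectivity for an assumed network of 160 E-cells and 40 I-cells:
W = sparse_heterogeneous_weights(g_total=0.2, n_pre=40, n_post=160)
print(W.shape)  # (160, 40)
```

The expected total inhibitory input per E-cell remains (very close to) gIE = 0.2, as the text notes.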
Figure 12: Sparse connectivity, heterogeneity, and noise do not necessarily abolish the PING rhythm, either at gamma frequency or, with carefully tuned parameter values, at lower frequencies. (A) Gamma frequency PING rhythm (gEI = 0.05, gIE = 0.2, gII = 0.2, IE = 0.1, II = 0.05): spike times. (B) sE and sI, as defined in appendix B, for the same simulation. (C) Low-frequency PING rhythm (gEI = 0.004, gIE = 0.004, gII = 0.004, IE = 0.001, II = 0.0005): spike times. (D) sE and sI for the same simulation.
11 The Frequency Range in Which PING Is Robust

A major theme of this letter has been that PING rhythms at low frequencies, significantly below the gamma range, are easily abolished by noise, unless the parameter values are tuned very carefully. In fact, there is also a sense in which PING rhythms at high frequencies, above the gamma range, are not robust. We remarked earlier that one typically needs r ≥ 3 for synchronization in the presence of 20% heterogeneity in network parameter values. Combining
this with equation 3.15, and recalling q ≤ 1, we find TP ≥ (ln 3)τI ≈ 1.1τI. If τI = 10, then TP ≥ 11, which means, approximately, νP ≤ 91. Thus, in the presence of significant heterogeneity, PING rhythms cannot have a frequency above the gamma range. This argument is, of course, far from precise. In particular, nothing is really special about the value r = 3. Nevertheless, we do believe that the argument is correct in essence. In more intuitive language, when one attempts to drive the frequency of a PING rhythm above the gamma range by raising the drive to the E-cells, one must also raise the strength of the I→E synapses in order to maintain synchronization. The rise in the strength of the I→E synapses brings the frequency back into the gamma range.

12 Summary and Discussion

We have described and analyzed two distinct ways in which too much external drive to the I-cells can disrupt PING rhythms: the I-cells may synchronize, but get ahead of the E-cells (phase walkthrough of the I-cells), or the I-cells may not synchronize, and their activity may keep the E-cells from spiking altogether (suppression of the E-cells).

Our analysis of the effects of deterministic drive to the I-cells casts light on the effects of stochastic drive to the E- or I-cells. If there is too much noisy spiking activity in the E-cells, the resulting rise in excitatory drive to the I-cells may lead to phase walkthrough; too much noisy spiking in the I-cells may result in suppression of the E-cells. I→I synapses reduce the spiking frequency of the I-cells, counteracting phase walkthrough of the I-cells, suppression of the E-cells, and the effects of noisy external drive on both I- and E-cells, thereby enhancing the robustness of the rhythm.

Our analysis also shows why PING rhythms are most robust when their frequency lies in the gamma range. Above the gamma range, synchronization breaks down easily when the E-cell population is heterogeneous.
As the frequency is lowered, on the other hand, PING becomes increasingly sensitive to noisy spiking of the E-cells, and, unless the strength of the inhibitory synapses and the drive to the E-cells are calibrated carefully, more sensitive to noisy spiking of the I-cells as well.

Gutkin and Ermentrout (1998) showed that noise sensitivity of type I model neurons is greater at lower frequencies than at higher frequencies (see in particular Figure 6B of their article). Our point about noise sensitivity and frequency may seem similar at first but is in fact quite different. Gutkin and Ermentrout showed for a single theta neuron that noisy drive has a more severe effect when the intrinsic frequency of the neuron is low than when it is high. We have shown for an E/I network that the tonic component of the drive to the I-cells created by noisy spiking in the E-cells has a more severe effect when the population frequency is low than when it is high.
We conclude with a brief discussion of the relation of our work to previous work on synchronization in networks of model neurons. Much of this work has addressed purely excitatory networks (e.g., Peskin, 1975; Mirollo & Strogatz, 1990; Somers & Kopell, 1993; Hansel, Mato, & Meunier, 1995; Crook, Ermentrout, & Bower, 1998; Bose, Kopell, & Terman, 2000; van Vreeswijk & Hansel, 2001; Acker, Kopell, & White, 2003), purely inhibitory networks (e.g., Wang & Rinzel, 1992; Golomb & Rinzel, 1993; Skinner, Kopell, & Marder, 1994; Chow, 1998; Chow, White, Ritt, & Kopell, 1998; Terman, Kopell, & Bose, 1998; White, Chow, Ritt, Soto-Treviño, & Kopell, 1998; Brunel & Hakim, 1999), or pairs of neurons that are either both excitatory or both inhibitory (e.g., van Vreeswijk, Abbott, & Ermentrout, 1994; Kopell & Ermentrout, 2002). Synchronization in networks including both excitatory and inhibitory model neurons has been studied widely as well (e.g., Wang, Golomb, & Rinzel, 1995; Ermentrout & Kopell, 1998; Brunel, 2000a, 2000b; Whittington et al., 2000; Tiesinga et al., 2001; Hansel & Mato, 2001; van Vreeswijk & Hansel, 2001; Börgers & Kopell, 2003; Hansel & Mato, 2003). For more complete references, see Kopell and Ermentrout (2002) or Hansel and Mato (2003).

Much of the work on (a)synchrony in neuronal networks has focused on states of asynchronous spiking and ways in which such states can lose stability (Abbott & van Vreeswijk, 1993; Gerstner & van Hemmen, 1993; Hansel et al., 1995; Gerstner, 2000; Neltner, Hansel, Mato, & Meunier, 2000; van Vreeswijk, 2000; Hansel & Mato, 2003). In this letter, we have taken a complementary point of view, focusing on states of nearly synchronous spiking, asking for which parameter values such states exist and how sensitive they are to changes in parameter values and noise.

Hansel and Mato (2003) gave a comprehensive analysis of the bifurcations by which asynchronous states can lose stability in networks of excitatory and inhibitory neurons.
They identified four codimension 1 bifurcations, of which three lead to oscillatory behavior. One of these (the one associated with crossing the curve L4 in Figure 11 of Hansel and Mato, 2003) corresponds to the emergence of PING, and another (associated with the curve L2 in the same figure) to the emergence of ING. In drawing their phase diagrams, Hansel and Mato varied the strengths of synaptic conductances while adjusting external drives to keep the average firing rates of the E- and I-cell populations constant. By contrast, we have treated synaptic strengths and external drives as independent parameters. This difference in point of view complicates direct comparisons between their results and ours. In particular, the lower panel of Figure 11 of Hansel and Mato shows that in a certain sense, I→I synapses counteract PING: for stronger I→I synapses, stronger coupling between the E- and I-cells is needed for the transition from asynchrony to PING (crossing the L4 boundary). At first sight, this appears to contradict our conclusion that I→I synapses promote PING by protecting against phase walkthrough of the I-cells and suppression of the E-cells. However, in fact, there is no contradiction, since different points in the phase plane depicted by Hansel and Mato correspond to different external drives.

The frequency of PING oscillations was analyzed in a recent paper by Brunel and Wang (2003). We note that their study was concerned with oscillations driven by noisy input, whereas in this article, we have focused on oscillations driven by deterministic input (and, in some cases, disrupted by a noisy input component). However, not all discrepancies between their results and ours can plausibly be explained by this difference. For instance, equation 18 of Brunel and Wang (2003) determines the population frequency of the oscillation for E/I networks without E→E and I→I synapses. According to this equation, the frequency depends on synaptic delay, rise, and decay times. In our simulations, there are no synaptic delays, and synaptic rise times are very short. Furthermore, numerical results (not shown here) indicate that shortening the synaptic rise times much further has little impact on the results of our simulations, and the same holds even when the rhythm is driven by noise instead of deterministic tonic drive. This motivates consideration of the limit of equation 18 of Brunel and Wang (2003) as the synaptic delay and rise times tend to zero. Doing this, one finds the prediction that the population oscillation frequency should tend to infinity, in stark discrepancy with our numerical results. The reason for this and other discrepancies between the results of Brunel and Wang and ours remains to be determined.

An important difference between the study of Brunel and Wang and ours lies in the choice of membrane time constants for the excitatory neurons. Brunel and Wang used linear integrate-and-fire neurons with time constants equal to 10 ms for inhibitory neurons and 20 ms for excitatory ones.
Significantly shorter membrane time constants are believed to be appropriate for neocortical pyramidal neurons in vivo during high ongoing activity (Destexhe & Paré, 1999; Destexhe, Rudolph, & Paré, 2003). For theta neurons, as for real cortical neurons, the membrane time "constant" is not actually a constant, but depends on external input. In our simulation of gamma rhythms, the E-cells typically have rather short membrane time constants (on the order of a few milliseconds) shortly after receiving input from the I-cells; this is what makes the stable river in Figure 3A strongly attracting and allows for synchronization of the E-cells within one gamma cycle.

Appendix A: The Impact of Synchrony and Asynchrony on Downstream Effects

In computing the suppression boundary, we assumed that the activity of the I-cells was completely asynchronous. The rationale for this assumption was that asynchrony maximizes the downstream effect of an assembly of inhibitory neurons. In this appendix, we prove a precise statement to this effect. We also prove a precise statement showing that synchrony maximizes
the downstream effect of an assembly of excitatory neurons. In contrast with the main portion of this article, we use the linear integrate-and-fire model here. We have not so far generalized these results to the theta model.

We begin with a precise version of the following claim: if N periodic current inputs succeed in making an integrate-and-fire neuron spike, then the same inputs would also succeed in making the neuron spike if they were synchronized.

Theorem 1. Let ϕ ≥ 0 be a periodic function with period T, and let t1, ..., tN ∈ [0, T). Let τ > 0, and consider the initial value problems

dV/dt = −V/τ + Σ_{i=1}^{N} ϕ(t − ti),   V(0) = 0,

and

dV̂/dt = −V̂/τ + Nϕ(t),   V̂(0) = 0.

Then

sup_{t≥0} V̂(t) ≥ sup_{t≥0} V(t).

Proof. Denote by V(V0; t1, ..., tN; t) the solution of

dV/dt = −V/τ + Σ_{i=1}^{N} ϕ(t − ti),   V(0) = V0.

In general,

V(V0; t1, ..., tN; t) = ∫_0^t Σ_{i=1}^{N} ϕ(s − ti) e^{(s−t)/τ} ds + V0 e^{−t/τ}.   (A.1)

To find solutions with period T, we define V0,per > 0 to be the solution of

∫_0^T Σ_{i=1}^{N} ϕ(s − ti) e^{(s−T)/τ} ds + V0,per e^{−T/τ} = V0,per.

We write Vper(t1, ..., tN; t) = V(V0,per; t1, ..., tN; t). From equation A.1, V(0; t1, ..., tN; t) < V(V0,per; t1, ..., tN; t) for all t, but

lim_{t→∞} [V(0; t1, ..., tN; t) − V(V0,per; t1, ..., tN; t)] = 0.

Therefore,

sup_{t≥0} V(0; t1, ..., tN; t) = sup_{t≥0} Vper(t1, ..., tN; t).

We denote by V̂per the uniquely determined solution with period T of

dV̂per/dt = −V̂per/τ + ϕ(t).

Then

Vper(t1, ..., tN; t) = Σ_{i=1}^{N} V̂per(t − ti).

Therefore,

sup_{t≥0} Vper(t1, ..., tN; t) ≤ N sup_{t≥0} V̂per(t),

with "=" if ti = 0 for all i. Therefore,

sup_{t≥0} V(0; t1, ..., tN; t) = sup_{t≥0} Vper(t1, ..., tN; t) ≤ N sup_{t≥0} V̂per(t) = sup_{t≥0} Vper(0, ..., 0; t) = sup_{t≥0} V(0; 0, ..., 0; t).

This proves the assertion.
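Theorem 1 can be checked numerically by integrating both initial value problems for a concrete choice of ϕ. The sketch below (Python; the 1-ms square pulse ϕ, the offsets ti, and the forward-Euler scheme are our illustrative choices, not from the text) confirms that synchronizing the inputs can only raise the supremum of V:

```python
def sup_V(offsets, tau=10.0, T=20.0, t_end=400.0, dt=0.01):
    """Forward-Euler integration of dV/dt = -V/tau + sum_i phi(t - t_i),
    V(0) = 0, where phi is a unit-height 1-ms square pulse repeated with
    period T.  Returns sup_{0 <= t <= t_end} V(t)."""
    V, sup = 0.0, 0.0
    for k in range(int(t_end / dt)):
        t = k * dt
        drive = sum(1.0 if (t - ti) % T < 1.0 else 0.0 for ti in offsets)
        V += dt * (-V / tau + drive)
        sup = max(sup, V)
    return sup

sup_async = sup_V([0.0, 7.0, 13.0])   # three staggered periodic inputs
sup_sync = sup_V([0.0, 0.0, 0.0])     # the same inputs, synchronized
assert sup_sync >= sup_async
```

With these parameters the synchronized peak is substantially higher, consistent with the theorem.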
We next show that asynchrony minimizes the downstream effect of an assembly of excitatory neurons. More precisely, we will show that if a given asynchronous excitatory synaptic input succeeds in making an integrate-and-fire neuron spike, then the same input, delivered phasically with a period T, will also succeed in making the neuron spike.

Theorem 2. Let I ∈ ℝ, τ > 0, and Vrev ∈ ℝ (think of Vrev as the reversal potential of an excitatory synapse), and let g = g(t) ≥ 0 be a function with period T. Define V = V(t) by

dV/dt = −V/τ + I + g(t)(Vrev − V),   V(0) = 0.

Let

ḡ = (1/T) ∫_0^T g(t) dt,

and define V̄ = V̄(t) by

dV̄/dt = −V̄/τ + I + ḡ(Vrev − V̄),   V̄(0) = 0.

Then

sup_{t≥0} V̄(t) ≤ sup_{t≥0} V(t).

Proof. If sup_{t≥0} V(t) = ∞, then we have nothing to prove, so assume sup_{t≥0} V(t) < ∞. Since g ≥ 0,

dV/dt = −V/τ + I + g(t)(Vrev − V) ≥ −(sup_{t≥0} V(t))/τ + I + g(t)(Vrev − sup_{t≥0} V(t)).   (A.2)

The right-hand side of this inequality is a periodic function of t and a lower bound on dV/dt. The assumption sup_{t≥0} V(t) < ∞ then implies that the average of the right-hand side of inequality A.2 is ≤ 0:

−(sup_{t≥0} V(t))/τ + I + ḡ(Vrev − sup_{t≥0} V(t)) ≤ 0.
This means that dV̄/dt would be ≤ 0 if V̄ ever reached sup_{t≥0} V(t). So V̄ cannot exceed sup_{t≥0} V(t). This proves the assertion.

Thinking of Vrev as the reversal potential of an inhibitory synapse, one can reinterpret the statement just proved as follows: if a given inhibitory synaptic input suppresses an integrate-and-fire neuron, then the same input, delivered asynchronously, will also suppress the neuron. That is, asynchrony maximizes the downstream effect of an assembly of inhibitory neurons.

Appendix B: A Measure of Regular Rhythmicity

We associate synaptic gating variables with the presynaptic neurons (see section 2.2). Let sE,i denote the synaptic gating variable associated with the ith E-cell, 1 ≤ i ≤ NE, and let sE(t) be the average of sE,i over i ∈ {1, 2, ..., NE} and over the time interval [t − 5, t + 5]. We approximate the time average using the trapezoid method. The time averaging is needed in some of our noisier simulations (for instance, those of Figure 12) to eliminate small random fluctuations that would make automatic detection of the underlying rhythm difficult. We define

sE,min = min_t sE(t),   sE,max = max_t sE(t),   and   sE,av = (sE,min + sE,max)/2.

The minimum and maximum are taken over the second half of the simulation time interval to avoid the effects of initial transients. We consider those times t in the second half of the simulation time interval at which sE changes from values below sE,av to values above sE,av. We denote these times by tE,1 < tE,2 < ... < tE,nE. Our measure of (regular) rhythmicity is

ρE = min_i (tE,i+1 − tE,i) / max_i (tE,i+1 − tE,i)   if nE ≥ 3,   and   ρE = 0   otherwise.

Here the minimum and maximum are taken over i ∈ {1, 2, ..., nE − 1}. Clearly ρE ∈ [0, 1]; the closer ρE is to 1, the more regular is the rhythm. To visualize and measure rhythmicity of the I-cells, sI(t) and ρI are defined analogously.

Acknowledgments

We are grateful to Steve Epstein for reading the manuscript and providing helpful criticism. We also thank the referees for their detailed and thoughtful comments, which led to several improvements in the article. C.B. is
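The construction of ρ can be sketched as follows (Python with NumPy; the ±5 ms smoothing, the restriction to the second half of the simulation, and the threshold-crossing rule follow the definitions above, with simple rectangular averaging in place of the trapezoid rule; the sinusoidal test signal is our own):

```python
import numpy as np

def regularity(s, t):
    """rho as defined in appendix B: smooth s over [t-5, t+5], restrict to
    the second half of the simulation, find upward crossings of the midpoint
    (s_min + s_max)/2, and return min gap / max gap (0 if fewer than 3
    crossings are found)."""
    dt = t[1] - t[0]
    w = max(1, int(round(10.0 / dt)) + 1)        # +/- 5 ms averaging window
    s = np.convolve(s, np.ones(w) / w, mode="same")
    half = len(t) // 2                           # skip initial transients
    s2, t2 = s[half:], t[half:]
    s_av = 0.5 * (s2.min() + s2.max())
    up = np.where((s2[:-1] <= s_av) & (s2[1:] > s_av))[0]   # upward crossings
    if len(up) < 3:
        return 0.0
    gaps = np.diff(t2[up])
    return gaps.min() / gaps.max()

t = np.linspace(0.0, 300.0, 3001)                # 0.1-ms resolution
print(regularity(np.sin(2 * np.pi * t / 25.0), t))   # 25-ms period: close to 1
print(regularity(np.ones_like(t), t))                # no crossings: 0.0
```

A perfectly periodic signal yields ρ near 1, and a signal with no rhythm at all yields 0, matching the interpretation given in the text.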
Rhythms in the Presence of Noise
605
supported by NSF grant DMS-0418832. N.K. is supported by NSF grant DMS-9706694 and NIH grant MH47150.
C. Börgers and N. Kopell
Received February 17, 2004; accepted July 27, 2004.
LETTER
Communicated by Sebastian Seung
Supervised Learning Through Neuronal Response Modulation

Christian D. Swinehart
cds@brandeis.edu

L. F. Abbott
abbott@brandeis.edu

Volen Center and Department of Biology, Brandeis University, Waltham, MA 02454, U.S.A.
Neural networks that are trained to perform specific tasks must be developed through a supervised learning procedure. This normally takes the form of direct supervision of synaptic plasticity. We explore the idea that supervision takes place instead through the modulation of neuronal excitability. Such supervision can be done using conventional synaptic feedback pathways rather than requiring the hypothetical actions of unknown modulatory agents. During task learning, supervised response modulation guides Hebbian synaptic plasticity indirectly by establishing appropriate patterns of correlated network activity. This results in robust learning of function approximation tasks even when multiple output units representing different functions share large amounts of common input. Reward-based supervision is also studied, and a number of potential advantages of neuronal response modulation are identified.

1 Introduction

Correlation-based, Hebbian mechanisms of synaptic plasticity have been used with considerable success to explain the spontaneous development of selectivity and sensory maps in neural circuits (see Miller, 1996). However, when such plasticity mechanisms are applied to the development of networks that perform specific functions, rather than simply represent input data, a problem arises. To guide correlation-based synaptic plasticity, the activity of a naive neural circuit must be correlated in a manner similar to that of the final, functioning circuit. But such correlations usually arise only after the synapses of the circuit have been appropriately adjusted. The consequence of this is a chicken-and-egg problem: Which comes first, correlations or synaptic modifications? The traditional answer to this question is that synaptic modifications come first, guided by a supervisor. The supervisor is a hypothetical neural circuit that assesses network performance, computes an error signal, and uses it to direct synaptic plasticity within the network.
Neural Computation 17, 609–631 (2005)
© 2005 Massachusetts Institute of Technology

Such schemes work
extremely well for many tasks (Widrow & Stearns, 1985; Chauvin & Rumelhart, 1995; Hertz, Krogh, & Palmer, 1991; Dayan & Abbott, 2001), making them attractive models for learning in biological systems. However, for biological applications, it is important to identify the pathways through which the supervisory circuit controls synaptic plasticity. In some cases, such as climbing fiber input to cerebellar Purkinje cells, such a mechanism appears to be in place. In other systems, such as cerebral cortex, an appeal must be made to some form of modulatory (perhaps dopaminergic; see Schultz, Dayan, & Montague, 1997) control of synaptic plasticity that is largely conjectural. Furthermore, modulatory pathways tend to be slow and nonlocal, making them poorly suited for the rapid, precise control of synaptic plasticity needed during task learning. These considerations lead us to explore the possibility that supervision of synaptic plasticity takes place indirectly rather than directly. The scheme we study corresponds to correlations coming first. In other words, the supervisor modulates neuronal excitability in order to introduce correlations into network activity. These correlations then generate the synaptic plasticity needed to learn a task through Hebbian synaptic modifications that are not themselves subject to direct supervision. We are interested in supervision through response modulation because it is easy to see how this scheme could be realized in cortical circuitry. The response modulations that we consider can be generated through standard excitatory and inhibitory synaptic input. Therefore, in this scheme, the supervisory circuit can guide learning through the feedback projection pathways that are characteristic of cortical circuitry, and no appeal must be made to as-yet-undiscovered forms of modulation. It is important to realize that we are not proposing this scheme as an algorithmic improvement.
Indeed, such indirect supervision of synaptic plasticity has disadvantages, and an important element of our study is to determine how detrimental these are. In summary, we consider a network in which synaptic plasticity is purely Hebbian, a form typically used in unsupervised learning applications. We ask whether it is possible to implement supervised learning in such a network solely by communicating error signals to the network along conventional excitatory and inhibitory feedback pathways that modulate neuronal responsiveness but do not directly affect synaptic plasticity. Such a scheme is not optimal, so its virtues are not efficiency or elegance. Rather, we take this minimalist approach so that we can determine whether these well-established elements of cortical circuitry provide a sufficient basis for implementing supervised learning.

2 Response Modulation and Synaptic Plasticity

Neural networks used for supervised learning consist of units with nonlinear response functions connected together through interactions characterized by synaptic weights. The response ri of network unit i to an input Ii is
typically determined by a sigmoidal function,

ri = 1 / (1 + exp(−gi (Ii − si))) .   (2.1)
In biophysical terms, this can be thought of as the normalized firing rate generated by an input current Ii. The parameter si, which we call the shift, controls the value of Ii at which ri reaches one-half its maximal value, while gi, which we call the gain, determines the slope of the firing rate versus input curve at this point. Input currents are typically computed by multiplying presynaptic responses by synaptic weight factors and summing over all inputs. The role of the supervisor is to compute an error by comparing actual and desired network output and to use this error to direct the modification of network parameters such that network performance improves. Conventionally, the major targets of this process are the synaptic weights. For example, weights can be modified to produce a stochastic gradient descent of the error function. We deviate from this procedure by employing a standard Hebbian synaptic modification rule that is not directly affected by the supervisor. At each stimulus presentation, the synaptic weight wai connecting unit i, with response ri, to unit a, with response Ra, is augmented by a term proportional to the product of the pre- and postsynaptic activities,

wai → wai + εw Ra ri ,   (2.2)

where the parameter εw controls the learning rate. In addition, to prevent the runaway excitation that results from this positive feedback rule, divisive normalization is included. This consists of dividing all the weights by factors that maintain the sums (for all a)

∑_{i=0}^{N} wai = α   (2.3)
at a constant value α. The important point here is that neither of the above rules, 2.2 or 2.3, involves the error function or any other form of supervisory signal. All the supervision in our network takes place at the level of the gain and shift parameters governing the input-output function of equation 2.1. It is not unusual for supervised learning schemes to modify such parameters, particularly shift parameters. Furthermore, in our scheme, supervision of shift and gain parameters takes place through the same type of stochastic gradient-descent procedure used in conventional supervised learning algorithms. The novel element in our approach is that these parameters are the only targets of supervised modification. The reason that we restrict supervision to the shift and gain parameters of neuronal response functions
is that, unlike the supervision of synaptic plasticity, such supervision can be accomplished by ordinary, fast excitatory and inhibitory synapses from neurons of the supervisory circuit onto neurons of the function-approximation network. Changes in the shift variable si correspond to having the supervisor provide either net excitatory or net inhibitory input to network neuron i. It has been shown that balanced, parallel modulations of excitatory and inhibitory input can modify the gain of a postsynaptic neuron (Doiron, Longtin, Berman, & Maler, 2001; Chance, Abbott, & Reyes, 2002; Prescott & De Koninck, 2003), and on the basis of this result we argue that the supervisor can also control and modify the gain variable gi. In summary, the supervisory circuit in our model can modify both the shift and the gain variables for each of the neurons in the network (though in our examples, only the input neurons are modulated, and modulating the output units alone is not effective) through normal excitatory and inhibitory synaptic pathways. To reiterate what was said in section 1, our goal is not to introduce a new algorithm, but rather to see if existing algorithms can still operate when supervision is restricted to well-established cortical pathways.

3 Function Approximation

We apply the proposed mechanism of supervised learning to function approximation, a well-studied task in the artificial neural network literature with obvious applications to biological systems (Poggio, 1990). In this task, network neurons are driven by a stimulus characterized by a single variable θ. The goal of learning is to produce a network output that matches a specified function or set of functions of θ. This is a very easy task for neural network learning that can be accomplished with a single layer of synapses modified, for example, by a delta learning rule (Widrow & Hoff, 1960).
We consider this task because it allows us to illustrate clearly the features and limitations of the scheme we are studying. Specifically, we consider a two-layer feedforward network architecture with purely excitatory connections, as shown in Figure 1. The network consists of N input units, responding to a stimulus variable θ (which takes values in the range from 0 to 2π), that drive M output units. The input units of the network, indicated by the lower row of circles in Figure 1, are driven by currents that are gaussian functions of the difference between θ and a preferred stimulus value, which is different for each input unit. Specifically, the input to unit i, Ii, is given by

Ii = G(θ − θi) + G(θ − θi − 2π) + G(θ − θi + 2π) ,   (3.1)

where

G(θ) = 1.5 exp(−θ²/2) − 0.5 .   (3.2)
Figure 1: The function approximation network. Input units (lower row of circles) receive input tuned to the value of a stimulus variable θ, as indicated by the gaussian curves. The input units drive output units, shown at the top of the figure, through synaptic connections that are subject to Hebbian plasticity. A supervisor modifies the response properties of the input units through feedback projections. The task is to induce the firing rates of the output units to match specified functions of the stimulus variable.
The three terms appearing in equation 3.1 impose an approximate periodicity on the network, which is convenient (though not essential) because it removes edge effects. The values of the preferred stimulus parameters, θi for i = 1, 2, . . . , N, are uniformly distributed over the range from 0 to 2π. The response of input unit i to stimulus θ, ri(θ), is given in terms of the input Ii by equation 2.1. The output of the network consists of the firing rates of the units appearing at the top of Figure 1. These are determined by the same firing-rate function as in equation 2.1, but their inputs are given by a weighted sum of the firing rates of the input units. Specifically, using Ra(θ) to denote the response of output unit a (for a = 1, 2, . . . , M) to stimulus θ,

Ra(θ) = 1 / (1 + exp(−ga ((1/N) ∑_{i=1}^{N} wai ri(θ) − sa))) .   (3.3)
Here, wai is the weight of the connection from input unit i to output unit a. When we consider networks with a single output unit, we drop the output index and denote the weight from input i simply as wi . The goal of learning for this network is to match the outputs Ra (θ), as closely as possible, to a set of stimulus-dependent target functions Fa (θ).
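The model assembled so far, equations 2.1 through 2.3 and 3.1 through 3.3, can be collected into a short NumPy sketch (the function names and the small network size in the test below are our own, not the paper's):

```python
import numpy as np

def input_current(theta, theta_pref):
    # Equations 3.1-3.2: periodically wrapped gaussian tuning.
    G = lambda x: 1.5 * np.exp(-x**2 / 2.0) - 0.5
    return (G(theta - theta_pref)
            + G(theta - theta_pref - 2 * np.pi)
            + G(theta - theta_pref + 2 * np.pi))

def rate(x, g, s):
    # Equation 2.1: sigmoidal response with gain g and shift s.
    return 1.0 / (1.0 + np.exp(-g * (x - s)))

def forward(theta, theta_pref, g_in, s_in, w, g_out=3.0, s_out=1.0):
    # Input-unit rates, then output rates via equation 3.3.
    r = rate(input_current(theta, theta_pref), g_in, s_in)
    R = rate(w @ r / len(r), g_out, s_out)
    return R, r

def hebbian_step(w, R, r, eps_w=0.03, alpha=5.5):
    # Equation 2.2 (Hebbian growth, proportional to pre * post) plus
    # equation 2.3 (divisive normalization keeping each row sum at alpha).
    w = w + eps_w * np.outer(R, r)
    return alpha * w / w.sum(axis=1, keepdims=True)
```

Note that `hebbian_step` never sees an error signal; supervision enters only through the gains and shifts passed to `forward`.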
3.1 Action of the Supervisor. The supervisor in our network model computes an error by comparing the firing rates of the output units to the values of the target functions for each stimulus. It uses a stochastic gradient-descent algorithm to adjust the gain and shift values for the input units of the network of Figure 1 in such a way that the error,

E(θ) = (1/2) ∑_{a=1}^{M} (Ra(θ) − Fa(θ))² ,   (3.4)
is reduced after each stimulus presentation. Here, Ra is the response of output unit a, and Fa is the target response for that unit. As stated above, synaptic weights in the network are subject to Hebbian synaptic plasticity as described by equations 2.2 and 2.3, with εw = 0.03 and α = 5.5. Error-based supervision is used to vary the shift and gain parameters (si and gi) that control neuronal responsiveness. These are set to the initial values si = 1 and gi = 3 for all units, but are then changed by the supervised learning algorithm. During each run, a stimulus value θ is chosen randomly in the range from 0 to 2π, and the resulting output rates are computed. Then the shifts and gains for all the input units of the network are updated according to the rules

si → si − εs ∂E(θ)/∂si   and   gi → gi − εg ∂E(θ)/∂gi ,   (3.5)
where εs and εg are small parameters that control the rate of response modulation. For our simulations, these took the values εs = εg = 0.2/M. This process is repeated until performance stops improving. We could also adjust the corresponding parameters sa and ga for the output units, but for the examples we give, this is unnecessary. Instead, these have been held at their initial values sa = 1 and ga = 3 for all a. The adjustment of output shifts and gains is unnecessary in the examples we present because we have chosen parameters so that the mean of the output response, averaged across all stimuli, is equal to the stimulus average of the target function. This is not essential; it was done primarily to simplify the presentation.

It is useful to compare and contrast our approach with the conventional use of the delta rule in this situation. In the conventional approach, in which synaptic plasticity is supervised, the error in equation 3.4 is differentiated with respect to the synaptic weight wai. This weight is then updated according to the rule (assuming a gain of one)

wai → wai − εw ∂E/∂wai = wai + εw (Fa − Ra) Ra′ ri ,   (3.6)

where Ra′ stands for the derivative of the response of output unit a with
Figure 3: Function approximation by supervised response modulation acting alone without synaptic plasticity. Upper curves show the response of the single output unit to various stimulus values (dots) and the target function (line). The lower panels show a sampling of the responses of the 230 input units as a function of the stimulus value. (A) State of the network before learning. All input responses have the same shifts and gains, and the output is independent of the stimulus. (B) State of the network after learning. The output unit responses match the target function due to modulation of the input unit responses.
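The supervised updates of equation 3.5, which produce behavior like that shown in Figure 3, can be sketched with a finite-difference gradient in place of the analytic one (a minimal, self-contained illustration; the function names, the numerical gradient, and the parameter values in the test are our own simplifications):

```python
import numpy as np

def net_output(theta, s, g, theta_pref, w, g_out=3.0, s_out=1.0):
    # Feedforward pass: equations 3.1-3.3.
    G = lambda x: 1.5 * np.exp(-x**2 / 2.0) - 0.5
    I = (G(theta - theta_pref) + G(theta - theta_pref - 2 * np.pi)
         + G(theta - theta_pref + 2 * np.pi))
    r = 1.0 / (1.0 + np.exp(-g * (I - s)))
    return 1.0 / (1.0 + np.exp(-g_out * (w @ r / len(r) - s_out)))

def error(theta, s, g, theta_pref, w, F):
    # Equation 3.4: squared output error for one stimulus.
    return 0.5 * np.sum((net_output(theta, s, g, theta_pref, w) - F(theta))**2)

def modulation_step(theta, s, g, theta_pref, w, F, eps=0.05, h=1e-5):
    # Equation 3.5, with the partial derivatives of E with respect to
    # each s_i and g_i estimated by central differences.
    for p in (s, g):
        grad = np.empty_like(p)
        for i in range(len(p)):
            p[i] += h
            e_hi = error(theta, s, g, theta_pref, w, F)
            p[i] -= 2 * h
            e_lo = error(theta, s, g, theta_pref, w, F)
            p[i] += h
            grad[i] = (e_hi - e_lo) / (2 * h)
        p -= eps * grad   # descend the error surface in place
    return s, g
```

Repeated application of `modulation_step` on a fixed stimulus drives the output error down, with the synaptic weights held fixed, mirroring the response-modulation-only regime of Figure 3.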
that relied solely on synaptic modification. This is because the tuning curve narrowing seen in Figure 2C can somewhat ameliorate problems with approximating rapidly varying functions. In the following examples, we choose to approximate sinusoidally varying functions and do not present examples with other types of functions. All the networks shown can produce equivalent results with any target functions for which the input responses provide an adequate basis.

Figure 4: Examples of learning through response modulation. Three different functions are approximated by response modulation. The left column shows the shift and gain variables for all 460 of the input units of the network, each dot representing one input unit. Gains and shifts can also be seen in the sampling of input unit response curves in the middle column. The right column shows the output unit response (dots) and target function (line) plotted against the stimulus value.

4.1.2 Transfer of Learning to Synapses. In the previous section, we considered supervised response modulation acting alone. We now add to this a Hebbian plasticity mechanism. In other words, we now use both equation 3.5 and equation 2.2 during learning. The combined effect of supervised response modulation and unsupervised synaptic plasticity is illustrated in Figure 5. As before, the network is initialized with uniform weights and all shifts and gains set to the same values. The supervisor then modifies response properties to minimize the output error. Early on during the learning process (top row of plots), the performance of the network relies almost entirely on the response modulation of the input units produced by the supervisor (illustrated by the distribution of input responses in the top row, left panel). At this point, the weights have hardly changed from their initial values (as seen in the top row, center panel). However, as the simulation progresses, the weight changes become progressively larger (second row, center panel), and the response modulations become progressively smaller (second row, left panel). Ultimately, the weights take on the cosine shape of the target function (third row, center panel), and the responses are almost uniform for all the input units (third row, left panel), as they were at the beginning of the learning process. Note that a stable equilibrium is reached when Hebbian modification and response modulation act together. Once the response modulation and Hebbian plasticity have equilibrated, the supervisory input can be removed altogether, returning all shifts and
Figure 5: Supervised response modulation along with unsupervised synaptic plasticity allows for the transfer of learning to the synapses. The left column shows a sampling of the responses of the 230 input units as a function of the stimulus value. The middle column depicts the synaptic weights of the network plotted as a function of the preferred stimulus value for the presynaptic neuron. The right column shows the responses of the output unit (dots) and the target function (line), plotted against the stimulus value. The top three rows of plots, from top to bottom, show the gradual transfer of learning as the network changes from relying primarily on response modulation (top row of plots) to relying primarily on the pattern of modified synaptic weights (third row of plots). The bottom row illustrates that the network can perform fairly well even when response modulation is totally eliminated, once the appropriate pattern of synaptic weights has been established.
Supervised Learning Through Neuronal Response Modulation
621
gains to their default values, and yet the network can still generate a good approximation of the target function (bottom row of Figure 5) (although, for stability, this necessitates the deactivation of Hebbian plasticity). Unsupervised synaptic plasticity thus allows the supervisor to contribute progressively less as the burden of representing the target function is taken up by the synapses.

Supervised response modulation plays three critical roles in guiding the Hebbian development of synapses capable of performing the function approximation task. First, because supervised response modulation acting alone can solve the task, the supervisor can act through the input units to effectively clamp the output to the correct response profile while Hebbian plasticity is taking place.

Second, by increasing the responsiveness of appropriate input units while clamping the output to the target function, supervised response modulation sets up the appropriate pattern of correlation across the synapses of the network to guide Hebbian modification. For example, input units that are important contributors to the correct output response will be pushed to high levels of responsiveness by the supervisor, enhancing their correlations with the correctly clamped output unit. This causes the synapses connecting such units to the output to grow rapidly. Input units not needed for the task will be made unresponsive by the supervisor, so their synapses to the output unit will not be enhanced by the Hebbian modification rule. Instead, these synapses will be weakened due to the synaptic normalization constraint.

Finally, we consider a third role for supervised response modulation on the basis of an analysis of Hebbian modification. The form of synaptic plasticity we are using, Hebbian synaptic modification in conjunction with divisive normalization, ultimately sets synaptic weights in this case equal to

$$w_i = \frac{\alpha\, \overline{F r_i}}{\sum_j \overline{F r_j}}, \tag{4.1}$$

where

$$\overline{F r_i} = \frac{1}{2\pi} \int_0^{2\pi} d\theta\, F(\theta)\, r_i(\theta). \tag{4.2}$$
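As a sanity check, the fixed point in equation 4.1 can be reproduced numerically. The sketch below uses illustrative Gaussian tuning curves and a toy target function (our choices, not the paper's parameters) and iterates a Hebbian growth step followed by divisive normalization:

```python
import numpy as np

N = 50
theta = np.linspace(0.0, 2.0 * np.pi, 200, endpoint=False)
prefs = np.linspace(0.0, 2.0 * np.pi, N, endpoint=False)

# Periodic Gaussian tuning curves r_i(theta); row i is input unit i.
d = np.angle(np.exp(1j * (theta[None, :] - prefs[:, None])))
r = np.exp(-d ** 2 / (2.0 * 0.3 ** 2))

F = 1.0 + 0.5 * np.cos(theta)   # positive toy target function F(theta)
alpha = 1.0

# <F r_i> = (1 / 2 pi) * integral of F(theta) r_i(theta) dtheta   (eq. 4.2)
Fr = (F[None, :] * r).mean(axis=1)

# Predicted steady state: w_i = alpha <F r_i> / sum_j <F r_j>   (eq. 4.1)
w_star = alpha * Fr / Fr.sum()

w = np.ones(N) / N   # arbitrary normalized initial weights
for _ in range(500):
    w = w + 0.1 * Fr            # Hebbian growth: output clamped to F, so dw_i ~ <F r_i>
    w = alpha * w / w.sum()     # divisive normalization keeps sum(w) = alpha
```

After the loop, `w` agrees with `w_star` to numerical precision; the normalization step contracts the deviation from the fixed point geometrically at each iteration.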
To simplify the analysis, we consider a linear approximation for the response function of the output unit rather than the full sigmoidal form of equation 3.3. In this case and for these weights, the condition that the output response matches the target function,

$$R = \sum_{i=0}^{N} w_i r_i = F, \tag{4.3}$$
C. Swinehart and L. Abbott
requires that

$$\alpha \sum_{i=1}^{N} r_i(\theta)\, r_i(\theta') = \delta(\theta - \theta') \sum_{j=0}^{N} \overline{F r_j}. \tag{4.4}$$
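The left side of equation 4.4 can be visualized numerically. In the toy check below (illustrative Gaussian tuning curves; parameters are our choices), the kernel Σ_i r_i(θ)r_i(θ′) is peaked on the diagonal θ = θ′, and narrowing the tuning curves sharpens the peak:

```python
import numpy as np

def halfwidth(sigma, N=100, n=400):
    """Grid points above half maximum in one row of
    K(theta, theta') = sum_i r_i(theta) r_i(theta'),
    for Gaussian tuning curves of width sigma."""
    theta = np.linspace(0.0, 2.0 * np.pi, n, endpoint=False)
    prefs = np.linspace(0.0, 2.0 * np.pi, N, endpoint=False)
    d = np.angle(np.exp(1j * (theta[None, :] - prefs[:, None])))
    r = np.exp(-d ** 2 / (2.0 * sigma ** 2))
    K = r.T @ r                  # K[a, b] = sum_i r_i(theta_a) r_i(theta_b)
    row = K[0] / K[0, 0]         # slice through the peak at theta = theta'
    return int((row > 0.5).sum())

broad = halfwidth(0.6)
narrow = halfwidth(0.15)
# narrow < broad: narrower tuning brings the sum closer to a delta profile
```

This is the sense in which the narrowness of the input tuning curves limits how well the sum can match a δ function profile.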
The third role of supervised response modulation is to make this equation as near to an equality as possible. The accuracy with which the sum on the left side of this equation can match a δ function profile depends on the narrowness of the tuning curves of the input units. This places a limit on the degree to which rapidly varying target functions can be reproduced, but modifications in the gain and shift variables can improve this situation by narrowing the input tuning curves. More important, supervised response modulation acts to ensure that the normalization condition implied by equation 4.4 is met, and this is what ultimately allows Hebbian plasticity to solve the problem (Salinas & Abbott, 2000). Thus, by acting on these multiple levels, supervised response modulation guides Hebbian plasticity to a solution of the function approximation task.

4.2 Networks with Multiple Output Units. Networks with multiple output units present a greater challenge to the form of supervised learning we are proposing than do single-output networks. In particular, situations frequently arise where the representation of different functions by different output units cannot be achieved by response modulation alone due to shared input.

Two cases are simple to analyze. If the connectivity between the input and output units is all-to-all with equal weights, response modulation alone is clearly unable to produce different responses in the output units. With all-to-all coupling, all the output units receive the same total drive, and whatever modulation is done at the input level affects all of the output units in the same way. Unsupervised synaptic plasticity does not help because the input correlation structure seen by the synapses to each output unit is identical, so the synapses will all be modified in an identical manner.
Basically, the problem with all-to-all coupling is symmetry; all the output units are equivalent, and supervised modulation of input responses is not sufficient to break this symmetry and allow the output units to respond differently to the stimulus. Although we have assumed that the synapses take identical values, setting the initial synaptic weights to different values does not fix this problem. At the opposite extreme, if the coupling from input to output units is so sparse that each input unit projects to just a single output unit, the situation reduces to multiple copies of the single-output case, and the analysis becomes a trivial extension of what was done in the previous section.

In this section, we consider intermediate cases where the input-to-output connectivity is not all-to-all, but there is nevertheless considerable overlap in the input to different output units. We start by considering the case of two output units and construct networks with various amounts of overlap in
[Figure 6 graphics: panel titles read "Response Modulation Alone," "RM with Hebbian Learning," and "Response Modulation with Hebbian Learning" (right panel: 95% of inputs shared); vertical axes show firing rate.]
Figure 6: Unsupervised synaptic plasticity allows learning to succeed where interference causes supervised response modulation, acting without synaptic plasticity, to fail. In the left panel, a network of 460 input units connected with 88% overlap to two output units was simulated with response modulation acting without synaptic plasticity. The responses of the two output units, denoted by filled and open dots, failed to match the two target functions, indicated by dashed and solid curves, as a function of the stimulus value. When Hebbian plasticity was included, the two output responses accurately matched the target functions (middle panel). The combination of supervised response modulation and Hebbian plasticity continued to produce accurate outputs until the common input to the output units was increased to 95% (right panel).
the projections they receive from the input units. For an overlap of q, the number of input units that project to both output units is qN, and the number that project to only a single output unit is (1 − q)N. We ask whether, in such cases, unsupervised synaptic plasticity can exploit small differences in the drive to each output unit to break the symmetry and allow the output units to represent different functions.

Figure 6 illustrates the ability of unsupervised synaptic plasticity to play the role of a symmetry-breaking mechanism. In the first panel, supervised learning acting without synaptic plasticity has set the shifts and gains to their optimal values, but due to the degree of interference caused by shared input, the approximation is quite poor, and neither target function has been matched. The network has essentially split the difference between the two functions, with only small disparities between the responses of the two output units. However, when Hebbian plasticity is activated, it is able to exploit and amplify these small differences to improve performance dramatically. This ultimately leads to a match of the two different target functions (see the center panel of Figure 6). At this point, the supervisory input is no longer necessary (provided that the Hebbian process is halted). Thus, the combination of supervised response modulation and unsupervised synaptic plasticity allows the network to perform this task at a level that could not be achieved via response modulation alone.

The problem of indirectly supervising the plasticity of NM synapses by modulating only N neurons might at first appear to be a crippling limitation
of response modulation. The example of Figure 6 shows at least one case in which this problem is not nearly as severe as might have been imagined. Separation of the two output units could still be achieved when they shared up to 95% of their inputs. However, it is critical to the success of supervision by response modulation that the requirement of a unique component for the input to each output unit scale appropriately as the size of the network and the number of output units increase. Initially, we investigate this issue by varying the amount of shared input in a network with two output units.

The result of varying the proportion of shared inputs in two-output networks of different sizes, when both supervised response modulation and unsupervised synaptic plasticity are active, is shown in Figure 7. In this figure and in Figure 8, performance is quantified by computing the error of equation 3.4, divided by the number of output units. The left panel of this figure shows that when learning occurs via response modulation alone without synaptic plasticity, errors begin to grow once the proportion of shared inputs exceeds 50% (dashed curves in Figure 7A). Performance is virtually identical for different input population sizes. With Hebbian plasticity included, the required number of unique inputs decreases dramatically, with little error accumulating until 90% to 95% of inputs are shared (solid curves in Figures 7A and 7B). The point of the transition from small errors to large errors appears to be roughly the same for all the network sizes studied (see Figure 7B). The main effect of increasing the number of input units is to make the transition point, where the function approximation network fails, sharper. This suggests that a discontinuous phase transition occurs at a critical percentage of about 94% shared input in the N → ∞ limit.

We now extend these results to networks with more than two output units.
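The two connectivity constructions used in this section can be sketched as follows: for two outputs, qN inputs project to both units and the remaining (1 − q)N are split between them; for M outputs, each input-to-output connection is drawn independently with probability p and then frozen. Sizes and the random seed below are illustrative:

```python
import numpy as np

def two_output_mask(N, q, rng):
    """Boolean (2, N) connectivity: round(q*N) inputs drive both outputs;
    the remaining inputs are split evenly between the two outputs."""
    n_shared = int(round(q * N))
    perm = rng.permutation(N)
    shared, unique = perm[:n_shared], perm[n_shared:]
    mask = np.zeros((2, N), dtype=bool)
    mask[:, shared] = True
    half = unique.size // 2
    mask[0, unique[:half]] = True
    mask[1, unique[half:]] = True
    return mask

def random_mask(N, M, p, rng):
    """Boolean (M, N) connectivity: each input->output connection exists
    independently with probability p; absent connections never form later."""
    return rng.random((M, N)) < p

rng = np.random.default_rng(0)
m2 = two_output_mask(N=1000, q=0.88, rng=rng)      # two-output overlap network
mM = random_mask(N=1000, M=20, p=0.9, rng=rng)     # multi-output random network
```

Only connections present in the mask would be subject to Hebbian plasticity in a simulation; the mask itself stays fixed.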
In this case, the proportion of shared inputs (q) is not appropriate for describing all the different possibilities for sharing projections from the input units. Instead, we use the connection probability (p) to characterize the networks we study. To construct these networks, we introduce a connection between any one of the N input units and any one of the M output units with probability p. If such a connection is formed, it is subject to Hebbian plasticity. If no connection forms during this stochastic initial wiring, the connection remains absent for the entire duration of the simulation. The connection probability controls the sparseness of the network in that small values of p correspond to sparse connectivity. Figure 8 shows function approximation errors for networks with different numbers of output units as a function of the connection probability p, for two sizes of input unit populations. In this case, both response modulation and synaptic plasticity are activated. As in the two-output case, interference does not become a serious impediment to learning until p reaches the .9 to .95 range, indicating that truly unique inputs are not necessary. Rather, the requirement is a certain degree of sparseness. Also noteworthy is the fact that the output population size can approach one-third of the total number
[Figure 7 graphics: two panels (A, B) plotting mean final error against the proportion of inputs shared, for several input population sizes.]
Figure 7: Error as a function of the proportion of shared input for a two-output network with different numbers of input units. Dashed lines show the mean error per output unit resulting from supervised response modulation without synaptic plasticity. Solid lines are from runs that also employed Hebbian plasticity. Individual lines correspond to different numbers of input units (N). The functions being approximated are cosine and sine. (A) Network performance degrades as the proportion of inputs projecting to both outputs increases. (B) Detailed view of the results for supervised response modulation with Hebbian plasticity shown in panel A.
of inputs before performance begins to suffer from interference (provided p values are not too large).

Our results indicate that connection probability is the dominant factor that controls whether a network with multiple output units, using both supervised response modulation and Hebbian plasticity, can function properly. The required sparseness in the connectivity is not stringent. Furthermore, approximation of multiple functions is possible even when output population size is a significant fraction of the total input population size.

The analysis of networks with multiple output units is more difficult than in the single-output case, but some of the same basic principles apply. In this case, the supervisor cannot clamp the output units to their target functions, but the existence of even a small number of symmetry-breaking synapses is sufficient to break this impasse. These synapses initially get quite strong and drive the output units away from the degenerate state in which they are all the same, which starts off the combined response modulation and Hebbian learning process. Through this process, the bulk of the synapses ultimately come to obey the multi-output generalization of equation 4.1,

$$w_{ai} = \frac{\alpha\, \overline{F_a r_i}}{\sum_j \overline{F_a r_j}}. \tag{4.5}$$
Similar to the result in the one-output case and making the same linear
[Figure 8 graphics: legible fragments include the panel title "Small Input Population" and the axis label "Output Population Size"; the caption was not recovered.]
Figure 9: Learning under the random walk supervisor. (A) The paths in modulation space of three input units controlled by the supervisor. All three units started with the same shifts and gains (marked Start), but these then diverged as the supervisor found values that accomplished the function approximation task (End). (B) The end result is a good approximation (dots) of the target function (line) by the output unit as a function of stimulus value. In this simulation, 230 input units drove a single output unit.
Cauwenberghs, 1993; Doya & Sejnowski, 1995; O'Reilly, 1996; Xie & Seung, 2004; Seung, 2003). For stochastic reward-based supervision, two N-dimensional "modification" vectors, v^s and v^g, of unit length were generated randomly, one for shifts and one for gains. For all i values, the shift and gain of unit i was incremented by an amount proportional to component i of the appropriate modification vector,

$$s_i \to s_i + v_i^s \quad \text{and} \quad g_i \to g_i + v_i^g. \tag{4.7}$$
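A minimal sketch of this random-walk supervision, run on a stand-in quadratic error rather than the full network (the parameter vector below plays the role of the concatenated shifts and gains; the step size, dimension, and error function are our choices):

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 10
target = rng.standard_normal(dim)   # hypothetical optimal parameter setting

def summed_error(params):
    # Stand-in for the summed error over one epoch of stimulus presentations.
    return float(np.sum((params - target) ** 2))

def unit_vector(rng, dim):
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

params = np.zeros(dim)
v = unit_vector(rng, dim)           # random unit-length modification vector
step = 0.05
prev = summed_error(params)
start = prev

for _ in range(5000):
    params = params + step * v      # always apply the increment (eq. 4.7)
    err = summed_error(params)
    if err > prev:                  # error rose: draw a new random direction
        v = unit_vector(rng, dim)
    prev = err
```

Keeping the current direction while the error falls, and redrawing it otherwise, steadily though slowly reduces the error, mirroring the epoch-based acceptance rule described in the text.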
Simulations were divided into epochs of 20 stimulus presentations and error evaluations. After each epoch, the sum of the 20 errors was compared to the summed error from the previous epoch. If this total error was less than it was previously, the modification vectors were left unchanged. If the summed error increased from the previous epoch, new modification vectors, v^s and v^g, were generated randomly. In either case, the resulting modification vector was then used to further increment the shifts and gains, as described above. In this study of random walk learning through response modulation, we do not include any Hebbian synaptic modification.

This strategy has the effect of steadily, although slowly, reducing the average error. The paths through modulation space of three input units over the course of a run are plotted in Figure 9A. The improvement in performance can be seen in the reduction in the lengths of the line segments seen in the traces. At the beginning of the run, most of these segments are relatively long as the network makes coarse adjustments to approach
the target function. Later, more frequent trajectory changes appear as the network approaches a solution and makes fine adjustments. Figure 9B illustrates that this crude strategy is capable of solving the task, given enough time (in this case 600 iterations). Thus, a random walk supervisor strategy that requires much less information and algorithmic sophistication than gradient descent can, at least in simple cases, provide adequate supervision.

There are clearly severe limitations on the sizes of networks that can be trained by this random walk algorithm. As the network grows in size, the algorithm gets prohibitively slow. In section 5, we propose ways that this problem might be addressed to achieve better scaling of performance with network size.

5 Discussion

The novel feature of the supervised learning scheme we have proposed is that supervision takes place at the level of neuronal responsiveness rather than synaptic plasticity. Two apparent disadvantages of this scheme, that it does not lead to permanent network modification and that it severely limits the number of elements being supervised, appear to be far less severe than might have been imagined at first. By guiding synaptic plasticity that is otherwise unsupervised, supervised response modulation can lead to permanent changes that allow a network to operate effectively, even in the absence of supervision. Furthermore, Hebbian plasticity can take advantage of small inhomogeneities in randomly coupled networks to allow independent changes in synapses to output units that share presynaptic input.

Given that it works, there are some potential advantages of supervising neuronal excitability rather than synaptic plasticity. First, supervision can occur through ordinary feedback projections that can act rapidly and can target individual neurons independently. Additional advantages concern the nature of the supervisory circuit.
We have not attempted to construct a realistic model of this circuit, but we envision it as a network capable of maintaining a continuum of stable, self-sustained patterns of activity (Compte, Brunel, Goldman-Rakic, & Wang, 2000; Seung, Lee, Reis, & Tank, 2000). Such networks tend to drift, especially if provided with noisy input. Thus, it might be possible to implement the random walk supervisor as a network with self-sustained activity and random drift, with the rate of drift controlled by noisy inputs that are suppressed by reward.

Whatever the form of the supervisory circuit, modulating neuronal responsiveness instead of synaptic plasticity has a number of tactical advantages. We considered two approaches to supervision: gradient descent, which involves more information and mathematical analysis than we would expect from a neural circuit, and a random walk model that uses less. A real circuit should lie somewhere between these extremes. From the point of
view of the supervisor, the fact that there are far fewer neurons than synapses to supervise changes from a disadvantage to an advantage. The supervisor must search in the space of the parameters it is modifying for a solution to the problem at hand. By reducing the dimension of this space, supervision of neuronal responses, provided that it works (and we have shown that it does), is far easier than supervision of more numerous synapses.

Another advantage of supervising neuronal responses is that the supervisor can monitor the activities that it is modulating in a way that is impossible with supervised synaptic plasticity. It is almost inevitable that the function approximation network, which receives input from the supervisor circuit, would also send projections to it. Such reciprocal connectivity is a typical feature of neuroanatomy. These projections allow the supervisor to monitor the activity of the units it is supervising and use this information to guide learning. For example, this information could be used to reduce the dimension of the space in which the supervisor must search for solutions of the task being learned.

Consider, for example, two input units in the function approximation network that have almost totally overlapping response profiles. It is rather wasteful for the supervisor to vary the response properties of these two neurons independently, and yet this is what was done in the random walk model we studied. A more "intelligent" supervisor would use information about the correlations between the units it is modulating to find strategies that are most likely to produce large changes in the network being supervised, and to avoid wasting time generating modulations that have little effect. Thus, projections from the supervised units to the supervisor could be part of a secondary modulatory process that allows the supervisor to learn about learning.
Acknowledgments

This research was supported by the National Science Foundation (IBN-0235463) and NSF-DGE-9972756 and the Sloan-Swartz Center for Theoretical Neurobiology at Brandeis University.

References

Barto, A. G., Sutton, R. S., & Anderson, C. W. (1983). Neuron-like adaptive elements that can solve difficult learning control problems. IEEE Trans. on Systems, Man and Cybernetics, 13, 834–846.

Cauwenberghs, G. (1993). A fast stochastic error-descent algorithm for supervised learning and optimization. In C. L. Giles, S. J. Hanson, & J. D. Cowan (Eds.), Advances in neural information processing systems, 5 (pp. 244–251). San Mateo, CA: Morgan Kaufmann.

Chance, F. S., Abbott, L. F., & Reyes, A. D. (2002). Gain modulation through background synaptic input. Neuron, 35, 773–782.
Chauvin, Y., & Rumelhart, D. E. (Eds.). (1995). Backpropagation: Theory, architectures, and applications. Hillsdale, NJ: Erlbaum.

Compte, A., Brunel, N., Goldman-Rakic, P. S., & Wang, X. J. (2000). Synaptic mechanisms and network dynamics underlying spatial working memory in a cortical network model. Cereb. Cortex, 10, 910–923.

Dayan, P., & Abbott, L. F. (2001). Theoretical neuroscience: Computational and mathematical modeling of neural systems. Cambridge, MA: MIT Press.

Doiron, B., Longtin, A., Berman, N., & Maler, L. (2001). Subtractive and divisive inhibition: Effect of voltage-dependent inhibitory conductances and noise. Neural Comput., 13, 227–248.

Doya, K., & Sejnowski, T. J. (1995). A novel reinforcement model of birdsong vocalization learning. In G. Tesauro, D. Touretzky, & T. Leen (Eds.), Advances in neural information processing systems, 7 (pp. 101–108). Cambridge, MA: MIT Press.

Hertz, J., Krogh, A., & Palmer, R. G. (1991). Introduction to the theory of neural computation. Redwood City, CA: Addison-Wesley.

Jabri, M., & Flower, B. (1992). Weight perturbation: An optimal architecture and learning technique for analog VLSI feedforward and recurrent multilayer networks. IEEE Trans. on Neural Networks, 3, 154–157.

Lukashin, A. V., Wilcox, G. L., & Georgopoulos, A. P. (1994). Overlapping neural networks for multiple motor engrams. Proc. Natl. Acad. Sci. U.S.A., 91, 8651–8654.

Mazzoni, P., Andersen, R. A., & Jordan, M. I. (1991). A more biologically plausible learning rule for neural networks. Proc. Natl. Acad. Sci. U.S.A., 88, 4433–4437.

Miller, K. D. (1996). Receptive fields and maps in the visual cortex: Models of ocular dominance and orientation columns. In E. Domany, J. L. van Hemmen, & K. Schulten (Eds.), Models of neural networks (Vol. 3, pp. 55–78). New York: Springer-Verlag.

O'Reilly, R. C. (1996). Biologically plausible error-driven learning using local activation differences: The generalized recirculation algorithm. Neural Computation, 8, 895–938.
Poggio, T. (1990). A theory of how the brain might work. Cold Spring Harbor Symposium on Quantitative Biology, 55, 899–910.

Prescott, S. A., & De Koninck, Y. (2003). Gain control of firing rate by shunting inhibition: Roles of synaptic noise and dendritic saturation. Proc. Natl. Acad. Sci. U.S.A., 100, 2076–2081.

Salinas, E., & Abbott, L. F. (2000). Do simple cells in primary visual cortex form a tight frame? Neural Comp., 12, 313–336.

Schultz, W., Dayan, P., & Montague, P. R. (1997). A neural substrate of prediction and reward. Science, 275, 1593–1599.

Seung, S. (2003). Learning in spiking neural networks by reinforcement of stochastic synaptic transmission. Neuron, 40, 1063–1073.

Seung, H. S., Lee, D. D., Reis, B. Y., & Tank, D. W. (2000). Stability of the memory of eye position in a recurrent network of conductance-based model neurons. Neuron, 26, 259–271.

Widrow, B., & Hoff, M. E. (1960). Adaptive switching circuits. WESCON Convention Report, 4, 96–104.
Widrow, B., & Stearns, S. D. (1985). Adaptive signal processing. Englewood Cliffs, NJ: Prentice Hall.

Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8, 229–256.

Xie, X., & Seung, S. (2004). Learning in neural networks by reinforcement of irregular spiking. Physical Review E, 69, 041909.

Received October 2, 2003; accepted July 28, 2004.
LETTER
Communicated by Bard Ermentrout
The Combined Effects of Inhibitory and Electrical Synapses in Synchrony

Benjamin Pfeuty
[email protected]
Neurophysique et Physiologie du Système Moteur, Université René Descartes, 75270 Paris Cedex 06, France
Germán Mato
[email protected]
Comisión Nacional de Energía Atómica and CONICET, Centro Atómico Bariloche and Instituto Balseiro, R. N., Argentina
David Golomb
[email protected]
Department of Physiology and Zlotowski Center for Neuroscience, Faculty of Health Sciences, Ben Gurion University of the Negev, 84105 Be'er-Sheva, Israel
David Hansel
[email protected]
Interdisciplinary Center for Neural Computation, Hebrew University, 91904 Jerusalem, Israel, and Neurophysique et Physiologie du Système Moteur, Université René Descartes, 75270 Paris Cedex 06, France
Recent experimental results have shown that GABAergic interneurons in the central nervous system are frequently connected via electrical synapses. Hence, depending on the area or the subpopulation, interneurons interact via inhibitory synapses or electrical synapses alone or via both types of interactions. The theoretical work presented here addresses the significance of these different modes of interaction for interneuron network dynamics. We consider the simplest system in which this issue can be investigated in models or in experiments: a pair of neurons, interacting via electrical synapses, inhibitory synapses, or both, and activated by the injection of a noisy external current. Assuming that the couplings and the noise are weak, we derive an analytical expression relating the cross-correlation (CC) of the activity of the two neurons to the phase response function of the neurons. When electrical and inhibitory interactions are not too strong, they combine their effects in a linear manner. In this regime, the effect of electrical and inhibitory interactions when combined can be deduced knowing the effects of each of the interactions separately. As a consequence, depending on intrinsic neuronal properties, electrical and inhibitory synapses may cooperate, both promoting synchrony, or may compete, with one promoting synchrony while the other impedes it. In contrast, for sufficiently strong couplings, the two types of synapses combine in a nonlinear fashion. Remarkably, we find that in this regime, combining electrical synapses with inhibition amplifies synchrony, whereas electrical synapses alone would desynchronize the activity of the neurons. We apply our theory to predict how the shape of the CC of two neurons changes as a function of ionic channel conductances, focusing on the effect of the persistent sodium conductance, of the firing rate of the neurons, and of the nature and strength of their interactions. These predictions may be tested using dynamic clamp techniques.

Neural Computation 17, 633–670 (2005)
© 2005 Massachusetts Institute of Technology

1 Introduction

Electrical synapses have long been known to exist in invertebrates (Watanabe, 1958; Furshpan & Potter, 1959), but only recently has evidence of their ubiquity been unequivocally found in the mammalian brain. Remarkably, electrical synapses in the CNS are frequently found to connect GABAergic interneurons. This is, for instance, the case in the striatum (Kita, Kosaka, & Heizmann, 1990), the hippocampus (Venance et al., 2000), the cerebellum (Mann-Metzer & Yarom, 1999), the reticular thalamic nucleus (Landisman et al., 2002), and the neocortex (Gibson, Beierlein, & Connors, 1999; Galarreta & Hestrin, 1999; Fukuda & Kosaka, 2000). In the neocortex, electrical as well as inhibitory synapses exist between fast spiking (FS) interneurons (Gibson et al., 1999; Galarreta & Hestrin, 2002) or between multipolar bursting neurons (Blatow et al., 2003). In contrast, low threshold spiking (LTS) interneurons interact mostly via electrical synapses, the inhibitory synapses between them being sparse (Gibson et al., 1999; Amitai et al., 2002). FS and LTS neurons interact reciprocally via inhibition but not via electrical synapses.
Recent experiments have shown that electrical synapses may be involved in synchronizing neural activity. Blocking of inhibition and excitation does not reduce the synchronization of interneurons in the molecular layer of the cerebellum between which electrical synapses have been identified (Mann-Metzer & Yarom, 1999). In slices of neocortex, in which LTS cells are activated by ACPD, the synchronous activity of LTS remains after blocking inhibition (Beierlein, Gibson, & Connors, 2000). Reciprocally, blockade of gap junctions through pharmacological agents (Draguhn et al., 1998; Perez-Velazquez, Valiante, & Carlen, 1994; Traub et al., 2001; Friedman & Strowbridge, 2003) or genetic manipulations (Deans, Gibson, Sellitto, Connors, & Paul, 2001; Hormuzdi et al., 2001; Blatow et al., 2003) reduces the synchronized activity in cortex or in hippocampus. Other studies suggest that electrical synapses alone are not sufficient for producing synchronous activity but that they are necessary. This has been shown in preparations of cortical slices where electrical synapses or inhibitory synapses alone are not able to synchronize neuronal activity (Tamas, Buhl, Lorincz, & Somogyi, 2000; Blatow et al., 2003). In contrast with these results, it has recently been reported that in inspiratory motoneurons, synchronous activity depends on inhibition and is strongly enhanced in the presence of CBX, a blocker of electrical synapses (Bou-Flores & Berger, 2001). Therefore, electrical synapses desynchronize neural activity in this case.

These experimental facts raise the following issues. How do intrinsic cellular properties affect synchrony of neurons coupled via inhibitory or electrical synapses? What is the significance of the combination of electrical and inhibitory couplings for neuronal dynamics? The goal of this work is to address these issues in the simplest system in which they can be answered in models or in experiments: a pair of neurons, interacting via electrical synapses, inhibitory synapses, or both, and activated by the injection of a noisy external current.

The pattern of firing of a pair of model neurons can be characterized by the cross-correlations (CCs) of their activity, which measure the probability that the two neurons fire with some given time delay. Of particular interest are the locations and widths of the peaks and troughs of the CCs. The main limitation in measuring CCs in experiments is the stationarity of the data. This puts constraints on the number of spikes one must accumulate. Simulations of neuronal models can be useful to estimate how long the recordings should be to get a reliable estimate of the CCs. Exploring in a systematic way the space of parameters of a model using only numerical simulations is not the best way to discover the general rules underlying the dependence of the CCs on neuronal properties. Here, we derive an analytical expression for the CC of a pair of spiking neurons under the assumptions of weak interactions and weak noise.
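Empirically, a CC of this kind is estimated from spike times with a histogram of pairwise lags. The sketch below generates two artificial trains (one a jittered copy of the other, purely illustrative) and bins the spike-time differences:

```python
import numpy as np

def cross_correlogram(t1, t2, T=0.05, dt=0.005):
    """Histogram of lags t2 - t1 (seconds) within +-T, bin width dt."""
    lags = (t2[None, :] - t1[:, None]).ravel()   # all pairwise spike-time differences
    lags = lags[np.abs(lags) <= T]
    edges = np.arange(-T, T + dt, dt)
    counts, _ = np.histogram(lags, bins=edges)
    return counts, edges

rng = np.random.default_rng(2)
t1 = np.sort(rng.uniform(0.0, 100.0, 2000))           # ~20 Hz Poisson-like train
t2 = np.sort(t1 + rng.normal(0.0, 0.002, t1.size))    # partner train, 2 ms jitter
counts, edges = cross_correlogram(t1, t2)
```

Because the second train is a jittered copy of the first, the correlogram peaks in the bins nearest zero lag, sitting on a flat background from unrelated spike pairs; the stationarity constraint mentioned above corresponds to needing enough spikes for these counts to be reliable.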
It provides valuable information about the way noise and intrinsic and synaptic properties affect the shape of the CCs. To our knowledge, this is the first time such an analytical expression has been obtained.

Only a few theoretical works have been devoted to neurons coupled via electrical synapses (Sherman & Rinzel, 1992; Chow & Kopell, 2000; Pfeuty, Mato, Golomb, & Hansel, 2003). Recently we combined analytical and numerical studies to investigate how cellular properties affect synchronization of activity of neurons coupled via electrical synapses (Pfeuty et al., 2003). We showed that the stability of the asynchronous state in a large network of tonically firing neurons and the way it is affected by the average neuronal firing rate depend crucially on the shape of the neuronal phase response function (for the definition of the PRF, see section 2). Our theory predicts that sodium or calcium currents impede synchrony of neurons coupled via electrical synapses, whereas potassium currents promote it. This study also shows that predictions for experiments derived in the framework of the leaky integrate-and-fire (LIF) model should be considered with caution. Indeed, we found that large networks of electrically coupled LIF neurons and conductance-based neurons behave in qualitatively different
ways unless strong potassium conductances are involved in the dynamics of the conductance-based model. This previous work focused on the conditions for antisynchronous states in pairs of neurons and for stability of asynchronous states in large networks. The analysis was also restricted to a single model. In this article, we complement that study by investigating in detail how the features of the CCs of a pair of electrically coupled neurons depend on their intrinsic properties. We also study in a two-compartment model how the location of the synapses, on the soma or on the dendrite, affects the CCs.

In contrast to the case of electrically coupled neurons, synchrony of neurons coupled via chemical synapses has been extensively studied. Many of the results that have shaped our concepts regarding this issue have been obtained in the framework of the LIF model (Abbott & van Vreeswijk, 1993; van Vreeswijk, Abbott, & Ermentrout, 1994; Hansel, Mato, & Meunier, 1995; Brunel & Hakim, 1999) or of a modified version of it, the spike response model (Gerstner & Kistler, 2002). Several works have also considered neuronal models with more realistic biophysics (Wang & Rinzel, 1992; Hansel, Mato, & Meunier, 1993; Hansel et al., 1995; van Vreeswijk et al., 1994; Wang & Buzsáki, 1996; Ermentrout, 1996; Crook, Ermentrout, & Bower, 1998a, 1998b; Golomb, Hansel, & Mato, 2001; Ermentrout, Pascal, & Gutkin, 2001; Acker, Kopell, & White, 2003). However, few general results relate intrinsic properties and synchronization behavior. Moreover, these results deal mostly with the stability of fully synchronized states and not the stability of the asynchronous state in large networks or antiphase locking of a pair of neurons. In this article, we provide new results regarding how intrinsic properties affect antiphase locking and bistability between in-phase and antiphase locking for a pair of reciprocally coupled inhibitory neurons.
Another issue investigated here is the interplay of inhibitory and electrical synapses. This issue has been studied analytically and numerically in the framework of the LIF model (Lewis & Rinzel, 2003) and in specific conductance-based models (Skinner, Zhang, Velazquez, & Carlen, 1999; Traub et al., 2001; Bartos et al., 2002; Nomura, Fukai, & Aoyagi, 2003; Bem & Rinzel, 2004). However, general principles governing this interplay are lacking. We show that when the coupling strengths are not overly strong, the two types of synapses combine linearly; as a consequence, depending on cellular properties, they can cooperate or act antagonistically. We derive rules to predict these effects. We also show that in the strong coupling regime, the two types of synapses can interact in a nonlinear manner. Indeed, we find a regime in which synchrony promoted by inhibition is amplified by electrical synapses, although these synapses alone do not promote synchrony. Finally, our work provides concrete predictions on the relationship between intrinsic properties and synchrony. We propose to test these predictions in experiments in which ionic currents or interactions are modified by
Inhibitory and Electrical Synapses in Synchrony
dynamic clamp (Sharp, O'Neil, Abbott, & Marder, 1993; Prinz, Abbott, & Marder, 2004). The letter is organized as follows. In section 2, we describe the methods employed in this work. We present the two models that are the framework of our study, the quadratic integrate-and-fire (QIF) model and a conductance-based model that involves sodium and potassium currents, and we derive a general formula for the CCs of weakly coupled spiking neurons. Section 3 is devoted to the results obtained in the framework of the QIF model. We first consider the situation of weak or moderate couplings, relying on our analytical formula for the CCs; we then check with numerical simulations that these results extend over a substantial range of coupling strengths and noise levels. The effects specific to strong coupling are subsequently analyzed. The theory developed in the first part of the article allows one to predict the effects of calcium, potassium, or sodium currents. For the sake of concreteness, we focus in section 4 on the effect of a persistent sodium current. The case of other currents is briefly considered in section 5.

2 Methods

2.1 The Models.

2.1.1 Single Neuron Dynamics.

Two models are considered in this work: the quadratic integrate-and-fire (QIF) model and a two-compartment conductance-based model. The QIF model is sufficiently simple that it can be studied analytically. In this model, the dynamical equation for the subthreshold membrane potential, v, is (Latham, Richmond, Nelson, & Nirenberg, 2000; Hansel & Mato, 2003; Pfeuty et al., 2003)
\[
\tau_m \frac{dv}{dt} = A(v - V_s)^2 + I_{ext}(t) - I_c, \tag{2.1}
\]
where τm = 10 ms, A = 0.25, and Ic = 0 are constant parameters and Iext (t) is the total external current to the neuron. The parameter Vs is varied. The membrane potential is measured in mV. Hence, A is measured in mV−1 and the “current” in mV. If Iext is constant in time and smaller than Ic , v converges to a fixed point at large time. In contrast, if Iext > Ic , the solutions to equation 2.1 diverge in finite time. We define action potential firing as v reaching some threshold value VT from below. The membrane potential is then instantaneously reset to some value Vr < VT . In this article, we assume that Vr < Vs < VT . The membrane potential between two action potentials (Iext > Ic ) can be expressed analytically, solving equation 2.1 with the initial condition v(0) =
Vr. It reads:
\[
v(t) = \sqrt{(I_{ext}-I_c)/A}\,\tan\!\left(\frac{\sqrt{A(I_{ext}-I_c)}}{\tau_m}\,t + \alpha\right) + V_s, \tag{2.2}
\]
where α is an integration constant determined by the condition that at t = 0, the membrane potential of the neuron is at its reset value, Vr:
\[
\alpha = \tan^{-1}\!\left(\frac{V_r - V_s}{\sqrt{(I_{ext}-I_c)/A}}\right). \tag{2.3}
\]
The condition v(T) = VT determines the firing period, and the firing frequency of the neuron is f = 1/T. One finds
\[
T = \frac{\tau_m}{\sqrt{A(I_{ext}-I_c)}}\left[\tan^{-1}\!\left(\frac{V_T - V_s}{\sqrt{(I_{ext}-I_c)/A}}\right) - \tan^{-1}\!\left(\frac{V_r - V_s}{\sqrt{(I_{ext}-I_c)/A}}\right)\right]. \tag{2.4}
\]
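As a numerical sanity check, the closed-form period of equation 2.4 can be compared against direct integration of equation 2.1. The sketch below is ours, not part of the original analysis; the parameter values τm = 10 ms, A = 0.25 mV⁻¹, Ic = 0, Vr = 0, and VT = 12 mV follow the text, while Iext and Vs are illustrative choices.

```python
import numpy as np

# Sketch: firing period of the QIF neuron, equation 2.4, checked against
# forward-Euler integration of equation 2.1. I_ext and V_s are illustrative.

def qif_period(I_ext, V_s, tau_m=10.0, A=0.25, I_c=0.0, V_r=0.0, V_T=12.0):
    """Closed-form period T (ms) from equation 2.4."""
    q = np.sqrt((I_ext - I_c) / A)
    return tau_m / (A * q) * (np.arctan((V_T - V_s) / q)
                              - np.arctan((V_r - V_s) / q))

def qif_period_numeric(I_ext, V_s, tau_m=10.0, A=0.25, I_c=0.0,
                       V_r=0.0, V_T=12.0, dt=1e-3):
    """Euler integration of equation 2.1 from reset (V_r) to threshold (V_T)."""
    v, t = V_r, 0.0
    while v < V_T:
        v += dt / tau_m * (A * (v - V_s) ** 2 + I_ext - I_c)
        t += dt
    return t

T_analytic = qif_period(I_ext=1.0, V_s=6.0)
T_numeric = qif_period_numeric(I_ext=1.0, V_s=6.0)
```

For Vs = 6 mV = (Vr + VT)/2 and Iext = 1 mV, both routes give T ≈ 50 ms, that is, f = 1/T ≈ 20 Hz.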
Near firing onset, Iext ≳ Ic, the frequency behaves like f ∝ √(Iext − Ic), whereas f increases linearly with Iext for large Iext. The subthreshold dynamics needs to be supplemented with a model for the suprathreshold part of the membrane potential time course. Assuming that the width of the action potentials is much smaller than the interspike intervals, we represent them by a δ-function at each time a spike is fired. Therefore, the membrane potential of the neuron can be written at time t,
\[
V(t) = v(t) + \theta \sum_{\mathrm{spikes}} \delta(t - t_{spike}), \tag{2.5}
\]
where θ measures the integral over time of the suprathreshold part of action potentials. For given Vr and VT , the membrane potential time course depends on the parameter Vs . This is shown in Figure 1A, where V(t) is plotted for Vr = 0, VT = 12 mV and different values of Vs . In all these examples, the external current is such that the neurons fire at f = 50 Hz. It is clear that the concavity of the potential time course depends on Vs . For Vs ≈ Vr , it is convex except right after a spike, whereas for Vs ≈ VT , it is concave except right before a spike. For Vs = (VT + Vr )/2, the membrane potential displays an inflexion point at half the firing period. The conductance-based model used in this work has two compartments. The somatic compartment has a leak current, IL , a fast sodium current, INa , a delayed rectifier potassium current, IK , and a persistent sodium current, INaP . Details about these currents are given in appendix A. The second compartment, which corresponds to the dendrite, is passive. For simplicity, we assume that the two compartments have equal areas. Therefore, the
Figure 1: Voltage traces and PRFs of the QIF neuron. Voltage traces and PRFs were computed using equations 2.2 and 3.3, respectively. The external current is such that the firing rate is 50 Hz (τ0 = 10 msec). (A) Voltage traces of the neuron for three values of Vs (2 mV, 6 mV, 10 mV) in response to a constant current (0.77 mV, 0.98 mV, 0.77 mV). Note the changes in the concavity of the voltage traces between the spikes when Vs increases. (B) The PRF, Z, for the same values of Vs as in A. For Vs = 6 mV = VT/2, the phase response is symmetric around φ = 1/2. For Vs = 2 mV (resp. Vs = 10 mV), the response is skewed toward the left (resp. the right).
membrane potentials of the soma, V, and of the dendrite, Vd, follow the equations:
\[
C \frac{dV}{dt} = -I_L - I_{Na} - I_{NaP} - I_K + I_{ext} - g_c(V - V_d), \tag{2.6}
\]
\[
C \frac{dV_d}{dt} = -g_{ld}(V_d - V_{ld}) - g_c(V_d - V), \tag{2.7}
\]
where gc is the coupling conductance between the two compartments, gld is the leak conductance of the dendritic compartment, and Iext is an external current.

2.1.2 Inhibitory and Electrical Synaptic Couplings.

Inhibitory interactions are taken into account in the QIF model by adding to the right-hand side of the dynamical equation for v, equation 2.1, an additional current of the
form
\[
I_{inh}(t) = g_{inh}\, s(t). \tag{2.8}
\]
Here, ginh is a negative constant parameter measuring the strength of the inhibitory coupling. The function s(t) is defined by
\[
s(t) = \sum_{\mathrm{spikes}} f(t - t_{spike}), \tag{2.9}
\]
where the summation extends over all the spikes fired by the presynaptic neuron at times t_spike prior to time t, and
\[
f(t) = \frac{1}{\tau_1 - \tau_2}\left(e^{-t/\tau_1} - e^{-t/\tau_2}\right)\Theta(t), \tag{2.10}
\]
with τ1 and τ2 the rise and decay times of the synapse, respectively, and Θ the Heaviside step function: Θ(t) = 0 for t < 0 and Θ(t) = 1 otherwise. Note that the integral of f(t) over time is normalized to 1. Each neuron receives s(t) from the other neuron. For simplicity, we limit our study of the conductance-based model to the case of inhibitory synapses located on the soma. Therefore, we incorporate inhibition in this model by adding to the right-hand side of the equation for V in equation 2.6 a current of the form
\[
I_{inh}(t) = -g_{inh}\, s(t)\,\big(V(t) - V_{inh}\big), \tag{2.11}
\]
where ginh > 0 is a constant conductance, Vinh is the reversal potential of the synapse, and s is the synaptic conductance dynamics defined in appendix A. The reversal potential of the inhibition is Vinh = −75 mV throughout the article. An electrical synapse between two neurons induces a synaptic current proportional to the difference of their membrane potentials. In both the QIF and the conductance-based model, the effect on neuron i of an electrical synapse connecting neuron j and neuron i is taken into account by adding to the right-hand side of the dynamical equation of neuron i a current of the form
\[
I_{gap} = -g_{gap}(V_i - V_j). \tag{2.12}
\]
In the QIF model, Vi and Vj are the membrane potentials of the neurons defined by equations 2.1 and 2.5. In the conductance-based model, Vi and Vj stand for the potentials of the somatic or the dendritic compartments, depending on the locations of the electrical synapses.
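The synaptic variable of equations 2.9 and 2.10 can be sketched numerically; the snippet below is an illustrative implementation of ours, with τ1 and τ2 as in Figure 2 and made-up spike times. It also verifies that the kernel f(t) integrates to 1, as stated in the text.

```python
import numpy as np

# Sketch: the synaptic variable of equations 2.9-2.10.
# tau1, tau2 follow Figure 2; the spike times below are illustrative.

def f_kernel(t, tau1=1.0, tau2=6.0):
    """Normalized double-exponential kernel, equation 2.10 (times in ms)."""
    theta = (t >= 0).astype(float)            # Heaviside step Theta(t)
    return theta * (np.exp(-t / tau1) - np.exp(-t / tau2)) / (tau1 - tau2)

def s_of_t(t, spike_times, tau1=1.0, tau2=6.0):
    """Equation 2.9: sum of the kernel over presynaptic spike times."""
    return sum(f_kernel(t - ts, tau1, tau2) for ts in spike_times)

# The kernel integrates to 1 over time (trapezoid rule on a fine grid).
t = np.linspace(0.0, 200.0, 200001)
y = f_kernel(t)
area = float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(t)))
trace = s_of_t(np.linspace(0.0, 30.0, 3001), spike_times=[0.0, 5.0, 20.0])
```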
2.1.3 Noise.

Stochasticity is introduced in the dynamics of the QIF model by adding a gaussian white noise current, Inoise (zero mean, standard deviation σ), to the right-hand side of equation 2.1. In the conductance-based model, noise is incorporated in a similar way in the dynamics of the somatic compartment.

2.2 Cross-Correlations of Spike Trains.

We characterize the synchrony properties of a pair of neurons by the CCs of their activity. Given the times at which neuron i = 1, 2 fires action potentials, one can define a variable Si(t) by
\[
S_i(t) = 1/\delta \tag{2.13}
\]
if neuron i has fired a spike in a time bin of size δ about time t, and Si(t) = 0 otherwise. For a sufficiently small δ, the time average of Si, ⟨Si(t)⟩, is the average firing rate of neuron i. The normalized CC of the spike trains of the two neurons is defined as
\[
C(\tau) = \frac{\langle S_1(t)\, S_2(t+\tau)\rangle}{\langle S_1(t)\rangle\,\langle S_2(t)\rangle}. \tag{2.14}
\]
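In practice, this estimator can be computed from binned spike trains; the sketch below is ours, with an illustrative bin size δ and maximum lag (the article does not specify an implementation).

```python
import numpy as np

# Sketch: estimating the normalized CC of equation 2.14 from spike times.
# Bin size delta and max_lag are illustrative choices.

def normalized_cc(spikes1, spikes2, duration, delta=1.0, max_lag=30.0):
    edges = np.arange(0.0, duration + delta, delta)
    S1 = np.histogram(spikes1, edges)[0] / delta   # S_i(t), equation 2.13
    S2 = np.histogram(spikes2, edges)[0] / delta
    r1, r2 = S1.mean(), S2.mean()                  # <S_i(t)>: mean firing rates
    ks = np.arange(-int(max_lag / delta), int(max_lag / delta) + 1)
    cc = []
    for k in ks:
        if k >= 0:                                 # <S1(t) S2(t + k*delta)>
            a, b = S1[:len(S1) - k], S2[k:]
        else:
            a, b = S1[-k:], S2[:len(S2) + k]
        cc.append(np.mean(a * b) / (r1 * r2))
    return ks * delta, np.array(cc)

# Two identical periodic 50 Hz trains give a CC peaked at tau = 0.
spikes = np.arange(0.0, 1000.0, 20.0)
lags, cc = normalized_cc(spikes, spikes, duration=1000.0)
```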
C(τ) represents the probability density that neuron 2 fires a spike in some bin of size δ at a delay τ after a spike of neuron 1.

2.2.1 Synchrony and Antisynchrony of a Pair of Neurons.

A flat CC indicates that the two neurons fire independently, whereas a peak in the CC at some time shift indicates that the neurons have an increased probability of firing with that constant time delay. In particular, synchronous firing corresponds to an increased probability of the two neurons firing simultaneously, that is, to a peak of the CC at τ = 0. The model neurons considered in this article fire periodic trains of action potentials in response to a constant input current. Noise in the input decorrelates the periodic firing of action potentials. If the noise is not overly strong, this decorrelation requires several action potentials, and if the neurons fire with some correlation, the CCs display a few oscillations with a period equal to the average firing period of the neurons, T. We say that the two neurons tend to fire in antisynchrony if the CC of their activity displays a peak at τ = ±T/2.

2.3 Reduction to a Phase Model.

2.3.1 The Phase Dynamics of a Pair of Neurons.

The study of the dynamics of a pair of interacting oscillatory neurons is greatly simplified if one assumes that their cellular properties are not too different, the noise they receive is small, and the coupling between them is weak. Under these assumptions, the dynamics of the neurons can be described in terms of two
phase variables, one for each neuron. In this reduced model, interactions between the neurons are taken into account by effective coupling functions that depend on their phase difference (Kuramoto, 1984; Ermentrout & Kopell, 1986; Hansel et al., 1993, 1995; van Vreeswijk et al., 1994), and the dynamics of the phases of the two neurons are given by the coupled differential equations
\[
\frac{d\phi_1}{dt} = f_1 + g_{inh}\,\Gamma_{inh}(\phi_1 - \phi_2) + g_{gap}\,\Gamma_{gap}(\phi_1 - \phi_2) + \eta_1(t), \tag{2.15}
\]
\[
\frac{d\phi_2}{dt} = f_2 + g_{inh}\,\Gamma_{inh}(\phi_2 - \phi_1) + g_{gap}\,\Gamma_{gap}(\phi_2 - \phi_1) + \eta_2(t), \tag{2.16}
\]
where f1 and f2 are the frequencies of the neurons when decoupled. In the following, we consider the homogeneous case: f1 = f2 = f. The phase couplings Γinh and Γgap, which correspond to the chemical and electrical interactions, respectively, depend on the phase response function (PRF) of the neuron, Z(φ). They can be computed from the convolution integrals, with φ between 0 and 1:
\[
\Gamma_{inh}(\phi) = \int_0^1 Z(u)\, s(u - \phi)\, du, \tag{2.17}
\]
\[
\Gamma_{gap}(\phi) = \int_0^1 Z(u)\,\big(V(u) - V(u - \phi)\big)\, du, \tag{2.18}
\]
where
\[
s(\phi) = \frac{1}{\tau_1 - \tau_2}\left(\frac{e^{-\phi/(f\tau_1)}}{1 - e^{-1/(f\tau_1)}} - \frac{e^{-\phi/(f\tau_2)}}{1 - e^{-1/(f\tau_2)}}\right) \tag{2.19}
\]
and V(φ) is the membrane potential expressed as a function of the phase variable. The functions Γgap(φ), Γinh(φ), and Z(φ) are periodic with period 1. The terms η1(t) and η2(t) in equations 2.15 and 2.16 are the effective noise on the phase dynamics that corresponds to the noise in the input received by the neurons. Assuming white gaussian input noise with zero mean and standard deviation σ, and generalizing the approach developed by Kuramoto (1991) in the context of the LIF model, one can show that for the QIF model, η1(t) and η2(t) are white gaussian noises with zero mean and standard deviation σφ:
\[
\sigma_\phi^2 = \sigma^2 \int_0^1 [Z(u)]^2\, du. \tag{2.20}
\]
Therefore, the effective noise in the phase dynamics of the QIF neuron is proportional to the root-mean-square of its PRF over one period.
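Equations 2.15 and 2.16 can be integrated with the Euler-Maruyama scheme. The sketch below is illustrative: it uses Γ(φ) = −sin(2πφ) as a stand-in for both Γinh and Γgap (which would otherwise have to be computed from equations 2.17 and 2.18), and made-up coupling and noise values.

```python
import numpy as np

# Sketch: Euler-Maruyama integration of equations 2.15-2.16.
# Gamma below is an illustrative stand-in for Gamma_inh and Gamma_gap;
# f, g_inh, g_gap, and sigma_phi are illustrative values.

rng = np.random.default_rng(0)

def simulate_pair(f=0.05, g_inh=0.02, g_gap=0.01, sigma_phi=0.02,
                  dt=0.05, steps=50_000):
    Gamma = lambda dphi: -np.sin(2.0 * np.pi * dphi)
    phi1, phi2 = 0.3, 0.0
    sq = sigma_phi * np.sqrt(dt)
    for _ in range(steps):
        n1, n2 = rng.normal(0.0, 1.0, 2)
        d12, d21 = phi1 - phi2, phi2 - phi1
        phi1 += dt * (f + g_inh * Gamma(d12) + g_gap * Gamma(d12)) + sq * n1
        phi2 += dt * (f + g_inh * Gamma(d21) + g_gap * Gamma(d21)) + sq * n2
    return (phi1 - phi2) % 1.0

# With this odd Gamma, Gamma^-(Delta) = 2*Gamma(Delta) has a stable zero at
# Delta = 0, so the phase shift settles (and fluctuates) near 0 (mod 1).
delta_final = simulate_pair()
```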
2.3.2 Analytical Expression for the CC of a Pair of Neurons in the Weak Coupling and Weak Noise Limit.

In the weak coupling, weak noise limit, the time elapsed at time t since the last spike fired by neuron i is simply the firing period, T, multiplied by the phase φi(t) computed modulo 1, mod(φi(t), 1). Similarly, the time elapsed between the last spikes fired by neurons i and j is τ = T mod(φi(t) − φj(t), 1). In the weak coupling limit, the CC and the probability distribution function of the phase shifts, P0(Δ), are related by
\[
C(\tau) = P_0\!\left(\frac{\tau}{T}\right). \tag{2.21}
\]
A more detailed proof of this equation is given in appendix B. Using equations 2.15 and 2.16, one finds that Δ = φ1 − φ2 satisfies the Langevin equation
\[
\frac{d\Delta}{dt} = \Gamma^{-}(\Delta) + \eta(t), \tag{2.22}
\]
where we have assumed that the two neurons fire at the same average firing rate. We have defined η(t) ≡ η1(t) − η2(t) and Γ⁻(Δ) ≡ Γ(Δ) − Γ(−Δ), with
\[
\Gamma(\Delta) = g_{inh}\,\Gamma_{inh}(\Delta) + g_{gap}\,\Gamma_{gap}(\Delta). \tag{2.23}
\]
The noise η(t) is the difference of two gaussian white noises with zero mean and standard deviation σφ. Therefore, it is a gaussian white noise with zero mean and standard deviation √2 σφ. Hence, the phase-shift distribution, P(Δ, t), satisfies a Fokker-Planck equation (van Kampen, 1981):
\[
\frac{\partial P(\Delta,t)}{\partial t} = \sigma_\phi^2\, \frac{\partial^2 P(\Delta,t)}{\partial \Delta^2} - \frac{\partial}{\partial \Delta}\big[\Gamma^{-}(\Delta)\, P(\Delta,t)\big]. \tag{2.24}
\]
The stationary phase-shift distribution, P0, is the solution of this equation that satisfies the additional constraint
\[
\frac{\partial P_0(\Delta, t)}{\partial t} = 0, \tag{2.25}
\]
that is,
\[
P_0(\Delta) = e^{G(\Delta)}\left[A \int_0^{\Delta} e^{-G(\Delta')}\, d\Delta' + B\right], \tag{2.26}
\]
where
\[
G(\Delta) = \frac{1}{\sigma_\phi^2}\int_0^{\Delta} \Gamma^{-}(\Delta')\, d\Delta' \tag{2.27}
\]
and A and B are two integration constants, which are determined by the periodicity of P0, that is, P0(0) = P0(1), and by its normalization, ∫₀¹ P0(Δ) dΔ = 1. One finds
\[
P_0(\Delta) = \frac{e^{G(\Delta)}}{\int_0^1 e^{G(\Delta')}\, d\Delta'}. \tag{2.28}
\]
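Equations 2.27 and 2.28 are straightforward to evaluate numerically for any given Γ⁻. The sketch below is ours and uses the illustrative odd function Γ⁻(Δ) = −sin(2πΔ), for which the stable fixed point is Δ = 0; the value of σφ is also illustrative.

```python
import numpy as np

# Sketch: evaluating equations 2.27-2.28 for an illustrative odd coupling
# function Gamma^-(Delta) = -sin(2*pi*Delta) and an illustrative sigma_phi.

def stationary_p0(gamma_minus, sigma_phi, n=2001):
    D = np.linspace(0.0, 1.0, n)
    dD = D[1] - D[0]
    g = gamma_minus(D) / sigma_phi**2
    # G(Delta), equation 2.27, by cumulative trapezoid integration
    G = np.concatenate(([0.0], np.cumsum(0.5 * (g[1:] + g[:-1]) * dD)))
    w = np.exp(G - G.max())                          # rescaled for stability
    norm = float(np.sum(0.5 * (w[1:] + w[:-1]) * dD))  # int_0^1 e^G
    return D, w / norm                               # equation 2.28

D, P0 = stationary_p0(lambda d: -np.sin(2.0 * np.pi * d), sigma_phi=0.3)
```

Here Γ⁻ vanishes at Δ = 0 and Δ = 1/2 with Γ⁻ decreasing through 0, so P0 is maximal at Δ = 0 (equivalently 1) and minimal at Δ = 1/2, as equations 2.29 and 2.30 below state.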
One can show that P0 has extrema at the phase shifts that are solutions of
\[
\Gamma^{-}(\Delta) = 0. \tag{2.29}
\]
The solutions for which
\[
\frac{d\Gamma^{-}}{d\Delta}(\Delta) \equiv \Gamma^{-\prime}(\Delta) < 0 \quad (\text{resp.} > 0) \tag{2.30}
\]
are maxima (resp. minima) of P0. The function Γ⁻ is periodic with period 1. Therefore, Δ = 0, ±1/2 are always solutions of equation 2.29. Other solutions may also exist. Equations 2.29 and 2.30 can be easily interpreted. Indeed, in the absence of noise, when time goes to infinity, the two neurons phase-lock, and the possible phase shifts are the stable fixed points of equation 2.22 with η = 0. It is straightforward to see that these phase shifts are the solutions of equation 2.29 and that equation 2.30 is the condition for their stability. In the presence of noise, the phase shifts between the two neurons have an increased (resp. decreased) probability density of lying near the stable (resp. unstable) fixed points of the noiseless dynamics. In the limit σ → 0, P0(Δ) is nonnegligible only for values of Δ in the vicinity of the maxima of G(Δ). For chemical synapses, G(Δ) is continuously differentiable at least up to second order. Expanding G around its maxima, one finds that P0 is a sum of gaussians centered around these points, with widths proportional to 1/√|Γ⁻′(Δ)|. Therefore, the greater the stability of a fixed point in the absence of noise, the sharper the probability distribution around it when noise is present. This is also true for electrical synapses for Δ ≠ 0. For Δ = 0, the phase-shift distribution of the QIF model has a cusp. This is because in the QIF model the PRF is discontinuous at Δ = 0 and because we have modeled the spikes as a δ-function added to the subthreshold voltage. Note that the phase-shift distribution depends on the couplings and the noise via the ratios ggap/σ² and ginh/σ². However, the locations of the extrema of the phase-shift distribution depend on the couplings only via the ratio ggap/ginh and are independent of the noise.

3 Cross-Correlations in the Quadratic Integrate-and-Fire Model

The QIF model is defined in section 2. Its dynamics are one-dimensional, with the temporal evolution of the voltage in its subthreshold range given by
one first-order differential equation. This equation is supplemented with a reset condition whenever the voltage reaches a threshold. In response to a suprathreshold step of current, the neuron fires spikes periodically. The QIF model is reminiscent of the more standard LIF model. However, in contrast with that model, the QIF dynamics of the voltage between spikes are nonlinear. As a consequence of the quadratic nonlinearity in the dynamics, the second temporal derivative of the voltage of the QIF neuron vanishes during the interspike interval at a time that depends on the parameter Vs (as shown in section 2). This parameter also controls the excitability of the neuron (see below). As we will see, this gives rise, as Vs varies, to a variety of synchronization regimes that do not occur in the LIF model but do occur in a conductance-based model with dynamics that are more faithful to neuronal biophysics. The goal of this section is to identify these regimes in the QIF model. We consider three situations: a pair of neurons interacting solely via inhibition, a pair of neurons interacting solely via electrical coupling, and dual coupling, in which the neurons interact via both types of synapses. We show that similar regimes are found in biophysically more realistic models. Because of the simplicity of the QIF model, many aspects of the dynamics of QIF networks (Hansel & Mato, 2003; Pfeuty et al., 2003) can be investigated using analytical techniques. In the case of a pair of neurons, an analytical formula can be established for their CC, assuming weak coupling and weakly noisy inputs (see equation 2.28). Using this formula, we conducted a systematic study of the ways in which the CC of the activity of a pair of QIF neurons depends on Vs, the firing frequency, and the nature of the neuronal coupling. We present the results in the form of phase diagrams delimiting the regions of parameters that correspond to qualitatively different shapes of the CCs.
Subsequently, we use numerical simulations to show that the results thus established hold for a broad range of coupling strengths and to investigate new behaviors that may occur beyond this range.

3.1 Weakly Coupled Quadratic Integrate-and-Fire Neurons.

3.1.1 The Phase Response Function.

The PRF of the QIF model can be computed analytically. It is given by (Ermentrout, 1981; Kuramoto, 1991; Pfeuty et al., 2003)
\[
Z(\phi) = \left[\frac{dv(\phi)}{d\phi}\right]^{-1}, \tag{3.1}
\]
where φ = t/T, with T given by equation 2.4, and
\[
v(\phi) = \sqrt{(I_{ext}-I_c)/A}\,\tan\!\left(\frac{\phi\, T \sqrt{A(I_{ext}-I_c)}}{\tau_m} + \alpha\right) + V_s. \tag{3.2}
\]
Using equation 2.1, one finds
\[
Z(\phi) = \frac{\tau_m}{A\,[v(\phi) - V_s]^2 + I_{ext} - I_c}. \tag{3.3}
\]
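Equations 3.2 and 3.3 can be evaluated directly. In the sketch below (ours, not part of the article), the parameters follow Figure 1 (Vr = 0, VT = 12 mV) with an illustrative Iext; it reproduces the symmetry of the PRF for Vs = (Vr + VT)/2 noted below.

```python
import numpy as np

# Sketch: the PRF of the QIF neuron, equations 3.2-3.3, for the parameters
# of Figure 1 (V_r = 0, V_T = 12 mV). I_ext is an illustrative value.

def qif_prf(V_s, I_ext, tau_m=10.0, A=0.25, I_c=0.0, V_r=0.0, V_T=12.0,
            n=1001):
    q = np.sqrt((I_ext - I_c) / A)
    alpha = np.arctan((V_r - V_s) / q)
    T = tau_m / (A * q) * (np.arctan((V_T - V_s) / q) - alpha)   # eq. 2.4
    phi = np.linspace(0.0, 1.0, n)
    v = q * np.tan(phi * T * np.sqrt(A * (I_ext - I_c)) / tau_m + alpha) + V_s
    Z = tau_m / (A * (v - V_s) ** 2 + I_ext - I_c)               # eq. 3.3
    return phi, Z

# For V_s = (V_r + V_T)/2 the PRF is symmetric with its maximum at phi = 1/2.
phi, Z = qif_prf(V_s=6.0, I_ext=1.0)
```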
Clearly Z(φ) > 0 for all φ. It is nonmonotonic and has one maximum, Z(φm), at some φm that depends on the frequency f and on Vs (for fixed Vr and VT). For Vs < (Vr + VT)/2, φm is a decreasing function of f, with φm → 1/2 in the limit f → 0 and φm → 0 in the limit f → ∞. For Vs > (VT + Vr)/2, φm increases with f; again φm → 1/2 in the limit f → 0, but φm → 1 for f → ∞. For Vs = (VT + Vr)/2, φm = 1/2, independent of f. Examples of the PRF, Z(φ), are plotted in Figure 1B for three values of Vs. An analytical formula for the CC of a pair of QIF neurons, equation 2.28, can be derived assuming weak coupling and weak noise. The results presented in this section were derived using this formula to compute the CCs.

3.1.2 The Three Generic Shapes of the CC for Inhibitory Coupling.

Three examples of CCs computed using equation 2.28 are plotted in Figures 2A through 2C. The first example, displayed in Figure 2A, is for a high firing rate, 80 Hz, and a small Vs = 0.4 mV. In this case, C(0) and C(±T/2) are local minima, and C is maximum at time shifts ±τ, with τ ≈ T/6. Figures 2B and 2C are for two other values of Vs, Vs = 11.2 mV and Vs = 8 mV, and the same firing rate, f = 20 Hz. In both cases, C(τ) has a local maximum at τ = 0, but for Vs = 11.2 mV, the amplitude of this maximum is extremely small; it is much more pronounced for Vs = 8 mV. For Vs = 11.2 mV, C is also maximum at τ = ±T/2. Therefore, the pattern of firing of the neuronal pair is different in the two cases. For Vs = 8 mV, the neurons fire most of the time in synchrony, whereas for Vs = 11.2 mV, the neurons fire most of the time in antisynchrony. The three shapes of the CCs described in Figure 2 are typical. A phase diagram showing the domain of parameters corresponding to each shape can be computed. It is plotted in Figure 2D. In the dark gray region, the CC has only one maximum, located at τ = 0.
For large Vs and sufficiently small frequencies (gray-and-white striped region), the CC displays another peak at τ = ±T/2. However, C(0) decreases sharply below the boundary of the gray-and-white striped region. In most of this region, C(±T/2) is much larger than C(0); that is, the neurons are more likely to fire in antisynchrony than in synchrony. In contrast, for a small Vs and a large frequency (light gray region), the CC has two maxima, symmetrical around 0, at ±τ with 0 < τ < T/2. Figure 2E displays the CC against Vs and τ for a fixed frequency, f = 20 Hz. It shows that at this frequency, the CC is maximum around τ = 0 but broad at small Vs. The peak becomes more pronounced as Vs increases. However, when Vs is too large, the amplitude of this peak decreases, whereas a peak around τ = ±T/2 appears.
Figure 2: Cross-correlations of a pair of QIF neurons interacting via inhibitory synapses. (A–C) CCs for three combinations of Vs and f, with ginh/σ² = 20. The synaptic time constants are τ1 = 1 msec, τ2 = 6 msec. (A) Vs = 0.4 mV, f = 80 Hz. The CC is peaked at ±T/6. (B) Vs = 8 mV, f = 20 Hz. The CC has only one peak, located at τ = 0. (C) Vs = 11.2 mV, f = 20 Hz. The CC is peaked at τ = ±T/2 (antiphase). It also exhibits a small peak at τ = 0 (see inset). (D) The phase diagram. Regions in which the CCs have different shapes have different gray codes. Dark gray: CCs with a peak at τ = 0. White: CCs with a peak at τ = ±T/2. Light gray: CCs with peaks at ±τ with 0 < τ < T/2. Dark gray and white: CCs with a peak at τ = 0 and at τ = ±T/2. (E) The CCs as a function of Vs for f = 20 Hz.
The phase diagram in Figure 2 was computed for τ1 = 1 msec, τ2 = 6 msec. Similar phase diagrams are found for other values of τ1 and τ2, but the positions of the transition lines are different. The slower the inhibition, the smaller the region where the CC displays a peak around τ = ±T/2 (gray-and-white striped region) and the larger the region where the CC has maxima at ±τ with 0 < τ < T/2 (light gray region).

3.1.3 Electrical Couplings Mediate Synchrony or Antisynchrony Depending on Vs and the Firing Frequency.

A detailed study of C(τ) for a fixed frequency, f = 50 Hz, shows that when Vs increases from Vs = Vr ≡ 0 to Vs = VT ≡
12 mV, the shape of the CC undergoes a continuous transformation. This is shown in Figure 3E, where the CC is plotted against Vs and τ for a fixed frequency, f = 50 Hz. For a sufficiently small Vs, the CC is peaked around τ = ±T/2 and has a trough at τ = 0. When Vs increases, the trough persists, but when Vs becomes sufficiently large, the CC is no longer maximum at τ = ±T/2. The two maxima, symmetrical around 0, are then located at some intermediate time delay, which decreases as Vs increases. The trough at τ = 0 is also reduced, and the distribution becomes flatter as Vs increases. For some value of Vs, the two maxima merge, and the CC becomes unimodal with its peak at τ = 0. The CC keeps this shape, becoming sharper and then again flatter, as Vs increases further. Therefore, for f = 50 Hz, the CC can have three qualitatively different shapes as Vs changes. Examples are plotted in Figures 3A through 3C. The phase diagram plotted in Figure 3D as a function of Vs and f shows that the three types of CCs described above for f = 50 Hz are also found for other values of the firing rate. It also shows that for sufficiently small frequencies, a fourth type of CC is found for Vs sufficiently close to VT: it is maximum at τ = 0 but also at τ = ±T/2. The closer Vs is to VT, the larger the amplitude of the CC near τ = ±T/2 (not shown). The interactions induced by electrical synapses are diffusive. In contrast to chemical synapses, the synaptic currents they induce do not have intrinsic kinetics but depend on the membrane potential time courses of the pre- and postsynaptic neurons. Consequently, for electrical synapses, the current depends on the size and the shape of the action potential. In our model, this means that the amplitude of the spike, θ, may affect the synchronization properties of the neurons. In fact, we found that the phase diagram is affected only quantitatively when θ is changed.
When θ increases, the gray-and-white striped region shrinks and the white region expands. However, it can be shown that the white region always remains confined to the left half of the phase diagram, where Vs < 6 mV.

3.1.4 Three Modes of Interplay Between Electrical and Inhibitory Synapses.

The results of the previous sections demonstrate that the CCs of a pair of QIF neurons coupled via inhibition or electrical synapses depend crucially on the parameter Vs. Moreover, comparing the phase diagram in Figure 3 with the one in Figure 2 reveals that these two types of synapses may have similar or opposing effects on synchrony, depending on Vs and f. Therefore, it is likely that when combined, the two types of synapses can either cooperate in promoting synchrony or be antagonistic, depending on the parameters. Using equations 2.28, 2.27, and 2.23, one can see that the number of maxima and minima of C(τ) and their locations depend on the coupling constants only via their ratio, ggap/ginh. However, other properties of C, for example, the amplitudes of the extrema and their widths, depend on the absolute values of ggap and ginh.
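This invariance of the extrema locations under a common scaling of ginh and ggap follows from equation 2.23 and can be illustrated numerically. In the sketch below (ours), the component functions are illustrative stand-ins, not the Γ functions computed from equations 2.17 and 2.18.

```python
import numpy as np

# Sketch: the zeros of Gamma^-(Delta), which set the extrema locations of
# C (equations 2.28-2.29), are unchanged when g_inh and g_gap are scaled
# by a common factor. The component functions are illustrative stand-ins.

def gamma_minus(D, g_inh, g_gap):
    inh = -np.sin(2.0 * np.pi * D)                                  # stand-in
    gap = -np.sin(2.0 * np.pi * D) + 0.5 * np.sin(4.0 * np.pi * D)  # stand-in
    return g_inh * inh + g_gap * gap

def zero_crossings(g_inh, g_gap, n=20001):
    D = np.linspace(0.0, 1.0, n)
    y = gamma_minus(D, g_inh, g_gap)
    idx = np.where(np.diff(np.sign(y)) != 0)[0]
    return D[idx]

z1 = zero_crossings(g_inh=1.0, g_gap=0.5)
z2 = zero_crossings(g_inh=2.0, g_gap=1.0)   # same ratio, doubled strengths
```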
Figure 3: CCs for a pair of QIF neurons interacting via electrical synapses. (A–C) CCs for three values of Vs at f = 50 Hz, with ggap/σ² = 10. (A) Vs = 2 mV. The CC is peaked at ±T/2. (B) Vs = 4 mV. The CC is peaked at τ = ±T/6. (C) Vs = 8 mV. The CC is peaked at τ = 0. (D) The phase diagram. The codes are as in Figure 2. (E) Plot of the CCs as a function of Vs and τ/T for f = 50 Hz.
We first investigate how the combination of these two types of synapses affects the extrema of C and the corresponding phase diagram as a function of f and Vs. As the ratio ggap/ginh varies from 0 to ∞, the phase diagram gradually changes from the one in Figure 3D to the one in Figure 2D. For example, Figure 4A plots the phase diagram for ggap/ginh = 1/2. The codes have the same meaning as in Figures 2D and 3D. Comparing Figures 4A and 3D shows that adding inhibitory synapses to electrical synapses reduces the size of the region of antisynchronous firing (white region) and increases the domain of intermediate phase shifts (light gray region) as well as the domain of bistability (gray-and-white striped region). Comparing Figure 4A with Figure 2D, one sees that adding electrical synapses to inhibitory synapses reduces the size of the domain of synchronous firing (dark gray region). This
Figure 4: A pair of QIF neurons interacting via dual coupling. The coupling ratio ggap /ginh = 1/2. (A) Phase diagram. The parameters and the codes are the same as in Figures 2D and 3D. Note the small region at large Vs and small frequency in which the CCs have a peak at τ = 0 and at τ = ±T/2. (B) The CC as a function of Vs for f = 50 Hz.
also decreases the size of the bistability domain (gray-and-white striped region). Figure 4B displays the CC as a function of Vs and τ for a fixed frequency, f = 50 Hz. The CCs for electrical synapses alone, dual synapses, and inhibition alone are compared (from left to right) in Figure 5 for different values of the parameters Vs and f. In the top row (Vs = 2 mV and f = 50 Hz), electrical synapses alone promote antisynchronous firing, whereas inhibition promotes synchronous firing. When the two are combined, the CC is maximum at some intermediate time delay (neither 0 nor ±T/2). Note also that the CC is less modulated for dual coupling than for electrical or inhibitory couplings alone. Therefore, for these values of Vs and f, starting from dual coupling, blocking inhibition dramatically increases the probability of antisynchronous firing while suppressing almost completely the probability of synchronous firing, whereas blocking electrical synapses improves synchronous firing. In the middle row (Vs = 8 mV, f = 50 Hz), both electrical synapses and inhibitory synapses promote synchronous firing. When combined, the synapses cooperate to increase further the probability of synchronous firing and to sharpen the CC around τ = 0. Therefore, in this case, blocking either electrical synapses or inhibition reduces the probability that the neurons fire in synchrony. In the bottom row (Vs = 11.2 mV and f = 20 Hz), electrical synapses alone promote synchrony, but with inhibition alone, the maxima of C are at τ = 0, ±T/2, with C(±T/2) much larger than C(0) (in fact, the maximum at τ = 0 cannot be seen in the figure because of its scale). For dual
Figure 5: Different regimes of interplay of electrical and inhibitory couplings for QIF neurons. (Left column) Electrical synapses (same parameters as in Figure 3). (Middle column) Dual coupling. (Right column) Inhibitory synapses (same parameters as in Figure 2). (Top row) Vs = 2 mV, f = 50 Hz. Antagonistic effects of electrical and inhibitory synapses on synchrony: electrical synapses impede synchrony, but inhibitory synapses promote it. (Middle row) Vs = 8 mV, f = 50 Hz. Electrical and inhibitory synapses cooperate. (Bottom row) Vs = 11.2 mV, f = 20 Hz. Antagonistic effects of electrical and inhibitory synapses on synchrony: electrical synapses promote synchrony, but inhibitory synapses impede it.
coupling, the CC is maximal for τ = 0, ±T/2, with a larger correlation at τ = 0 than at τ = ±T/2. However, C(0) is substantially smaller than in the case of electrical coupling alone. Hence, in this case, starting with neurons interacting via dual coupling, blocking inhibition would increase the probability of synchronous firing, whereas blocking electrical synapses would reduce it substantially and increase the probability of antisynchronous firing.
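The way the two individual CCs combine under dual coupling can be illustrated with a toy example. In the weak coupling limit, the CC for dual coupling is the normalized product of the CCs for each interaction alone (see section 5.2); the sketch below uses hypothetical von Mises-like CC shapes, not the model's actual correlograms:

```python
import numpy as np

tau = np.linspace(-0.5, 0.5, 201)  # time shift in units of the period T

def norm(c):
    """Normalize so that a flat (uncorrelated) CC equals 1."""
    return c / c.mean()

# Hypothetical individual CCs (illustrative shapes only):
C_inh = norm(np.exp(1.5 * np.cos(2 * np.pi * tau)))          # peak at tau = 0
C_gap = norm(np.exp(1.0 * np.cos(2 * np.pi * (tau - 0.5))))  # peaks at tau = +/- T/2

# Weak-coupling prediction for dual coupling: the normalized product.
C_dual = norm(C_inh * C_gap)

# The more strongly modulated interaction wins (the dual peak stays at 0),
# but the dual CC is less modulated than either CC alone.
modulation = C_dual.max() / C_dual.min()
```

In this toy case the competing peaks partially cancel, which is one way the reduced modulation of the dual-coupling CC can arise.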
In summary, the weak coupling theory yields three parameter regimes for the interplay of electrical and inhibitory synapses in synchrony. In one regime, the synapses exhibit cooperative interplay, both of them promoting synchrony. In the other regimes, the synapses act antagonistically. This may happen in two ways: because electrical synapses oppose synchrony but inhibitory synapses promote it, or because electrical synapses promote synchrony but the net effect of the inhibitory interactions opposes it. Note that these three modes of interplay stem from the dependence of the CCs on Vs for electrical or inhibitory synapses alone, combined with the fact that the function G that determines the CCs for dual coupling (equation 2.28) depends linearly on the two interactions.

3.2 Beyond Weak Coupling and Weak Noise: The Nonlinear Interplay Between Electrical and Inhibitory Synapses. The analytical formula for the CCs used above in the derivation of the phase diagrams was established assuming weakly coupled neurons and weakly noisy inputs. Do the effects of Vs on the CCs found in this limit for inhibitory or electrical synapses alone still persist when the coupling and the noise are not weak? In the weak coupling limit, electrical and inhibitory synapses interact in a linear manner when they are combined. What nonlinear effects occur at strong couplings? To answer these questions, we performed numerical simulations.

Figure 6 displays the CCs obtained in numerical simulations (circles) for two values of Vs, Vs = 2 mV and Vs = 8 mV, with various coupling strengths and three combinations of the interactions (electrical, inhibitory, and dual). The noise was also varied to keep the ratios ggap/σ2 and ginh/σ2 fixed to 10 and 20, respectively. This is because the weak coupling theory predicts that the CCs depend on σ2, ggap, and ginh only via these ratios, as can be seen from equations 2.21 and 2.28.
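The CCs shown in Figure 6 are histograms of spike-time differences. A minimal sketch of such an estimator (a hypothetical helper, not the authors' code), normalized so that uncorrelated spike trains give C ≈ 1:

```python
import numpy as np

def cross_correlogram(t1, t2, period, bins=41):
    """Normalized cross-correlogram of two spike trains.

    Spike-time differences are folded into (-period/2, period/2] and
    histogrammed; dividing by the mean count makes a flat (uncorrelated)
    correlogram equal to 1.
    """
    d = np.subtract.outer(t2, t1).ravel()
    d = (d + period / 2) % period - period / 2   # fold into one period
    counts, edges = np.histogram(d, bins=bins, range=(-period / 2, period / 2))
    centers = 0.5 * (edges[:-1] + edges[1:])
    return counts / counts.mean(), centers

# Two identical periodic trains: the correlogram peaks at tau = 0.
T = 20.0  # ms, i.e., f = 50 Hz
spikes = np.arange(0.0, 2000.0, T)
C, tau = cross_correlogram(spikes, spikes, T)
```

Applied to two identical periodic trains, the estimated CC peaks at τ = 0, as expected for synchronous firing.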
The solid lines in Figure 6 correspond to the CCs computed from the latter equations for the same values of these ratios. In the left column of Figure 6, which corresponds to relatively weak electrical and inhibitory couplings (postsynaptic potentials with amplitudes of about 0.2 mV), the simulations agree perfectly with the results from the weak coupling analysis. The middle column corresponds to coupling strengths four times larger than in the left column. The agreement remains excellent except for dual coupling at Vs = 2 mV. Indeed, in this case, the weak coupling theory predicts that the CC should display a trough at zero time shift and that the firing probability should be maximal close to T/8. In contrast, in the simulations, the CC is maximal at zero time shift. Results for a further increase of the coupling strength by a factor of 3 are displayed in the right column of Figure 6. This corresponds to postsynaptic potentials of a typical size of 2 mV. For Vs = 8 mV, weak coupling theory predicts synchronous firing for electrical synapses alone. Synchrony is also found in the simulations, but
Figure 6: CCs of a pair of QIF neurons for different coupling strengths. Circles and dashed lines: results from simulations (averaging over 100 sec of activity). Solid lines: predictions from the weak coupling theory. The strength of the electrical synapses is, from left to right, ggap = 0.0025, 0.01, and 0.03 mS/cm2, corresponding, respectively, to coupling coefficients of 0.02, 0.06, and 0.16. The strength of the inhibitory synapses is, from left to right, ginh = 0.005, 0.02, and 0.06 mS/cm2. The noise and the external current were adjusted to keep constant ggap/σ2 = 10 and ginh/σ2 = 20. The firing frequency is fixed (50 Hz). (A) Vs = 2 mV. (B) Vs = 8 mV.
the CC is much more peaked than predicted. This is due to direct threshold crossing, which is rare when the coupling is weak but contributes a great deal to the firing of the postsynaptic neuron for strong electrical coupling. For inhibitory coupling, the agreement between the theory and the simulations remains reasonable. Combining the two interactions sharpens the
CC, as predicted by the weak coupling theory, but the quantitative agreement is poor. For Vs = 2 mV, the agreement is excellent for inhibitory coupling. For electrical coupling, the theory predicts antisynchronous firing. However, in the simulations, the CC is flat: the probability of firing is almost homogeneous, as it would be without coupling. Remarkably, in spite of the fact that the electrical synapses are not able to induce synchrony on their own, they strongly amplify synchrony and sharpen the CC when they are added to the inhibitory synapses (compare the CCs in the middle row, right column). This nonlinear interplay between the two types of synapses requires strong couplings.

3.3 Predicting the Shape of the CCs from the Shape of the Phase Response Function. Our study of the QIF model in the weak coupling limit relies on the analytical formula, equation 2.28, which relates the CC to the synaptic couplings and the neuronal PRFs. The main effect of changing Vs on the PRF is to shift the location of its maximum, which varies continuously from 0 to 1 as Vs increases from Vr to VT (see Figure 1). Since equation 2.28 holds for any neuronal dynamics, our analysis of the QIF model suggests that, more generally, synchrony may be predicted from the shape of the PRF, independent of the details of the neuronal dynamics. These predictions can be summarized as follows. For electrical synapses, a pair of neurons with a PRF skewed to the right tends to fire synchronously, whereas neurons with a PRF sufficiently skewed to the left tend to fire antisynchronously. For inhibitory synapses, the CCs always have a peak at τ = 0 unless the PRF is too skewed to the left and the firing rate, f, of the neurons is too large. When the PRF is strongly skewed to the right and f is sufficiently small, the CC is also peaked around τ = ±T/2. The latter peaks may be much higher than the peak at τ = 0, indicating that the net effect of inhibition opposes synchrony.
The faster the synapses, the larger the domain of parameters in which this happens. Our simulations of the QIF model indicate that these predictions from the weak coupling theory remain relevant, at least qualitatively, within a broad range of coupling strengths.

For weak enough synapses, the interplay of electrical and inhibitory interactions is essentially linear. This property stems from the general formula, equation 2.28, and therefore is not restricted to the QIF model. Hence, we expect that for general neuronal dynamics, the shape of the CC for dual coupling can be predicted from the effect of each synapse taken separately, provided the interactions are not too strong. When the interactions are strong, nonlinear effects become significant. Nevertheless, knowing the PRF of the neurons can serve to qualitatively predict the interplay of the two types of interactions. For instance, from the analysis of the QIF model, we predict that for neurons with a PRF skewed to the left, adding strong electrical synapses to inhibitory synapses amplifies synchronous firing, although electrical synapses alone would not lead to synchrony. This effect
is a consequence of the rapid direct threshold crossing induced by strong electrical synapses when they act at a time when the postsynaptic neuron is not far from threshold. Therefore, one may expect that such a synergetic interplay of electrical and inhibitory interactions is a general effect that occurs in conductance-based models and in real neurons.

4 Cross-Correlations in a Conductance-Based Model

The results of the previous section suggest that the shape of the CC of a pair of neurons can be related to their intrinsic properties, provided one knows how ionic channels and neuronal morphology shape their PRF. In this section, we verify that this indeed holds in the framework of a two-compartment conductance-based model neuron. The somatic compartment incorporates one fast and one persistent sodium current and a delayed rectifier potassium current. The dendritic compartment is passive. Details of the model are given in section 2. To be specific, we focus on the effects of the persistent sodium current and of the dendritic compartment on the synchrony of the neurons. The results presented in this section lead to experimentally testable predictions, as explained in section 5.

4.1 Increasing gNaP in the Conductance-Based Model and Decreasing Vs in the QIF Model Have Similar Effects on the PRF. The effect of INaP on the voltage trace and the PRF of the neuron is shown in Figures 7A and 7B. For all the considered values of gNaP, Z(φ) is positive for all φ. It is also nonmonotonic and unimodal. The PRF is small just before, during, and just after the action potential. The position of the maximum depends on gNaP. For gNaP = 0, it lies in the second half of the period, at φ ≈ 2/3. As a result, the global shape of the PRF is skewed toward the second half of the period of the neuron. The response of the neuron increases with gNaP, especially in the first half of the period. This shifts the maximum of Z(φ) toward smaller φ.
For large enough gNaP, the PRF is skewed toward the left. The PRF also depends on the firing rate, f. For a given parameter set, the maximum of Z moves closer to 1/2 as f goes to 0; in that limit, the function Z becomes proportional to 1 − cos 2πφ (Ermentrout, 1996), as in the QIF model. Therefore, increasing gNaP in our conductance-based model and decreasing Vs in the QIF model affect the PRF of the neurons in a similar way. Hence, we may expect that they should also affect the CCs similarly. We show below that this is indeed the case.

4.2 The Effect of the Persistent Sodium Current on the Cross-Correlations. In this section, we assume that gc = 0 (in this case, the model has a single compartment). Figure 8 plots the CCs for three values of gNaP when the neurons are coupled solely via electrical synapses (left column), chemical synapses (right column), or both types of synapses (middle column). The synaptic conductances were ggap = 0.005 mS/cm2 and ginh = 0.02 mS/cm2.
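The PRFs used throughout were obtained by perturbing the neuron at different phases of its cycle and measuring the induced spike advance (see the caption of Figure 7). A sketch of this perturbation method applied to the QIF normal form dV/dt = V^2 + I (illustrative parameters; this is not the paper's conductance-based model):

```python
import numpy as np

def prf_by_perturbation(I=1.0, eps=0.5, n_phases=20, dt=1e-4):
    """PRF of the QIF normal form dV/dt = V^2 + I, estimated as the spike
    advance caused by a finite voltage kick delivered at a given phase
    (the perturbation method described for Figure 7)."""
    Vth, Vr = 100.0, -100.0  # large bounds approximate the reset from +inf to -inf

    def period(kick_time=None):
        V, t = Vr, 0.0
        kicked = kick_time is None
        while V < Vth:
            V += dt * (V * V + I)  # forward Euler
            t += dt
            if not kicked and t >= kick_time:
                V += eps           # deliver the kick once
                kicked = True
        return t

    T0 = period()
    phases = np.linspace(0.05, 0.95, n_phases)
    advances = np.array([T0 - period(p * T0) for p in phases])
    return phases, advances / eps  # spike advance per unit kick

phases, Z = prf_by_perturbation()
```

For this type I oscillator, the measured PRF is positive everywhere and peaks near mid-period, in line with the low-firing-rate limit discussed in section 4.1.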
Figure 7: Voltage traces and phase-response functions of the conductance-based model neuron are shaped by the persistent sodium current. The soma and the dendrite are decoupled (gc = 0). (A) Voltage traces. (B) PRFs. The PRFs were obtained by computing the delay of the spike times induced by small perturbations (see Hansel et al., 1993). The external current was adjusted to keep the firing rate constant at f = 50 Hz: the external currents are 1.1 µA/cm2, −0.3 µA/cm2, and −1.04 µA/cm2 for gNaP = 0, 0.2 mS/cm2, and 0.7 mS/cm2, respectively.
The corresponding spikelets and IPSPs are also shown in Figure 8A. Their peak amplitudes are about 0.3 mV and 0.4 mV, respectively. The results obtained for gNaP = 0.7 mS/cm2, f = 50 Hz are similar to those obtained for the QIF model with Vs = 2 mV and the same firing frequency (see Figure 5, top row). In both cases, electrical synapses promote antisynchrony, whereas inhibition alone promotes synchrony. For dual coupling, the two trends compete, and the peak of the CC around τ = 0 is very flat. When gNaP is reduced to 0.2 mS/cm2 with the frequency kept constant at f = 50 Hz, electrical or inhibitory synapses alone promote synchronous firing. For dual coupling, the two types of synapses cooperate, leading to more synchronous activity than when each of them is considered alone. This is similar to what was obtained in the QIF model for Vs = 8 mV, f = 50 Hz (see Figure 5, middle row). Reducing gNaP even further and also reducing the frequency (f = 20 Hz) leads to CCs (see Figure 8B, bottom row) that are similar to those for Vs = 11.2 mV and f = 20 Hz in the QIF model (bottom row in Figure 5). In both cases, electrical synapses alone favor synchronous firing. Adding inhibitory interactions substantially reduces the probability of the neurons' firing simultaneously. This is because for gNaP = 0, electrical and inhibitory synapses act antagonistically, since the latter promote antisynchrony (see Figure 8B, bottom row, right column). We also investigated the effect of gNaP on the CCs when the synapses are strong. Results are shown in Figure 9 for synaptic conductances six times
Figure 8: Effects of the persistent sodium current and the firing frequency on the CCs in the conductance-based model. (A) Postsynaptic potentials due to one presynaptic spike (for gNaP = 0). The soma and the dendrite are decoupled (gc = 0). (Left) Electrical coupling, ggap = 0.005 mS/cm2. (Middle) Dual coupling. (Right) Inhibitory coupling, ginh = 0.02 mS/cm2. (B) CCs computed over 100 sec of activity for three values of gNaP. The SD of the noise is σ = 0.15 mV/ms1/2. (Top row) gNaP = 0.7 mS/cm2, f = 50 Hz (Iext = −1.04 µA/cm2). Electrical and inhibitory synapses have antagonistic effects on synchrony: electrical synapses suppress inhibition-induced synchrony. (Middle row) gNaP = 0.2 mS/cm2, f = 50 Hz (Iext = −0.30 µA/cm2). Electrical and chemical synapses cooperate. (Bottom row) gNaP = 0, f = 20 Hz (Iext = 0.56 µA/cm2). Electrical and inhibitory synapses have antagonistic effects on synchrony: inhibitory synapses suppress synchrony, but electrical synapses promote it.
larger than in the results depicted in Figure 8, namely, for ggap = 0.03 mS/cm2 and ginh = 0.12 mS/cm2. The amplitude of the corresponding spikelets and inhibitory postsynaptic potentials is about 2.5 mV (see Figure 9A). The coupling coefficient for electrical synapses is 20%. This corresponds to synapses with strengths in the upper physiological range. Our numerical simulations confirmed that the crucial influence of the persistent sodium current on the CCs also occurs for such strong synapses. Indeed, for gNaP = 0.2 mS/cm2, the neurons fire in synchrony, and the CC has a peak at τ = 0 (see Figure 9C, left panel). In contrast, for gNaP = 0.7 mS/cm2 (see Figure 9B, left panel), the central peak of the CC is suppressed. Note, however, that the peak is sharper and larger in Figure 9C than in Figure 5B, and the peaks of the CC in Figure 9B are around τ = ±3T/8 and not at τ = ±T/2, as in the corresponding case for weaker coupling (see Figure 5). The right panels in Figures 9B and 9C show the CCs for strong inhibition. Reducing gNaP has a minor effect on the CC peak value, as in the corresponding case in Figure 5. However, in contrast to weak coupling, this also decreases the firing probability around τ = T/4 and increases it at τ = T/2. This did not happen at weaker coupling (see Figure 5). The most significant difference between the CCs at weak and strong couplings occurs when electrical and inhibitory interactions are combined and gNaP is sufficiently large. Indeed, in this case, the two types of synapses show synergy. As in the QIF model for small Vs, synchronous firing is completely suppressed with electrical synapses alone, but when they are added to inhibition, the peak of the CC at τ = 0 is amplified.

4.3 Synchrony of Neurons Coupled via Electrical Synapses Decreases When the Somato-Dendritic Coupling Increases. We now assume gNaP = 0 and discuss the effect of the dendrite on synchrony. Therefore, we now take gc ≠ 0.
In this case, perturbations acting on the somatic or the dendritic compartment affect the firing of the neuron differently. Therefore, two PRFs can be defined: one, Zsoma, for perturbations on the soma and the other, Zdendrite, for perturbations on the dendrite. These two functions are plotted in Figure 10A for our two-compartment model, assuming gNaP = 0, a somato-dendritic coupling conductance gc = 0.3 mS/cm2, and a firing frequency f = 50 Hz. The somatic PRF for gc = 0 and f = 50 Hz is also plotted in the same figure (dashed line). The coupling with the dendritic compartment affects the shape of the somatic PRF by skewing it toward the first half of the period (compare the somatic PRFs for gc = 0.3 mS/cm2 and gc = 0). This is because the voltage perturbation at the soma backpropagates to the dendrite and from there reverberates back to the soma. Therefore, the overall response to a perturbation at the soma is the sum of two contributions. The first one is the same as though the dendrite were absent. The second one, which is due to the dendrite, is a delayed and broadened version of the first one. Together, these
Figure 9: Nonlinear interplay between strong electrical and inhibitory synapses for conductance-based neurons. Synaptic conductances are ggap = 0.03 mS/cm2 and ginh = 0.12 mS/cm2. The SD of the noise is adjusted so that ggap/σ2 ≈ 1.1 and ginh/σ2 ≈ 4.4. The external current is such that f = 50 Hz. (A) Postsynaptic potential induced by a presynaptic action potential for (left to right) electrical coupling alone, dual coupling, and inhibitory coupling alone. (B) gNaP = 0.7 mS/cm2 (Iext = −1.04 µA/cm2). Electrical synapses alone reduce the probability of firing in-phase (top), but combined with inhibitory synapses they amplify synchronous firing (see right panels). (C) gNaP = 0.2 mS/cm2 (Iext = −0.30 µA/cm2). Electrical synapses strongly amplify the in-phase peak, with or without inhibitory synapses.
two contributions lead to a response that is stronger and extends earlier in the limit cycle. When the perturbation is on the dendrite, the PRF is skewed to the left, as shown in Figure 10A (solid line) compared to the somatic PRF (dash-dot line). This is because the dendritic compartment delays and filters the perturbation before it can affect the firing of action potentials in the somatic compartment.
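The delaying and filtering role of the passive dendrite can be sketched with two linear compartments (illustrative parameters loosely based on appendix A; the active currents are omitted):

```python
import numpy as np

def soma_deflection(kick_to, T=50.0, dt=0.01, gl=0.15, gc=0.3, C=1.0):
    """Somatic voltage deflection after a unit impulse delivered to the soma
    or to the dendrite, for two passive compartments coupled by gc
    (forward Euler; units loosely mS/cm^2, uF/cm^2, ms, mV)."""
    n = int(T / dt)
    Vs = np.zeros(n)
    Vd = np.zeros(n)
    (Vs if kick_to == "soma" else Vd)[0] = 1.0
    for i in range(n - 1):
        Vs[i + 1] = Vs[i] + dt / C * (-gl * Vs[i] - gc * (Vs[i] - Vd[i]))
        Vd[i + 1] = Vd[i] + dt / C * (-gl * Vd[i] - gc * (Vd[i] - Vs[i]))
    return Vs

direct = soma_deflection("soma")            # peaks immediately, then decays
via_dendrite = soma_deflection("dendrite")  # delayed and attenuated at the soma
```

The dendritic route reaches the soma later, attenuated, and broadened, which is why a perturbation arriving through the dendrite acts like a delayed, low-pass-filtered version of a direct somatic one.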
Figure 10: Effect of the dendritic compartment on the PRF and on the CCs. (A) The somatic PRF for gc = 0, Iext = 1.02 µA/cm2 (dashed line) and for gc = 0.3 mS/cm2 (dash-dot line); the dendritic PRF for gc = 0.3 mS/cm2, Iext = 1.50 µA/cm2 (solid line). f = 50 Hz in all cases. (B) The CC for somatic electrical synapses for gc = 0 (dashed line) and gc = 0.3 mS/cm2 (dash-dot line), and for dendritic synapses for gc = 0.3 mS/cm2 (solid line). The firing frequency is f = 50 Hz. (C) CCs with dendritic synapses for two values of the firing frequency: f = 50 Hz (solid line) and f = 10 Hz (circles).
Therefore, adding a dendritic compartment to the soma shapes the PRF similarly to reducing Vs in the QIF model or increasing gNaP in the single-compartment model. Hence, we expect that for electrical synapses located on the soma, synchrony decreases with gc. We also expect that neurons are less synchronized if the electrical synapses are on the dendrites than if they are on the soma. This is confirmed by the results displayed in Figure 10B, where the CCs of two neurons firing at 50 Hz are compared for somatic electrical synapses with gc = 0.3 mS/cm2 (dash-dot line) and gc = 0 (dashed line) and for dendritic synapses (solid line). In the first two cases, the CC is peaked around τ = 0. However, it is much broader and smaller for gc = 0.3 mS/cm2 than for gc = 0. Moving the synapses to the dendrites reduces the synchrony further. In the example shown in Figure 10B (solid line), it even leads to antisynchronous firing. The transition to antisynchrony induced by electrical
synapses when the dendritic compartment is added is basically similar to the transition to antisynchrony in the QIF model when Vs is reduced. As in the QIF model, this effect is firing rate dependent. This is shown in Figure 10C, where the CC is plotted for f = 50 Hz and f = 10 Hz. In the latter case, the CC, although broad, is maximal at τ = 0, in contrast with f = 50 Hz, where antisynchrony occurs.

5 Discussion

5.1 Relating CCs to Cellular and Synaptic Properties. We have shown that the CCs of the activity of a pair of tonically spiking neurons interacting via electrical and/or inhibitory synapses can be predicted if one knows the shape of their PRF. Subsequently, one can predict how cellular properties affect the CCs if one knows how the former shape the PRF. Strictly speaking, the PRF is defined only when neurons are weakly interacting. However, in our model, the weak coupling predictions remain valid, at least qualitatively, for spikelet amplitudes and IPSPs as large as 2 to 2.5 mV. These values are similar to those observed in vitro (Galarreta & Hestrin, 2001, 2002). In the case of the QIF model, we focused on the effect of the parameter Vs. The effects of Vr and VT can be studied in a similar way. In conductance-based models, as in real neurons, ionic channels determine the PRF (Crook et al., 1998a; Ermentrout et al., 2001; Oprisan & Canavier, 2002; Acker et al., 2003; Pfeuty et al., 2003). We have focused here on the effects of INaP and of a passive dendrite (modeled as a single compartment), both of which shift the maximum of the PRF toward the left. Other inward currents, such as calcium currents, have the same effect on the PRF and should affect synchrony in a similar way. Potassium currents, in contrast, skew the PRF to the right (Ermentrout et al., 2001; Pfeuty et al., 2003). Therefore, we predict that potassium and sodium currents have mirror effects on synchrony (Pfeuty et al., 2003).
We have verified these predictions in simulations of conductance-based models (results not shown). We have found that electrical synapses located on the dendritic compartment of the conductance-based neuron are less synchronizing than if they are located on the soma. One contribution to this effect is the additional delay induced by the dendritic compartment, which acts like shifting the PRF to the left. One can similarly predict the effect of conduction or synaptic delays in the inhibitory interactions. Although we did not include these delays in our model, our approach can be straightforwardly extended to take them into account (Izhikevich, 1998). For given cellular properties, this would amount to shifting the PRF of the neurons to the left, as would be the effect of increasing gNaP without delays.

5.2 Linear and Nonlinear Interplays Between Electrical and Inhibitory Synapses. When the interactions are not too strong, the CCs of the activity of two neurons are very well approximated by our analytical results, equations 2.21 and 2.28, which relate the CC to the effective phase interaction. For dual coupling, this interaction is the sum of two contributions due to each of the synapses. In this sense, the interactions combine their effects linearly. This implies that the CC for dual coupling is the normalized product of the CCs for each interaction alone. When the interactions are sufficiently strong, substantial deviations from equations 2.21 and 2.28 occur due, for instance, to nonlinear interactions between electrical and inhibitory synapses. This is what we have found when electrical synapses amplify the synchronous firing of a pair of neurons when combined with inhibition, although the CC would display a trough at zero time delay with electrical synapses alone, due to the intrinsic properties of the neurons. A counterpart of this synergetic effect in a large network of such neurons is the enhancement of synchrony by electrical synapses when combined with inhibition, although electrical synapses alone would promote asynchronous firing (results not shown). This effect should not be confused with the more trivial fact that combining electrical and inhibitory synapses enhances synchrony when the neuronal properties are such that both interactions alone promote synchronous firing.

5.3 Relation to Previous Studies.

5.3.1 Modeling Works. Previous work has addressed the role of cellular properties in inhibition-mediated synchrony, focusing on the interplay between postinhibitory rebound (PIR) and inhibition. These studies showed that PIR stabilizes in-phase synchrony for sufficiently slow and strong inhibition (Wang & Rinzel, 1992; Golomb, Wang, & Rinzel, 1994). The synchronizing role of inhibition studied here is essentially different; it occurs for QIF neurons that do not display PIR, and it does not require strong coupling. Rather, it is similar to the mechanism considered by van Vreeswijk et al. (1994) and Hansel et al. (1995).
Our study helps clarify how intrinsic neuronal properties affect synchrony in this mechanism. Previous studies of conductance-based models have shown that a pair of electrically coupled neurons may fire in synchrony or in antisynchrony depending on the model or the firing rate (Sherman & Rinzel, 1992; Han, Kurrer, & Kuramoto, 1995; Chow & Kopell, 2000). For the compartmental model studied by Alvarez, Chow, van Bockstaele, and Williams (2002), electrical synapses located on the soma synchronize the spikes of the neurons in the whole range of frequencies investigated. In contrast, for synapses on the dendrites, synchrony occurs only for sufficiently small firing rates. Our work provides a general framework to understand this diversity of behaviors. Synchrony of a pair of LIF neurons interacting via electrical and/or inhibitory synapses has been studied by Lewis and Rinzel (2003). The PRF of the LIF increases monotonically with the phase, regardless of the cellular parameters (Kuramoto, 1991; Hansel et al., 1995). Such a shape is found in the QIF model for Vs → VT. Therefore, in this limit, the properties of
the QIF model resemble those of the LIF. For instance, in the bottom-right regions of the QIF phase diagrams for electrical synapses (see Figure 2) and for inhibitory synapses (see Figure 3), the CCs have peaks at τ = 0 and at τ = ±T/2. Changing the size of the spike, θ, and the inhibitory time constants modifies the extent of these regions and their overlap. As a consequence, for large Vs and small f, the interplay of the two types of synapses depends on these parameters. For sufficiently large θ and fast inhibition (the case considered in Figure 4), C(T/2) decreases and C(0) increases when electrical coupling is added to inhibitory coupling, compared to the case of inhibition alone. The opposite effect occurs for slow inhibition and small θ (result not shown). This is similar to what Lewis and Rinzel (2003) found for the LIF, but it occurs only in a restricted domain of the phase diagrams of the QIF. The dynamics of networks consisting of a large number of inhibitory interneurons also coupled via electrical synapses have been investigated recently (Traub et al., 2001; Bartos et al., 2002). Bartos et al. found that in a network of Wang-Buzsáki (WB) neurons, electrical synapses enhanced synchrony even in the absence of inhibition. This agrees with our work, since one can check that the PRF of WB neurons is qualitatively similar to that of our model in the absence of INaP. Traub et al. found that in a network of multicompartmental neurons coupled with inhibition, electrical synapses located on the dendrites improve synchrony. The mechanism underlying this effect is not clear, since Traub et al. did not study their model when inhibition is suppressed.
5.3.2 Experimental Work. Recent experimental work suggests that electrically coupled pairs of LTS cells and pairs of FS cells have different synchronization properties (Mancilla, Lewis, Pinto, Rinzel, & Connors, 2002). Alvarez et al. (2002) have shown that the synchrony of neurons in the locus coeruleus (LC) interacting via electrical synapses is affected differently by the firing rate of the neurons in adult and neonatal slices. It would be interesting to investigate experimentally whether these differences in synchronization properties can be explained by differences in the neuronal PRF, as our work suggests. The synchronizing effect of electrical synapses has also been assessed in neocortex and hippocampus by showing a reduction in synchronous oscillations in the gamma frequency range after suppressing Cx36-based electrical synapses (Deans et al., 2001; Hormuzdi et al., 2001). In the work of Tamas et al. (2000), both inhibitory and electrical synapses were found to be required to obtain tight synchrony in a neocortical slice. Although all of these studies support the role of electrical synapses in synchrony, the underlying mechanism remains to be clarified in light of our work, which shows that synchrony arises in different ways depending on the intrinsic properties of the neurons and on the strength of the interactions.
Figure 11: Predictions for the dependence of the shape of the CCs on cellular and synaptic properties. (A) Qualitative predictions for the effect of changing gNaP and the firing frequency on the CCs of a pair of neurons interacting solely via electrical synapses. The coupling strength is fixed. (B) Qualitative predictions for a pair of neurons interacting solely via inhibitory synapses. The synaptic coupling conductance is fixed. (C) Qualitative predictions for the effects of the inhibitory and electrical coupling strengths on the shapes of the CCs for dual coupling. The persistent-sodium conductance and the firing rates of the neurons are fixed and sufficiently large.
5.4 Predictions for Experiments. The dynamic clamp technique (Sharp et al., 1993; Prinz et al., 2004; Brizzi et al., 2004) can be used to artificially modify the cellular properties of neurons as well as the couplings between neurons. Therefore, it can serve to study how the dynamics of neurons depend on their intrinsic properties or on the nature of their interactions (Prinz et al., 2004; Perez-Velazquez, Carlen, & Skinner, 2001). This approach can be used to test the predictions of this work experimentally, for instance, in pairs of cortical interneurons activated by injection of external noisy currents. Pairs of interneurons frequently interact via electrical and inhibitory synapses. A first experiment would be to record from pairs coupled only via electrical or only via inhibitory synapses to study how their CCs vary as a function of the average firing rates. We predict different behaviors, as schematically depicted in Figures 11A and 11B. Which of these behaviors actually occurs depends on the balance between inward and outward currents in the cells and on the location of the synapses on their dendritic trees. A second experiment would be to study the effect of modifications of the cellular properties. We focus here on the effect of INaP. Depending on the outcome of the first experiment, this effect would have to be tested by either increasing or decreasing the importance of this current (a decrease can be achieved by simulating a persistent sodium current with a negative conductance). We also predict that the effect of combining inhibitory and electrical coupling on the CCs depends on their strengths, as depicted in Figure 11C. Of particular interest would be to test the nonlinear interplay of these interactions.
Inhibitory and Electrical Synapses in Synchrony
665
Appendix A: Parameters of the Conductance-Based Model

The dynamics of the membrane potentials of the somatic and dendritic compartments of our conductance-based model are given by

C dV/dt = Iext − IL − INa − IK − INaP − gc (V − Vd) + Inoise + Igap + Iinh,  (A.1)

C dVd/dt = −gld (Vd − Vld) − gc (Vd − V) + Igap,  (A.2)
where IL = gL (V − VL) is a leak current, Iext is a constant external current, and Inoise is a gaussian white noise with zero mean and standard deviation σ. The synaptic current received by neuron i from neuron j is Iinh = −ginh sj (Vi − Vinh) for the inhibitory current, and Igap = −ggap (Vi − Vj) or Igap = −ggap (Vdi − Vdj) for the current due to an electrical synapse located between the somatic or the dendritic compartments, respectively. INa = gNa m∞³ h (V − VNa) and IK = gK n⁴ (V − VK) are the spike-generating currents, and INaP = gNaP p∞ (V − VNa) is a noninactivating (persistent) sodium current. The kinetics of the gating variables h, n, and s are given by

dx/dt = αx(V)(1 − x) − βx(V) x,  (A.3)
with x = h, n, s, and

αh(V) = 0.21 e^{−(V+58)/20},  βh(V) = 3/(1 + e^{−(V+28)/10}),
αs(V) = 50 (1 + tanh(V/4)),  βs(V) = 1/τinh.

The activation functions, m∞ and p∞, are given by m∞(V) = αm(V)/(αm(V) + βm(V)), where αm(V) = 0.1(V + 35)/(1 − e^{−(V+35)/10}) and βm(V) = 4 e^{−(V+60)/18}, and p∞(V) = 1/(1 + e^{−(V+40)/6}). Throughout this work, the parameters gNa = 35 mS/cm², VNa = 55 mV, gK = 10 mS/cm², VK = −75 mV, gL = 0.15 mS/cm², VL = −65 mV, gld = 0.15 mS/cm², Vld = −65 mV, Vinh = −75 mV, τinh = 6 msec, and C = 1 µF/cm² are kept constant. The conductance gNaP varies from 0 to 0.7 mS/cm², and the conductance between the dendritic and somatic compartments is either gc = 0 (one-compartment model) or 0.3 mS/cm².

Appendix B: Relationship Between the Phase-Shift Distribution Function and the Cross-Correlation in the Weak Coupling and Weak Noise Limit

The normalized cross-correlation of the spike trains is (see equation 2.14)

C(t2 − t1) = ⟨S1(t1) S2(t2)⟩ / (⟨S1(t)⟩ ⟨S2(t)⟩),  (B.1)

where S1(t) is defined as 1/δ if there is a spike in a bin of size δ and 0 otherwise. The average firing rate of the neuron is ⟨S1(t)⟩ = P1(t) = f.
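As a numerical illustration of equation B.1 (this sketch is ours, not part of the original derivation; the function name and binning choices are assumptions), the normalized cross-correlation can be estimated directly from binned spike trains:

```python
import numpy as np

def normalized_cc(spikes1, spikes2, delta, max_lag_bins):
    """Estimate C(tau) of eq. B.1 from binary spike arrays (one bin = delta)."""
    s1 = spikes1 / delta           # S(t) = 1/delta in bins containing a spike
    s2 = spikes2 / delta
    f1, f2 = s1.mean(), s2.mean()  # <S1>, <S2>: average firing rates
    lags = np.arange(-max_lag_bins, max_lag_bins + 1)
    cc = np.empty(len(lags))
    for idx, lag in enumerate(lags):
        if lag >= 0:
            num = (s1[: len(s1) - lag] * s2[lag:]).mean()
        else:
            num = (s1[-lag:] * s2[: len(s2) + lag]).mean()
        cc[idx] = num / (f1 * f2)  # normalize by the product of rates
    return lags, cc
```

Two identical periodic spike trains give a central peak C(0) > 1 and zero at lags where no coincidences occur, as expected for perfectly synchronized neurons.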
B. Pfeuty, G. Mato, D. Golomb, and D. Hansel
We define the joint probability P1,2(t1, t2) as the probability that neuron 1 fires between t1 and t1 + δ and neuron 2 fires between t2 and t2 + δ. Then,

⟨S1(t1) S2(t2)⟩ = P1,2(t1, t2) = P2|1(t2 | t1) P1(t1),  (B.2)
where P2|1 is the conditional probability that neuron 2 fires between t2 and t2 + δ given that neuron 1 fired between t1 and t1 + δ. Assuming stationarity, this function depends solely on the difference τ = t2 − t1: P2|1(t2 | t1) = P̃2|1(τ), and hence

C(τ) = ⟨S1(t1) S2(t1 + τ)⟩ / (⟨S1(t)⟩ ⟨S2(t)⟩) = P̃2|1(τ) / f.  (B.3)
Switching from a function of time to a function of phase, we use the relation P̃2|1(τ) dτ = P0(τ/T) d(τ/T) to obtain

C(τ) ≡ ⟨S1(t1) S2(t1 + τ)⟩ / (⟨S1(t)⟩ ⟨S2(t)⟩) = P0(τ/T),  (B.4)
where P0 is the probability distribution of the phase shift between the two neurons.

Acknowledgments

We thank B. Connors, C. Landisman, and C. van Vreeswijk for useful discussions and C. Meunier for a careful reading of this manuscript. This work was supported in part by a NATO PST-CLG grant (reference 977683), the ACI "Neurosciences integratives et computationnelles" (Ministère de la Recherche, France), and project SECyT-ECOS A99E01. D.G. was supported by the Israel Science Foundation (grant no. 657/01).

References

Abbott, L. F., & van Vreeswijk, C. (1993). Asynchronous states in networks of pulse-coupled oscillators. Phys. Rev. E, 48, 1483–1490. Acker, C. D., Kopell, N., & White, J. A. (2003). Synchronization of strongly coupled excitatory neurons: Relating network behavior to biophysics. J. Comput. Neurosci., 15, 71–90. Alvarez, L. F., Chow, C., van Bockstaele, E. J., & Williams, J. T. (2002). Frequency-dependent synchrony in locus coeruleus: Role of electrotonic coupling. Proc. Natl. Acad. Sci., 99, 4032–4036. Amitai, Y., Gibson, J. R., Patrick, A., Ho, B., Connors, B. W., & Golomb, D. (2002). Spatial organization of electrically coupled network of interneurons in neocortex. J. Neurosci., 22, 4142–4152. Bartos, M., Vida, I., Frotscher, M., Meyer, A., Monyer, H., Geiger, J. R., & Jonas, P. (2002). Fast synaptic inhibition promotes synchronized gamma oscillations in hippocampal interneuron networks. Proc. Natl. Acad. Sci., 99, 13222–13227.
Beierlein, M., Gibson, J. R., & Connors, B. W. (2000). A network of electrically coupled interneurons drives synchronized inhibition in neocortex. Nat. Neurosci., 3, 904–910. Bem, T., & Rinzel, J. (2004). Short duty cycle destabilizes a half-center oscillator, but gap junctions can restabilize the anti-phase pattern. J. Neurophysiol., 91, 693–703. Blatow, M., Rozov, A., Katona, I., Hormuzdi, S. G., Meyer, A. H., Whittington, M. A., Caputi, A., & Monyer, H. (2003). A novel network of multipolar bursting interneurons generates theta frequency oscillations in neocortex. Neuron, 38, 805–817. Bou-Flores, C., & Berger, A. J. (2001). Gap junctions and inhibitory synapses modulate inspiratory motoneuron synchronization. J. Neurophysiol., 85, 1543–1551. Brizzi, L., Meunier, C., Zytnicki, D., Donnet, M., Hansel, D., Lamotte D'Incamps, B., & van Vreeswijk, C. (2004). How shunting inhibition affects the discharge of lumbar motoneurones: A dynamic clamp study in anaesthetized cats. J. Physiol., 558, 671–683. Brunel, N., & Hakim, V. (1999). Fast global oscillations in networks of integrate-and-fire neurons with low firing rates. Neural Comput., 11, 1621–1671. Chow, C. C., & Kopell, N. (2000). Dynamics of spiking neurons with electrical coupling. Neural Comput., 12, 1643–1678. Crook, S. M., Ermentrout, G. B., & Bower, J. M. (1998a). Spike frequency adaptation affects the synchronization properties of networks of cortical oscillations. Neural Comput., 10, 837–854. Crook, S. M., Ermentrout, G. B., & Bower, J. M. (1998b). Dendritic and synaptic effects in systems of coupled cortical oscillators. J. Comput. Neurosci., 5, 315–329. Deans, M. R., Gibson, J. R., Sellitto, C., Connors, B. W., & Paul, D. L. (2001). Synchronous activity of inhibitory networks in neocortex requires electrical synapses containing connexin36. Neuron, 31, 477–485. Draguhn, A., Traub, R. D., Schmitz, D., & Jefferys, J. G. (1998).
Electrical coupling underlies high-frequency oscillations in the hippocampus in vitro. Nature, 394, 189–192. Ermentrout, B. (1981). n:m phase-locking of weakly coupled oscillators. J. Math. Biol., 12, 327–342. Ermentrout, B. (1996). Type I membranes, phase resetting curves and synchrony. Neural Comput., 8, 979-1001. Ermentrout, B., & Kopell, N. (1986). Parabolic bursting in an excitable system coupled with a slow oscillation. SIAM J. Appl. Math., 46, 233–253. Ermentrout, B., Pascal, M., & Gutkin, B. (2001). The effects of spike frequency adaptation and negative feedback on the synchronization of neural oscillators. Neural Comput., 13, 1285–1310. Friedman, D., & Strowbridge, B. W. (2003). Both electrical and chemical synapses mediate fast network oscillations in the olfactory bulb. J. Neurophysiol., 89, 2601–2610.
Fukuda, T., & Kosaka, T. (2000). Gap-junction coupling linking the dendritic network of GABAergic neurons in the hippocampus. J. Neurosci., 20, 1519–1528. Furshpan, E. J., & Potter, D. D. (1959). Transmission at the giant motor synapses of the crayfish. J. Physiol., 145, 289–325. Galarreta, M., & Hestrin, S. (1999). A network of fast spiking cells in the neocortex connected by electrical synapses. Nature, 402, 72–75. Galarreta, M., & Hestrin, S. (2001). Electrical synapses between GABA-releasing neurons. Nature Neurosci., 2, 425–433. Galarreta, M., & Hestrin, S. (2002). Electrical and chemical synapses among parvalbumin fast-spiking GABAergic interneurons in adult mouse neocortex. Proc. Natl. Acad. Sci. USA, 99, 12438–12443. Gerstner, W., & Kistler, W. M. (2002). Spiking neuron models: Single neurons, populations, plasticity. Cambridge: Cambridge University Press. Gibson, J. R., Beierlein, M., & Connors, B. (1999). Two networks of inhibitory neurons electrically coupled. Nature, 402, 75–79. Golomb, D., Hansel, D., & Mato, G. (2001). Mechanisms of synchrony of neural activity in large networks. In F. Moss & S. Gielen (Eds.), Handbook of biological physics, Vol. 4: Neuro-informatics and neural modeling (pp. 887–968). Amsterdam: Elsevier Science. Golomb, D., Wang, X. J., & Rinzel, J. (1994). Synchronization properties of spindle oscillations in a thalamic reticular nucleus model. J. Neurophysiol., 72, 1109–1126. Han, S. K., Kurrer, C., & Kuramoto, Y. (1995). Dephasing and bursting in coupled neural oscillators. Phys. Rev. Lett., 75, 3190–3193. Hansel, D., & Mato, G. (2003). Asynchronous states and the emergence of synchrony in large networks of interacting excitatory and inhibitory neurons. Neural Comput., 15, 1–56. Hansel, D., Mato, G., & Meunier, C. (1993). Phase dynamics for weakly coupled Hodgkin-Huxley neurons. Europhys. Lett., 23, 367–372. Hansel, D., Mato, G., & Meunier, C. (1995). Synchrony in excitatory neural networks. Neural Comput., 7, 307–337.
Hormuzdi, S. G., Pais, I., Lebeau, F. E. N., Towers, S. K., Rozov, A., Buhl, E., Whittington, M. A., & Monyer, H. (2001). Impaired electrical signaling disrupts gamma frequency oscillations in connexin 36-deficient mice. Neuron, 31, 487–495. Izhikevich, E. M. (1998). Phase models with explicit time delays. Phys. Rev. E, 58, 905–908. Kita, H., Kosaka, T., & Heizmann, C. W. (1990). Parvalbumin-immunoreactive neurons in the rat neostriatum: A light and electron microscopic study. Brain Res., 536, 1–15. Kuramoto, Y. (1984). Chemical oscillations, waves and turbulence. New York: Springer. Kuramoto, Y. (1991). Collective synchronization of pulse-coupled oscillators and excitable units. Physica D, 50, 15–30.
Landisman, C. E., Long, M. A., Beierlein, M., Deans, M. R., Paul, D. L., & Connors, B. W. (2002). Electrical synapses in the thalamic reticular nucleus. J. Neurosci., 22, 1002–1009. Latham, P. E., Richmond, B. J., Nelson, P. G., & Nirenberg, S. (2000). Intrinsic dynamics in neuronal networks. I. Theory. J. Neurophysiol., 83, 808–827. Lewis, T., & Rinzel, J. (2003). Dynamics of spiking neurons connected by both inhibitory and electrical coupling. J. Comput. Neurosci., 14, 283–309. Mancilla, J. G., Lewis, T., Pinto, D. J., Rinzel, J., & Connors, B. W. (2002). Firing dynamics of single and coupled pairs of inhibitory interneurons in neocortex. Program No. 840.13. Abstract Viewer/Itinerary Planner. Washington, DC: Society for Neuroscience. Available at: http://apu.sfn.org. Mann-Metzer, P., & Yarom, Y. (1999). Electrical coupling interacts with intrinsic properties to generate synchronized activity in cerebellar networks of inhibitory interneurons. J. Neurosci., 19, 3298–3306. Nomura, M., Fukai, T., & Aoyagi, T. (2003). Synchrony of fast-spiking interneurons interconnected by GABAergic and electrical synapses. Neural Comput., 15, 2179–2198. Oprisan, S. A., & Canavier, C. C. (2002). The influence of limit cycle topology on the phase resetting curve. Neural Comput., 14, 1027–1057. Perez-Velazquez, J. L., Carlen, P. L., & Skinner, F. K. (2001). Artificial electrotonic coupling affects neuronal firing patterns depending upon cellular characteristics. Neuroscience, 103, 841–849. Perez-Velazquez, J. L., Valiante, T. A., & Carlen, P. L. (1994). Modulation of gap junctional mechanisms during calcium-free induced field burst activity: A possible role for electrotonic coupling in epileptogenesis. J. Neurosci., 14, 4308–4317. Pfeuty, B., Mato, G., Golomb, D., & Hansel, D. (2003). Electrical synapses and synchrony: The role of intrinsic currents. J. Neurosci., 23, 6280–6294. Prinz, A. A., Abbott, L. F., & Marder, E. (2004). The dynamic clamp comes of age. Trends Neurosci., 27, 228–234.
Sharp, A. A., O’Neil, M. B., Abbott, L. F., & Marder, E. (1993). The dynamic clamp: Artificial conductances in biological neurons. Trends Neurosci., 16, 389–394. Sherman, A., & Rinzel, J. (1992). Rhythmogenic effects of weak electrotonic coupling in neuronal models. Proc. Natl. Acad. Sci., 89, 2471–2474. Skinner, F. K., Zhang, L., Velazquez, J. L., & Carlen, P. L. (1999). Bursting in inhibitory interneuronal networks: A role for gap-junctional coupling. J. Neurophysiol., 81, 1274–1283. Tamas, G., Buhl, E. H., Lorincz, A., & Somogyi, P. (2000). Proximally targeted GABAergic synapses and gap-junctions precisely synchronize cortical interneurons. Nature Neurosci., 3, 366–371. Traub, R. D., Kopell, N., Bibbig, A., Buhl, E., Lebeau F. E. N., & Whittington, M. (2001). Gap junction between interneuron dendrites can enhance synchrony of gamma oscillations in distributed networks. J. Neurosci., 21, 9478–9486. van Kampen, N. G. (1981). Stochastic processes in physics and chemistry. Amsterdam: Elsevier Science. van Vreeswijk, C., Abbott, L. F., & Ermentrout, G. B. (1994). When inhibition not excitation synchronizes neural firing. J. Comput. Neurosci., 1, 313–321.
Venance, L., Rozov, A., Blatow, M., Burnashev, N., Feldmeyer, D., & Monyer, H. (2000). Connexin expression in electrically coupled postnatal rat brain neurons. Proc. Natl. Acad. Sci., 97, 10260–10265. Wang, X. J., & Buzsáki, J. (1996). Gamma oscillation by synaptic inhibition in a hippocampal interneuronal network model. J. Neurosci., 16, 6402–6413. Wang, X. J., & Rinzel, J. (1992). Alternating and synchronous rhythms in reciprocally inhibitory model neurons. Neural Comput., 4, 84–117. Watanabe, A. (1958). The interaction of electrical activity among neurons of lobster cardiac ganglion. Jap. J. Physiol., 8, 305–318. Received June 1, 2004; accepted August 3, 2004.
LETTER
Communicated by Chris Watkins
Spikernels: Predicting Arm Movements by Embedding Population Spike Rate Patterns in Inner-Product Spaces

Lavi Shpigelman
[email protected]
School of Computer Science and Engineering and Interdisciplinary Center for Neural Computation, Hebrew University, Jerusalem 91904, Israel

Yoram Singer
[email protected]
School of Computer Science and Engineering, Hebrew University, Jerusalem 91904, Israel

Rony Paz
[email protected]
Eilon Vaadia
[email protected]
Interdisciplinary Center for Neural Computation and Department of Physiology, Hadassah Medical School, The Hebrew University, Jerusalem 91904, Israel
Inner-product operators, often referred to as kernels in statistical learning, define a mapping from some input space into a feature space. The focus of this letter is the construction of biologically motivated kernels for cortical activities. The kernels we derive, termed Spikernels, map spike count sequences into an abstract vector space in which we can perform various prediction tasks. We discuss in detail the derivation of Spikernels and describe an efficient algorithm for computing their value on any two sequences of neural population spike counts. We demonstrate the merits of our modeling approach by comparing the Spikernel to various standard kernels in the task of predicting hand movement velocities from cortical recordings. All of the kernels that we tested in our experiments outperform the standard scalar product used in linear regression, with the Spikernel consistently achieving the best performance.
Neural Computation 17, 671–690 (2005)
© 2005 Massachusetts Institute of Technology

1 Introduction

Neuronal activity in primary motor cortex (MI) during multijoint arm reaching movements in 2D and 3D (Georgopoulos, Schwartz, & Kettner, 1986;
Georgopoulos, Kettner, & Schwartz, 1988) and drawing movements (Schwartz, 1994) has been used extensively as a test bed for gaining understanding of neural computations in the brain. Most approaches assume that information is coded by firing rates, measured on various timescales. The tuning curve approach models the average firing rate of a cortical unit as a function of some external variable, like the frequency of an auditory stimulus or the direction of a planned movement. Many studies of motor cortical areas (Georgopoulos, Kalaska, & Massey, 1983; Georgopoulos et al., 1988; Moran & Schwartz, 1999; Schwartz, 1994; Laubach, Wessberg, & Nicolelis, 2000) showed that while single units are broadly tuned to movement direction, a relatively small population of cells (tens to hundreds) carries enough information to allow for accurate prediction. Such broad tuning can be found in many parts of the nervous system, suggesting that computation by distributed populations of cells is a general cortical feature. The population vector method (Georgopoulos et al., 1983, 1988) describes each cell's firing rate as the dot product between that cell's preferred direction and the direction of hand movement. The vector sum of preferred directions, weighted by the measured firing rates, is used both as a way of understanding what the cortical units encode and as a means for estimating the velocity vector. Several recent studies (Fu, Flament, Coltz, & Ebner, 1997; Donchin, Gribova, Steinberg, Bergman, & Vaadia, 1998; Schwartz, 1994) propose that neurons can represent or process multiple parameters simultaneously, suggesting that the dynamic organization of activity in neuronal populations may represent temporal properties of behavior, such as the transformation from desired action in external coordinates to muscle activation patterns.
A few studies (Vaadia et al., 1995; Laubach, Schuler, & Nicolelis, 1999; Riehle, Grün, Diesmann, & Aertsen, 1997) support the notion that neurons can rapidly associate into and dissociate from functional groups in the process of performing a computational task. Together, the concepts of simultaneous encoding of multiple parameters and of dynamic representation in neuronal populations could explain some of the conundrums in motor system physiology. These concepts also invite the use of increasingly complex models for relating neural activity to behavior. Advances in computing power and recent developments in physiological recording methods allow recording of ever-growing numbers of cortical units that can be used for real-time analysis and modeling. These developments and new understandings have recently been used to reconstruct movements on the basis of neuronal activity in real time in an effort to facilitate the development of hybrid brain-machine interfaces that allow interaction between living brain tissue and artificial electronic or mechanical devices to produce brain-controlled movements (Chapin, Moxon, Markowitz, & Nicolelis, 1999; Laubach et al., 2000; Nicolelis, 2001; Wessberg et al., 2000; Laubach et al., 1999; Nicolelis, Ghazanfar, Faggin, Votaw, & Oliveira, 1997; Isaacs, Weber, & Schwartz, 2000).
Most of the current attempts that focus on predicting movement from cortical activity rely on modeling techniques that employ parametric models to describe a neuron's tuning (instantaneous spike rate) as a function of current behavior. For instance, cosine-tuning estimation (the population vector) is used by Taylor, Tillery, and Schwartz (2002), and linear regression was employed by Wessberg et al. (2000) and Serruya, Hatsopoulos, Paninski, Fellows, and Donoghue (2002). Brown, Frank, Tang, Quirk, and Wilson (1998) and Brockwell, Rojas, and Kass (2004) have applied filtering methods that, apart from assuming parametric models of spike rate generation, also incorporate a statistical model of the movement. An exception is the use of artificial neural nets (Wessberg et al., 2000), though that study reports better results with linear regression. This article describes the tailoring of a kernel method for the task of predicting two-dimensional hand movements. The main difference between our method and the other approaches is that our nonparametric approach does not assume cell independence and imposes relatively few assumptions on the tuning properties of the cells and on how the neural code is read, allowing behavior to be represented in highly specific neural patterns and, as we demonstrate later, yielding better predictions on our test sets. However, due to the implicit manner in which neural activity is mapped to feature space, these improved results often come at the expense of understanding the tuning of individual cells. We attempt to partially overcome this difficulty by describing the feature space that our kernel induces and by explaining the way our kernel parameters affect the features. The letter is organized as follows. In section 2, we describe the problem setting that this article is concerned with. In section 3, we introduce and explain the main mathematical tool that we use: the kernel operator.
In section 4, we discuss the design and implementation of a biologically motivated kernel for neural activities. The experimental method is described in section 5. We report experimental results in section 6 and give conclusions in section 7.

2 Problem Setting

Consider the case where we monitor instantaneous spike rates from q cortical units during physical motor behavior of a subject (the method used for translating recorded spike times into rate measures is explained in section 5). Our goal is to learn a predictive model of some behavior parameter (such as hand velocities, as described in section 5) with the cortical activity as the input. Formally, let S ∈ R^{q×m} be a sequence of instantaneous firing rates from q cortical units consisting of m time samples. We use S and T to denote sequences of firing rates and denote by len(S) the time length of a sequence S (len(S) = m in the above definition). Si ∈ R^q designates the instantaneous firing rates of all neurons at time i in the sequence S, and Si,k ∈ R the rate of neuron k. We also use Ss to denote the concatenation of S with one more sample s ∈ R^q. The instantaneous firing rate of a unit k in sample s is then
sk ∈ R. We also need to employ a notation for subsequences. A consecutive subsequence of S from time t1 to time t2 is denoted St1:t2. Finally, throughout the work, we need to examine possibly nonconsecutive subsequences. We denote by i ∈ Z^n a vector of time indices into the sequence S such that 1 ≤ i1 < i2 < · · · < in ≤ len(S); Si ∈ R^{q×n} is then an ordered selection of spike rates (of all neurons) from times i. Let y ∈ R^m denote a parameter of the movement that we would like to predict (e.g., the movement velocity in the horizontal direction, vx). Our goal is to learn an approximation ŷ of y via a mapping f : R^{q×m} → R^m from neural firing rates to the movement parameter. In general, information about movement can be found in neural activity both before and after the time of the movement itself. Our long-term plan, though, is to design a model that can be used for controlling a neural prosthesis. We therefore confine ourselves to causal predictors that use an l-long (l ∈ Z) window of neural activity ending at time step t, S(t−l+1):t, to predict yt ∈ R. We would like to make ŷt = f(S(t−l+1):t) as close as possible (in a sense that is explained in the sequel) to yt.

3 Kernel Method for Regression

The major mathematical tool employed in this article is the kernel operator. Kernel operators allow algorithms whose interface to the data is limited to scalar products to employ complicated premappings of the data into feature spaces. Formally, a kernel is an inner-product operator K : X × X → R, where X is some arbitrary vector space. In this work, X is the space of neural activities (specifically, population spike rates). An explicit way to describe K is via a mapping φ : X → H from X to a feature space H such that K(x, x′) = φ(x) · φ(x′). Given a kernel operator, we can use it to perform various statistical learning tasks.
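As a toy illustration of the identity K(x, x′) = φ(x) · φ(x′) (our example, not from the letter): the homogeneous quadratic kernel on R² corresponds to an explicit three-dimensional feature map, so the kernel value computed in input space can be checked against the explicit inner product in feature space:

```python
import numpy as np

def quad_kernel(x, xp):
    # K(x, x') = (x . x')^2, computed without leaving the input space
    return float(np.dot(x, xp)) ** 2

def phi(x):
    # explicit feature map with K(x, x') = phi(x) . phi(x') for x in R^2
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2.0) * x[0] * x[1]])

x, xp = np.array([1.0, 2.0]), np.array([3.0, -1.0])
assert np.isclose(quad_kernel(x, xp), np.dot(phi(x), phi(xp)))  # both equal 1.0
```

For richer kernels the feature space is far too large (or infinite) to enumerate, which is exactly why kernel methods evaluate K directly.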
One such task is support vector regression (SVR) (see, e.g., Smola & Schölkopf, 2004), which attempts to find a regression function for target values that is linear when viewed in the (typically very high-dimensional) feature space mapped by the kernel. We give a brief description of SVR here for clarity. SVR employs the ε-insensitive loss function (Vapnik, 1995). Formally, this loss is defined as

|yt − f(S(t−l+1):t)|_ε = max(0, |yt − f(S(t−l+1):t)| − ε).

Examples that fall within ε of the target value yt do not contribute to the error. Those that fall farther away contribute linearly to the loss (see Figure 1, left). The form of the regression model is

f(S(t−l+1):t) = w · φ(S(t−l+1):t) + b,  (3.1)

which is linear in the feature space H implicitly defined by the kernel. Combined with the loss function, f defines a hyperslab of width ε centered at the estimate (see Figure 1, right, for a one-dimensional illustration).
Figure 1: Illustration of the ε-insensitive loss (left) and an ε-insensitive area around a linear regression in a single-dimensional feature space (right). Examples that fall within distance ε have zero loss (shaded area).
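The ε-insensitive loss defined above is simple to sketch numerically (a minimal illustration of ours; the function name is hypothetical):

```python
import numpy as np

def eps_insensitive_loss(y, y_hat, eps):
    # |y - y_hat|_eps = max(0, |y - y_hat| - eps): zero inside the eps tube
    # around the estimate, growing linearly with the residual outside it
    return np.maximum(0.0, np.abs(y - y_hat) - eps)

# residuals of 0, 0.5, and 2.0 against an estimate of 0 with eps = 1.0
losses = eps_insensitive_loss(np.array([0.0, 0.5, 2.0]), 0.0, 1.0)
```

The first two targets sit inside the tube and incur zero loss; the last one, at distance 2.0, incurs loss 1.0.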
For each trial i and time index t designating a frame within the trial, we receive a pair (S^i_(t−l+1):t, y^i_t) consisting of spike rates and a corresponding target value. The optimization problem employed by SVR is

argmin_w (1/2) ‖w‖² + C Σ_{i,t} |y^i_t − f(S^i_(t−l+1):t)|_ε,  (3.2)

where ‖·‖² is the squared vector norm. This minimization casts a trade-off (weighted by C ∈ R) between a small empirical error and a regularization term that favors low sensitivity to feature values. Let φ(x) denote the feature vector that is implicitly implemented by the kernel function K(·, x). Then there exists a regression model that is equivalent to the one given in equation 3.1, is a minimum of equation 3.2, and takes the form

f(T) = Σ_{t,i} A_{t,i} K(S^i_(t−l+1):t, T) + b.
Here, T is the observed neural activity, and A is a (typically sparse) matrix of Lagrange multipliers determined by the minimization problem in equation 3.2. In summary, SVR solves a quadratic optimization problem aimed at finding a linear regressor in an induced high-dimensional feature space. This regressor is the best penalized estimate under the ε-insensitive loss function, where the penalty is the squared norm of the regressor's weights.

4 Spikernels

The quality of kernel-based learning depends heavily on how the data are embedded in the feature space via the kernel operator. For this reason, several studies have been devoted to developing new kernels (Jaakkola & Haussler, 1998; Genton, 2001; Lodhi, Shawe-Taylor, Cristianini, & Watkins, 2000). In fact, a good kernel could render the work of a classification or regression algorithm trivial. With this in mind, we develop a kernel for neural spike rate activity.
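Solving the SVR quadratic program of section 3 requires a dedicated solver; as a lighter numerical sketch of the same dual prediction form f(T) = Σ A K(S, T) + b, the snippet below fits kernel ridge regression from a precomputed Gram matrix instead (squared loss, no ε tube, no bias term — a deliberate substitution, not the letter's method; all names, the RBF kernel, and the toy data are our assumptions):

```python
import numpy as np

def fit_dual_coeffs(gram, y, reg=1e-6):
    # dual coefficients a solving (K + reg*I) a = y; prediction is then
    # f(T) = k(T) . a, the same expansion over training examples as in SVR
    return np.linalg.solve(gram + reg * np.eye(len(y)), y)

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))                 # 40 training windows, 5 features each
y = np.sin(X[:, 0])                          # toy target
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
gram = np.exp(-sq_dists)                     # RBF Gram matrix (stand-in kernel)
a = fit_dual_coeffs(gram, y)
y_fit = gram @ a                             # in-sample predictions
```

Because the learner only touches the Gram matrix, any kernel (including the Spikernel developed below) can be dropped in without changing the regression code.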
Figure 2: Illustrative examples of pattern similarities. (a) Bin-by-bin comparison yields small differences. (b) Patterns with large bin-by-bin differences that can be eliminated with some time warping. (c) Patterns whose suffix (time of interest) is similar and prefix is different.
4.1 Motivation. Our goal in developing a kernel for spike trains is to map similar patterns to nearby areas of the feature space. In the description of our kernel, we attempt to capture some well-accepted notions on similarities between spike trains. We make the following assumptions regarding similarities between spike patterns: • A commonly made assumption is that firing patterns that have small differences in a bin-by-bin comparison are similar. This is the basis of linear regression. In Figure 2a, we show an example of two patterns that are bin-wise similar. • A cortical population may display highly specific patterns to represent specific information such as a particular stimulus or hand movement.
Though in this work we make use of spike counts within each time bin, we assume that a pattern of spike counts may be specific to an external stimulus or action and that the manner in which these patterns change as a function of the behavior may be highly nonlinear (Segev & Rall, 1998). Many models of neuronal computation are based on this assumption (see, e.g., Pouget & Snyder, 2000). • Two patterns may be quite different from a simple bin-wise perspective, yet if they are aligned using a nonlinear time distortion or shifting, the similarity becomes apparent. An illustration of such patterns is given in Figure 2b. In comparing patterns, we would like to induce a higher score when the time shifts are small. Some biological plausibility for time-warped integration of inputs may stem from the spatial distribution of synaptic inputs on cortical cells' dendrites. In order for two signals to be integrated along a dendritic tree or at the axon hillock, their sources must be appropriately distributed in space and time, and this distribution may be sensitive to the current depolarization state of the tree (Magee, 2000). • Patterns that are associated with identical values of an external stimulus at time t may be similar at that time but rather different at nearby times, when the values of the external stimulus for these patterns are no longer similar (as illustrated in Figure 2c). We would like to give a higher similarity score to patterns that are similar around a time of interest (the prediction time), even if they are rather different at other times. 4.2 Spikernel Definition. We describe our kernel construction, the Spikernel, by specifying the features that make up the feature space. Our construction of the feature space builds on the work of Haussler (1999), Watkins (1999), and Lodhi et al. (2000).
We chose to directly specify the features constituting the kernel rather than describe the kernel from a functional point of view, as this provides some insight into the structure of the feature space. We first need to introduce a few more notations. Let S be a sequence of length l = len(S). The set of all possible n-long index vectors defining a subsequence of S is I_{n,l} = {i : i ∈ Z^n, 1 ≤ i1 < · · · < in ≤ l}. Also, let d(α, β) (α, β ∈ R^q) denote a bin-wise distance over a pair of samples (population firing rates). We also overload notation and denote by d(Si, U) = Σ_{k=1}^n d(S_{i_k}, U_k) a distance between sequences: the sequence distance is the sum of the distances over the samples. One such function (the d we explore here) is the squared ℓ2 norm. Let μ, λ ∈ (0, 1). The U component of our (infinite) feature vector φ(S) is defined as

φ^n_U(S) = Σ_{i ∈ I_{n,len(S)}} μ^{d(Si, U)} λ^{len(S) − i1},  (4.1)

where i1 is the first entry in the index vector i. In words, φ^n_U(S) is a sum
over all n-long subsequences of S. Each subsequence Si is compared to U (the feature coordinate) and contributes to the feature value a number that is proportional to the bin-wise similarity between Si and U and inversely proportional to the amount of stretching Si needs to undergo. That is, the feature entry indexed by U measures how similar U is to the latter part of the time series S. This definition seems to fit our assumptions on neural coding for the following reasons:

• A feature φ^n_U(S) gets large values only if U is bin-wise similar to subsequences of the pattern S.

• It allows for learning complex patterns: choosing small values of λ and μ (or concentrated d measures) means that each feature tends toward being either 1 or 0, depending on whether the feature coordinate U is almost identical to a suffix of S.

• We allow gaps in the indexes defining subsequences, thus allowing for time warping.

• Subpatterns that begin farther from the required prediction time are penalized by an exponentially decaying weight; thus, more weight is given to activity close to the prediction time (the end of the sequence), which designates our time of interest.

4.3 Efficient Kernel Calculation. The definition of φ given by equation 4.1 requires the manipulation of an infinite feature space and an iteration over a number of subsequences that grows exponentially with the lengths of the sequences. Straightforward calculation of the feature values and of the induced inner product is clearly impossible. Building on ideas from Lodhi et al. (2000), who describe a similar kernel (for finite sequence alphabets and with no time of interest), we developed an indirect method for evaluating the kernel through a recursion that can be performed efficiently using dynamic programming. The solution we now describe is rather general, as it employs an arbitrary base feature vector ψ. We later describe our specific design choice for ψ.
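For intuition, a single coordinate φ^n_U(S) of equation 4.1 can be computed by brute-force enumeration of index vectors (a sketch of ours, tractable only for short sequences — which is what motivates the recursive computation of this section):

```python
import itertools
import numpy as np

def phi_U(S, U, mu, lam):
    """Brute-force phi^n_U(S) from eq. 4.1; S is q x m, U is q x n, mu, lam in (0, 1)."""
    q, m = S.shape
    n = U.shape[1]
    total = 0.0
    # enumerate every ordered index vector i with 1 <= i_1 < ... < i_n <= m
    for idx in itertools.combinations(range(m), n):
        # d(S_i, U): sum over matched samples of the squared l2 distance
        d = sum(float(np.sum((S[:, ik] - U[:, k]) ** 2)) for k, ik in enumerate(idx))
        total += mu ** d * lam ** (m - (idx[0] + 1))  # len(S) - i_1, i_1 one-based
    return total
```

Subsequences that match U bin-wise contribute weight near 1, while poorly matching or early-starting subsequences are discounted by μ and λ, mirroring the bullet points above.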
The definition in equation 4.1 is a special case of the following feature definition,

$$\varphi^n_U(S) = \sum_{i \in I_{n,\mathrm{len}(S)}} \lambda^{\mathrm{len}(S) - i_1} \prod_k \psi_{U_k}(S_{i_k}), \qquad (4.2)$$

where n specifies the length of the subsequences, and for the Spikernel, we substitute $\psi_{U_k}(S_{i_k}) = \mu^{d(S_{i_k}, U_k)}$.
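For intuition, the kernel induced by this feature map can be evaluated by brute-force enumeration for very short sequences. The sketch below is ours, not the authors' implementation; it assumes the gaussian base kernel $K^*(\alpha,\beta) = \mu^{\|\alpha-\beta\|^2/2}$ (with c = 1) that section 4.4 adopts, and the per-sequence weight $\lambda^{\mathrm{len}(S)-i_1+1}$ that the recursion 4.3 unrolls to (equation 4.2's normalization differs by one power of λ per sequence):

```python
import numpy as np
from itertools import combinations

def naive_spikernel(S, T, n, lam, mu):
    """Brute-force K^n(S, T): sum over all increasing n-tuples of indices
    i in S and j in T of lam^((len(S)-i_1+1) + (len(T)-j_1+1)) times the
    product of base-kernel similarities K*(a, b) = mu**(||a-b||^2 / 2).
    Exponential in the sequence lengths; for illustration only."""
    S, T = np.asarray(S, float), np.asarray(T, float)
    total = 0.0
    for i in combinations(range(len(S)), n):       # increasing index tuples in S
        for j in combinations(range(len(T)), n):   # increasing index tuples in T
            # 0-based i[0] makes len(S) - i[0] equal the 1-based len(S) - i_1 + 1
            w = lam ** ((len(S) - i[0]) + (len(T) - j[0]))
            sim = 1.0
            for a, b in zip(i, j):                 # bin-wise similarity product
                sim *= mu ** (((S[a] - T[b]) ** 2).sum() / 2.0)
            total += w * sim
    return total
```

For two identical length-2 sequences with n = 1, the sum has four terms weighted only by λ, which makes the arithmetic easy to check by hand.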
The feature vector of equation 4.2 satisfies the following recursion:

$$\varphi^n(S) = \begin{cases} \sum_{i=1}^{\mathrm{len}(S)} \lambda^{\mathrm{len}(S)-i+1}\, \psi(S_i)\, \varphi^{n-1}(S_{1:i-1}) & \mathrm{len}(S) \ge n > 0 \\ \mathbf{0} & \mathrm{len}(S) < n,\ n > 0 \\ \mathbf{1} & n = 0 \end{cases} \qquad (4.3)$$

The recursive form is derived by decomposing the sum over all subsequences into a sum over the last subsequence index and a sum over the rest of the indices. The fact that $\varphi^n(S)$ is the zero vector for sequences of length less than n follows from equation 4.2. The value of $\varphi^n$ for n = 0 is defined to be the all-ones vector. The kernel that performs the dot product between two feature vectors is then defined as

$$K^n(S, T) = \varphi^n(S) \cdot \varphi^n(T) = \int_{\mathbb{R}^{q \times n}} \varphi^n_U(S)\, \varphi^n_U(T)\, dU,$$
where n designates subsequences of length n as before. We now plug in the recursive feature vector definition from equation 4.3:
$$K^n(S, T) = \int_{\mathbb{R}^{q \times n}} \left[ \sum_{i=1}^{\mathrm{len}(S)} \lambda^{\mathrm{len}(S)-i+1}\, \psi_{U_n}(S_i)\, \varphi^{n-1}_{U_{1:n-1}}(S_{1:i-1}) \right] \times \left[ \sum_{j=1}^{\mathrm{len}(T)} \lambda^{\mathrm{len}(T)-j+1}\, \psi_{U_n}(T_j)\, \varphi^{n-1}_{U_{1:n-1}}(T_{1:j-1}) \right] dU. \qquad (4.4)$$
Next, we note that

$$\int_{\mathbb{R}^{q \times (n-1)}} \varphi^{n-1}_U(S_{1:i-1})\, \varphi^{n-1}_U(T_{1:j-1})\, dU = \varphi^{n-1}(S_{1:i-1}) \cdot \varphi^{n-1}(T_{1:j-1}) = K^{n-1}(S_{1:i-1}, T_{1:j-1}),$$

and also

$$\int_{\mathbb{R}^q} \psi_{U_k}(S_i)\, \psi_{U_k}(T_j)\, dU_k = \psi(S_i) \cdot \psi(T_j).$$
Defining $K^*(S_i, T_j) = \psi(S_i) \cdot \psi(T_j)$, we now move the integration in equation 4.4 inside the sums to get

$$K^n(S, T) = \sum_{i=1}^{\mathrm{len}(S)} \sum_{j=1}^{\mathrm{len}(T)} \lambda^{\mathrm{len}(S)+\mathrm{len}(T)-i-j+2}\, K^*(S_i, T_j)\, K^{n-1}(S_{1:i-1}, T_{1:j-1}). \qquad (4.5)$$
We have thus derived a recursive definition of the kernel, which inherits the initial conditions from equation 4.3:

$$\forall S, T:\quad K^0(S, T) = 1; \qquad \text{if } \min(\mathrm{len}(S), \mathrm{len}(T)) < i \text{ then } K^i(S, T) = 0. \qquad (4.6)$$
The kernel function $K^*(\cdot,\cdot)$ may be any kernel operator and operates on population spike rates sampled at single time points. To obtain the Spikernel described above, we chose

$$K^*(\alpha, \beta) = \int_{\mathbb{R}^q} \mu^{d(u,\alpha)}\, \mu^{d(u,\beta)}\, du.$$
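Equations 4.5 and 4.6, together with the cached prefix recursion derived next in the text, yield an O(len(S)·len(T)·n) dynamic program over sequence prefixes. The following is a minimal sketch of ours, not the authors' code; it assumes the squared ℓ2 distance for d (which the text adopts in section 4.4), giving the gaussian base kernel $K^*(\alpha,\beta) = \mu^{\|\alpha-\beta\|^2/2}$ with c = 1:

```python
import numpy as np

def spikernel(S, T, n=3, lam=0.9, mu=0.99):
    """Spikernel K^n(S, T) via the O(len(S)*len(T)*n) cached recursion.

    S, T: arrays of shape (len, q), windows of population spike counts.
    Base kernel K*(a, b) = mu**(||a - b||^2 / 2), a gaussian with c = 1.
    """
    S, T = np.asarray(S, float), np.asarray(T, float)
    ls, lt = len(S), len(T)
    # Pairwise base-kernel values K*(S_i, T_j).
    d2 = ((S[:, None, :] - T[None, :, :]) ** 2).sum(-1)
    Kstar = mu ** (d2 / 2.0)
    # K[m, s, t] = K^m(S_{1:s}, T_{1:t}); K^0 = 1, K^m = 0 for prefixes
    # shorter than m (zeros from initialization, per equation 4.6).
    K = np.zeros((n + 1, ls + 1, lt + 1))
    K[0] = 1.0
    for m in range(1, n + 1):
        for s in range(m, ls + 1):
            for t in range(m, lt + 1):
                # K^m(Ss, Tt) = lam*K^m(S, Tt) + lam*K^m(Ss, T)
                #             - lam^2*K^m(S, T) + lam^2*K*(s, t)*K^{m-1}(S, T)
                K[m, s, t] = (lam * K[m, s - 1, t]
                              + lam * K[m, s, t - 1]
                              - lam**2 * K[m, s - 1, t - 1]
                              + lam**2 * Kstar[s - 1, t - 1] * K[m - 1, s - 1, t - 1])
    return K[n, ls, lt]
```

With λ = 1 and identical inputs, the recursion reduces to counting index tuples, which makes small cases easy to verify by hand.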
A more time-efficient description of the recursive computation of the kernel may be obtained if we notice that the sums in equation 4.5 are reused, up to a constant λ, in calculating the kernel for prefixes of S and T. Caching these sums yields

$$K^n(Ss, Tt) = \lambda K^n(S, Tt) + \lambda K^n(Ss, T) - \lambda^2 K^n(S, T) + \lambda^2 K^*(s, t)\, K^{n-1}(S, T),$$

where s and t are the last samples in the sequences (and the initial conditions of equation 4.6 still hold). Calculating $K^n(S, T)$ using the above recursion is computationally cheaper than equation 4.5 and comprises constructing a three-dimensional array, yielding an $O(\mathrm{len}(S)\,\mathrm{len}(T)\,n)$ dynamic programming procedure.

4.4 Spikernel Variants. The kernels defined by equation 4.1 consider only patterns of fixed length (n). It makes sense to look at subsequences of various lengths. Since a linear combination of kernels is also a kernel, we can define our kernel to be

$$K(S, T) = \sum_{i=1}^n p_i K^i(S, T),$$

where $p \in \mathbb{R}^n_+$ is a vector of weights. The weighted kernel summation can be interpreted as performing the dot product between vectors of the form $[\sqrt{p_1}\,\varphi^1(\cdot), \ldots, \sqrt{p_n}\,\varphi^n(\cdot)]$, where $\varphi^i(\cdot)$ is the feature vector that kernel $K^i(\cdot,\cdot)$ represents. In the results, we use this definition with $p_i = 1$ (a better weighting was not explored). Different choices of $K^*(\alpha, \beta)$ allow different methods of comparing population rate values (once the time bins for comparison are chosen). We chose $d(\alpha, \beta)$ to be the squared $\ell_2$ norm: $\|\alpha - \beta\|_2^2 = \sum_{k=1}^q (\alpha_k - \beta_k)^2$. The kernel $K^*(\cdot,\cdot)$ then becomes

$$K^*(\alpha, \beta) = c\, \mu^{\frac{1}{2}\|\alpha - \beta\|_2^2},$$
with $c = \left(\frac{\pi}{-2\ln\mu}\right)^{q/2}$; however, c can be (and was) set to 1 without loss of generality. Note that $K^*(\alpha, \beta)$ can then be written as $K^*(\alpha, \beta) = e^{-\frac{\ln(1/\mu)}{2}\|\alpha - \beta\|_2^2}$ and is therefore a simple gaussian. It is also worth noticing that if $\lambda \approx 0$,

$$K^n(S, T) \to c\, e^{-\frac{\ln(1/\mu)}{2}\|S - T\|_2^2}.$$
That is, in this limit, no time warping is allowed, and the Spikernel reduces to the standard exponential kernel (up to a constant c).

5 Experimental Method

5.1 Data Collection. The data used in this work were recorded from the primary motor cortex of a rhesus (Macaca mulatta) monkey (approximately 4.5 kg). The animal's care and surgical procedures accorded with The NIH Guide for the Care and Use of Laboratory Animals (rev. 1996) and with the Hebrew University guidelines supervised by the institutional committee for animal care and use. The monkey sat in a dark chamber, and eight electrodes were introduced into each hemisphere. The electrode signals were amplified, filtered, and sorted (MCP-PLUS, MSD, Alpha-Omega, Nazareth, Israel). The data used in this report were recorded on four different days during a one-month period of daily sessions. Up to 16 microelectrodes were inserted on each day. Recordings include spike times of single units (isolated by signal fit to a series of windows) and multiunits (detection by threshold crossing). The monkey used two planar-movement manipulanda to control two cursors (× and + shapes) on the screen to perform a center-out reaching task. Each trial began when the monkey centered both cursors on a central circle for 1.0 to 1.5 s. Either cursor could turn green, indicating the hand to be used in the trial (× for the right arm and + for the left). Then, after an additional hold period of 1.0 to 1.5 s, one of eight targets (0.8 cm diameter) appeared at a distance of 4 cm from the origin. After another 0.7 to 1.0 s, the center circle disappeared (referred to as the Go Signal), and the monkey had to move and reach the target in less than 2 s to receive a liquid reward. We obtained data from all channels that exhibited spiking activity (i.e., were not silent) during at least 90% of the trials. The number and types of recorded units and the number of trials used in each of the recording days are described in Table 1.
At the end of each session, we examined the activity of neurons evoked by passive manipulation of the limbs and applied intracortical microstimulation (ICMS) to evoke movements. The data presented here were recorded in penetration sites where ICMS evoked shoulder and elbow movements. Identification of the recording area was also aided by MRI. More information can be found in Paz, Boraud, Natan, Bergman, and Vaadia (2003).
Table 1: Summary of the Data Sets Obtained.

Data Set   Number of      Number of Multiunit   Mean Spike Rate   Number of           Net Cumulative
           Single Units   Channels              (spikes/sec)      Successful Trials   Movement Time (sec)
1          27             16                    8.6               317                 529
2          22             9                     11.6              333                 550
3          27             10                    8.0               155                 308
4          25             13                    6.5               307                 509
The results that we present here refer to prediction of instantaneous hand velocities (in 2D) during the movement time (from Go Signal to Target Reach times) of both hands in successful trials. Note that some of the trials required movement of the left hand while keeping the right hand steady and vice versa. Therefore, although we considered only movement periods of the trials, we had to predict both movement and nonmovement for each hand. 5.2 Data Preprocessing and Modeling. The recordings of spike times were made at 1 ms precision, and hand positions were recorded at 500 Hz. Our kernel method is suited for use with windows of observed neural activity as time series and with the movement values at the ends of those windows. We therefore chose to partition spike trains into bins, count the spikes in each bin, and use a running window of such spike counts as our data with the hand velocities at the end of the window as the label. Thus, a labeled example (St , vt ) for time t consisted of the X (left-right) or Y (forward-backward) velocity for the left or right hand as the target label vt and the preceding window of spike counts from all q cortical units as the input sequence St . In one such preprocessing, we chose a running window of 10 bins of size 100 ms each (a 1-second-long window), and the time interval between two such windows was 100 ms. Two such consecutive examples would then have nine time bins of overlap. For example, the number of cortical units q in the first data set was 43 (27 single +16 multiple), and the total length of all the trials used in that data set is 529 seconds. Hence, in that session, with the above preprocessing, there are 5290 consecutive examples, where each is a 43 × 10 matrix of spike counts and there are 4 sequences of movement velocities (X/Y and R/L hand) of the same length. Before we could train an algorithm, we had to specify the time bin size, the number of bins, and the time interval between two observation windows. 
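The windowing just described can be sketched as follows (function and argument names are ours); with 100 ms bins and a 100 ms step between examples, consecutive windows share nine of their ten bins:

```python
import numpy as np

def make_examples(counts, vel, n_bins=10):
    """Build labeled examples (S_t, v_t) from binned data.

    counts: q x B array of per-bin spike counts for q cortical units.
    vel:    length-B velocity series aligned to bin ends.
    Each example is the q x n_bins window of counts ending at bin e,
    labeled with the velocity at that bin; consecutive examples
    overlap by n_bins - 1 bins (one-bin stride)."""
    ends = range(n_bins, counts.shape[1] + 1)
    X = [counts[:, e - n_bins:e] for e in ends]   # running windows
    y = [vel[e - 1] for e in ends]                # label at window end
    return np.stack(X), np.asarray(y)
```

For data set 1 (q = 43 units, 529 s of movement at 100 ms bins), this yields on the order of 5290 examples of shape 43 × 10, matching the count given above.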
Furthermore, each kernel employs a few parameters (for the Spikernel, they are λ, µ, n, and pi , though we chose pi = 1 a priori) and the SVM regression setup requires setting two more parameters, ε and C. In order to test the algorithms, we first had to establish which parameters work best. We used the first half of data set 1 in fivefold cross-validation to choose the best
Table 2: Parameter Values Tried for Each Kernel.

Kernel                       Parameters
Spikernel                    λ ∈ {0.5, 0.7, 0.9, 0.99}, µ ∈ {0.7, 0.9, 0.99, 0.999}, n ∈ {3, 5, 10}, p_i = 1, C = 0.1, ε = 0.1
Exponential                  γ ∈ {0.1, 0.01, 0.001, 0.005, 0.0001}, C ∈ {1, 5, 10}, ε = 0.1
Polynomial, second degree    C ∈ {0.001, 10^−4, 10^−5}, ε = 0.1
Polynomial, third degree     C ∈ {10^−4, 10^−5, 10^−6}, ε = 0.1
Linear                       C ∈ {100, 10, 1, 0.1, 0.01, 0.001, 10^−4}, ε = 0.1

Notes: Not all combinations were tried for the Spikernel: n was chosen to be (number of bins)/2, and some peripheral combinations of µ and λ were not tried either.
preprocessing and kernel parameters by trying a wide variety of values for each parameter, picking the combinations that produced the best predictions on the validation sets. For all kernels, the prediction quality as a function of the parameters was single peaked, and only that set of parameters was chosen per kernel for testing on the rest of the data. Having tuned the parameters, we then used the rest of data set 1 and the other three data sets to learn and predict the movement velocities. Again, we employed fivefold cross-validation on each data set to obtain accuracy results on all the examples. The fivefold cross-validation was produced by randomly splitting the trials into five groups: four of the five groups were used for training, and the rest of the data was used for evaluation. This process was repeated five times, using each fifth of the data once as a test set. The kernels that we tested are the Spikernel, the exponential kernel $K(S, T) = e^{-\gamma\|S-T\|^2}$, the homogeneous polynomial kernel $K(S, T) = (S \cdot T)^d$ for d = 2, 3, and the standard scalar product kernel $K(S, T) = S \cdot T$, which boils down to linear regression. In our initial work, we also tested polynomial kernels of higher order and nonhomogeneous polynomial kernels, but as these kernels produced worse results than those we describe here, we did not pursue further testing. During parameter fitting (on the first half of data set 1), we tried all combinations of bin sizes {50 ms, 100 ms, 200 ms} and numbers of bins {5, 10, 20}, except for combinations of bin size and number of bins resulting in windows of length 2 s or more (since full information regarding the required movement was not available to the monkey 2 s before the Go Signal, and neural activity at that time was therefore probably not informative). The time interval between two consecutive examples was set to 100 ms. For each preprocessing, we then tried each kernel with the parameter values described in Table 2.
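The parameter search just described can be sketched as follows. All names are ours, and, to keep the sketch self-contained, we substitute plain kernel ridge regression for the paper's SVR (so C and ε are replaced by a single ridge term; this is an assumption of the sketch, not the authors' setup):

```python
import numpy as np
from itertools import product

def exp_kernel(X, Y, gamma):
    # Exponential kernel K(S, T) = exp(-gamma * ||S - T||^2),
    # with each window flattened to a vector.
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def fivefold_grid(X, y, gammas, ridges, seed=0):
    """Pick (gamma, ridge) by fivefold CV on the mean r^2 score.
    Kernel ridge regression stands in for the paper's SVR."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), 5)
    best, best_score = None, -np.inf
    for gamma, ridge in product(gammas, ridges):
        scores = []
        for k in range(5):
            te = folds[k]
            tr = np.concatenate([folds[j] for j in range(5) if j != k])
            Ktr = exp_kernel(X[tr], X[tr], gamma)
            alpha = np.linalg.solve(Ktr + ridge * np.eye(len(tr)), y[tr])
            pred = exp_kernel(X[te], X[tr], gamma) @ alpha
            r = np.corrcoef(pred, y[te])[0, 1]   # per-fold correlation
            scores.append(r ** 2)
        if np.mean(scores) > best_score:
            best, best_score = (gamma, ridge), np.mean(scores)
    return best, best_score
```

In the paper's procedure, the Gram matrix would instead be precomputed with the Spikernel or one of the other kernels of Table 2 and passed to an SVR solver; the fold-splitting and r²-based selection are the same.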
6 Results

Prediction samples of the Spikernel and linear regression are shown in Figure 3. The samples are of consecutive trials taken from data set 4 without prior inspection of their quality. We show the Spikernel and linear regression as prediction-quality extremes. Both methods seem to underestimate the peak values and are somewhat noisy when there is no movement. The Spikernel prediction is better in terms of mean absolute error. It is also smoother than the linear regression. This may be due to the time-warping quality of the features employed.

Figure 3: Sample of actual movement, Spikernel prediction, and linear regression prediction of left-hand velocity in data set 4 in a few trials: (top) v_y, (bottom) v_x (both normalized). Only intervals between Go Signal and Target Reach are shown (separated by vertical lines). Actual movement in thick gray, Spikernel in thick black, and linear regression in thin black. Time axis is seconds of accumulated prediction intervals from the beginning of the session.

We computed the correlation coefficient (r²) between the recorded and predicted velocities per fold for each kernel. The results are shown in Figure 4 as scatter plots. Each circle compares the Spikernel correlation coefficient scores to those of one of the other kernels. There are five folds for each of the four data sets and four correlation coefficients for each fold (one for each movement direction and hand), making up 80 scores (and circles in the plots) for each of the kernels compared with the Spikernel. Results above the diagonal are cases where the Spikernel outperformed. Note that though the correlation coefficient is informative and considered a standard performance indicator, the SVR algorithm minimizes the ε-insensitive loss. The results in terms of this loss are qualitatively the same (i.e., the kernels are ranked by this measure in the same manner), and we therefore omit them.
Figure 4: Correlation coefficient comparisons of the Spikernel versus other kernels. Each scatter plot compares the Spikernel to one of the other kernels. Each circle shows the correlation coefficient values obtained by the Spikernel and by the other kernel in one fold of one of the data sets for one of the two axes of movement. (a) Compares the Spikernel to the exponential kernel. (b) Compares with the second-degree polynomial kernel. (c) Compares with the third-degree polynomial kernel. (d) Compares with linear regression. Results above the diagonals are cases where the Spikernel outperformed.
Table 3: Chosen Preprocessing Values, Parameter Values, and Mean Correlation Coefficients Across Folds for Each Kernel in Each Data Set.

For each kernel and day, the first row gives LhX RhX and the second row LhY RhY.

Kernel (Preprocessing; Parameters)               Day 1     Day 2     Day 3     Day 4     Total Mean
Spikernel (10 bins, 100 ms;                      .71 .79   .77 .63   .75 .70   .69 .66   .72
  µ = .99, λ = .7, n = 5, C = 1)                 .74 .63   .87 .79   .67 .66   .79 .71
Exponential (5 bins, 200 ms;                     .67 .76   .73 .57   .71 .67   .66 .63   .67
  γ = .01, C = 5)                                .69 .45   .85 .74   .56 .61   .74 .65
Polynomial, second degree (10 bins, 100 ms;      .67 .76   .72 .54   .70 .68   .65 .62   .68
  C = 10^−4)                                     .70 .51   .85 .74   .61 .63   .76 .68
Polynomial, third degree (10 bins, 100 ms;       .65 .72   .69 .52   .67 .68   .63 .61   .66
  C = 10^−5)                                     .66 .51   .84 .71   .61 .62   .75 .64
Linear (dot product) (10 bins, 100 ms;           .54 .65   .62 .34   .66 .64   .51 .51   .57
  C = 0.01)                                      .67 .35   .75 .57   .50 .57   .70 .62

Notes: Total means (across data sets, folds, hands, and movement directions) are computed for each kernel. The Spikernel outperforms the rest, and the nonlinear kernels outperform the linear regression in all data sets.
Table 3 shows the best bin sizes, numbers of bins, and kernel parameters that were found on the first half of data set 1 and the mean correlation coefficients obtained with those parameters on the rest of the data. Some of the learning problems were easier than others (we observed larger differences between data sets than between folds of the same data set). Contrary to our preliminary findings (Shpigelman, Singer, Paz, & Vaadia, 2003), there was no significantly easier axis to learn. The mean scores define a ranking order on the kernels. The Spikernel produced the best mean results, followed by the polynomial second-degree kernel, the exponential kernel, the polynomial third-degree kernel, and, finally, the linear regression (dot product kernel). To determine how consistent the ranking of the kernels is, we computed the difference in correlation coefficient scores between each pair of kernels in each fold and determined the 95% confidence interval of these differences. The Spikernel is significantly better than the other kernels and the linear regression (the mean difference is larger than the confidence interval). The other nonlinear kernels (exponential and polynomial second and third degree) are also significantly better than linear regression, but the differences among these kernels are not significant. For all kernels, windows of 1 s were chosen over windows of 0.5 s as the best preprocessing (based on the parameter-training data). However, within the 1 s window, different bin sizes were optimal for the different kernels: for the exponential kernel, 5 bins of size 200 ms were best, and for the rest, 10 bins of 100 ms were best. Any processing of the data (such as time binning or PCA) can only cause loss of information (Cover & Thomas, 1991)
Table 4: Mean Correlation Coefficients (Over All Data) for the Spikernel with Different Combinations of Bin Sizes and Numbers of Bins.

Number of Bins   50 ms   100 ms   200 ms
5 bins           .52     .65      .68
10 bins          .66     .72      —
20 bins          .66     —        —

Note: The best fit to the data is still 10 bins of 100 ms; combinations yielding windows of 2 s or more were not tested.
but may aid an algorithm. The question of the cells' effective time resolution is not answered here, but for the purpose of linear regression and SVR (with the kernels tested) on our MI data, it seems that binning was necessary. In many cases, a poststimulus time histogram, averaged over many trials, is used to get an approximation of firing rate that is accurate in both value and position in time (assuming that one can replicate the same experimental condition many times). In single-trial reconstruction, this approach cannot be used, and the trade-off between time precision and accurate approximation of rate must be balanced. To better understand what the right bin size is, we retested the Spikernel with bin sizes of 50 ms, 100 ms, and 200 ms and numbers of bins 5, 10, and 20, leaving out combinations resulting in windows of 2 seconds or more. The mean correlation coefficients (over all the data) are shown in Table 4. The 10 bins by 100 ms configuration is optimal for the whole data set as well as for the parameter-fitting step. Note that since the average spike count per second was less than 10, the average spike count in bins of 50 ms to 200 ms was 0.5 to 2 spikes. In this respect, the data supplied to the algorithms are on the border between a rate and a spike train (with crude resolution). Run-time issues were not the focus of this study, and little effort was made to make the code implementation as efficient as possible. Run time for the exponential and polynomial kernels (prediction time) was close to real time (the time it took the monkey to make the movements) on a Pentium 4, 2.4 GHz CPU. Run time for the Spikernel is approximately 100 times longer than for the exponential kernel. Thus, real-time use is currently impossible. The main contribution of the Spikernel is the allowance of time warping.
To better understand the effect of λ on Spikernel prediction quality, we retested the Spikernel on all of the data, using 10 bins of size 100 ms (n = 5, µ = 0.99) and λ ∈ {0.5, 0.7, 0.9}. The mean correlation coefficients were {0.68, 0.72, 0.66}, respectively. We also retested the exponential kernel with γ = 0.005 on all the data with 5 bins of size 100 ms. This would correspond to a Spikernel with its previous parameters (µ = 0.99, n = 5) but no time warping. The correlation coefficient achieved was 0.64. Again, the relevance of time warping to neural activity cannot be directly addressed by our method. However, we can safely conclude that consideration of time-
warped neural activity (in the Spikernel sense) does improve prediction. We believe that it would make sense to assume that this is relevant to cells as well, as mentioned in section 4.1, because the 3D structure of a cell's dendritic tree can align the integration of differently placed inputs from different times.

7 Conclusion

In this article, we described an approach based on recent advances in kernel-based learning for predicting response variables from neural activities. On the data we collected, all the kernels we devised outperform the standard scalar product that is used in linear regression. Furthermore, the Spikernel, a biologically motivated kernel operator, consistently outperforms the other kernels. The main contribution of this kernel is the allowance of time warping when comparing two sequences of population spike counts. Our current research is focused on two directions. First, we are investigating adaptations of the Spikernel to other neural activities, such as local field potentials, and the development of kernels for spike trains. Our second goal is to devise statistical learning algorithms that use the Spikernel as part of a dynamical system that may incorporate biofeedback. We believe that such extensions are important and necessary steps toward operational neural prostheses.

Acknowledgments

We thank the anonymous reviewers for their constructive criticism. This study was partly supported by a Center of Excellence grant (8006/00) administered by the ISF, by BMBF-DIP, and by the United States–Israel Binational Science Foundation. L.S. is supported by a Horowitz fellowship.

References

Brockwell, A. E., Rojas, A. L., & Kass, R. E. (2004). Recursive Bayesian decoding of motor cortical signals by particle filtering. Journal of Neurophysiology, 91, 1899–1907.
Brown, E. N., Frank, L. M., Tang, D., Quirk, M. C., & Wilson, M. A. (1998).
A statistical paradigm for neural spike train decoding applied to position prediction from ensemble firing patterns of rat hippocampal place cells. Journal of Neuroscience, 18, 7411–7425.
Chapin, J. K., Moxon, K. A., Markowitz, R. S., & Nicolelis, M. A. (1999). Real-time control of a robot arm using simultaneously recorded neurons in the motor cortex. Nature Neuroscience, 2, 664–670.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
Donchin, O., Gribova, A., Steinberg, O., Bergman, H., & Vaadia, E. (1998). Primary motor cortex is involved in bimanual coordination. Nature, 395, 274–278.
Fu, Q. G., Flament, D., Coltz, J. D., & Ebner, T. J. (1997). Relationship of cerebellar Purkinje cell simple spike discharge to movement kinematics in the monkey. Journal of Neurophysiology, 78, 478–491.
Genton, M. G. (2001). Classes of kernels for machine learning: A statistical perspective. Journal of Machine Learning Research, 2, 299–312.
Georgopoulos, A. P., Kalaska, J., & Massey, J. (1983). Spatial coding of movements: A hypothesis concerning the coding of movement direction by motor cortical populations. Experimental Brain Research (Supp.), 7, 327–336.
Georgopoulos, A. P., Kettner, R. E., & Schwartz, A. B. (1988). Primate motor cortex and free arm movements to visual targets in three-dimensional space. Journal of Neuroscience, 8, 2913–2947.
Georgopoulos, A. P., Schwartz, A. B., & Kettner, R. E. (1986). Neuronal population coding of movement direction. Science, 233, 1416–1419.
Haussler, D. (1999). Convolution kernels on discrete structures (Tech. Rep. No. UCSC-CRL-99-10). Santa Cruz, CA: University of California.
Isaacs, R. E., Weber, D. J., & Schwartz, A. B. (2000). Work toward real-time control of a cortical neural prosthesis. IEEE Trans. Rehabil. Eng., 8, 196–198.
Jaakkola, T. S., & Haussler, D. (1998). Exploiting generative models in discriminative classifiers. In M. Kearns, M. Jordan, & S. Solla (Eds.), Advances in neural information processing systems. Cambridge, MA: MIT Press.
Laubach, M., Wessberg, J., & Nicolelis, M. A. (2000). Cortical ensemble activity increasingly predicts behavior outcomes during learning of a motor task. Nature, 405(1), 141–154.
Laubach, M., Shuler, M., & Nicolelis, M. A. (1999). Independent component analyses for quantifying neuronal ensemble interactions. J. Neurosci Methods, 94, 141–154.
Lodhi, H., Shawe-Taylor, J., Cristianini, N., & Watkins, C. J. C. H. (2000).
Text classification using string kernels. In S. A. Solla, T. K. Leen, & K.-R. Müller (Eds.), Advances in neural information processing systems. Cambridge, MA: MIT Press.
Magee, J. (2000). Dendritic integration of excitatory synaptic input. Nat. Rev. Neuroscience, 1, 181–190.
Moran, D. W., & Schwartz, A. B. (1999). Motor cortical representation of speed and direction during reaching. Journal of Neurophysiology, 82, 2676–2692.
Nicolelis, M. A. (2001). Actions from thoughts. Nature, 409(18), 403–407.
Nicolelis, M. A., Ghazanfar, A. A., Faggin, B. M., Votaw, S., & Oliveira, L. M. (1997). Reconstructing the engram: Simultaneous, multi-site, many single neuron recordings. Neuron, 18, 529–537.
Paz, R., Boraud, T., Natan, C., Bergman, H., & Vaadia, E. (2003). Preparatory activity in motor cortex reflects learning of local visuomotor skills. Nature Neuroscience, 6(8), 882–890.
Pouget, A., & Snyder, L. (2000). Computational approaches to sensorimotor transformations. Nature Neuroscience, 8, 1192–1198.
Riehle, A., Grün, S., Diesmann, M., & Aertsen, A. M. H. J. (1997). Spike synchronization and rate modulation differentially involved in motor cortical function. Science, 278, 1950–1952.
Schwartz, A. B. (1994). Direct cortical representation of drawing. Science, 265, 540–542.
Segev, I., & Rall, W. (1998). Excitable dendrites and spines: Earlier theoretical insights elucidate recent direct observations. TINS, 21, 453–459.
Serruya, M. D., Hatsopoulos, N. G., Paninski, L., Fellows, M. R., & Donoghue, J. P. (2002). Instant neural control of a movement signal. Nature, 416, 141–142.
Shpigelman, L., Singer, Y., Paz, R., & Vaadia, E. (2003). Spikernels: Embedding spiking neurons in inner product spaces. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15. Cambridge, MA: MIT Press.
Smola, A., & Schölkopf, B. (2004). A tutorial on support vector regression. Statistics and Computing, 14, 199–222.
Taylor, D. M., Tillery, S. I. H., & Schwartz, A. B. (2002). Direct cortical control of 3D neuroprosthetic devices. Science, 1829–1832.
Vaadia, E., Haalman, I., Abeles, M., Bergman, H., Prut, Y., Slovin, H., & Aertsen, A. (1995). Dynamics of neuronal interactions in monkey cortex in relation to behavioral events. Nature, 373, 515–518.
Vapnik, V. (1995). The nature of statistical learning theory. New York: Springer.
Watkins, C. (1999). Dynamic alignment kernels. In P. Bartlett, B. Schölkopf, D. Schuurmans, & A. Smola (Eds.), Advances in large margin classifiers (pp. 39–50). Cambridge, MA: MIT Press.
Wessberg, J., Stambaugh, C. R., Kralik, J. D., Beck, P. D., Laubach, M., Chapin, J. K., Kim, J., Biggs, J., Srinivasan, M. A., & Nicolelis, M. A. (2000). Real-time prediction of hand trajectory by ensembles of cortical neurons in primates. Nature, 408(16), 361–365.

Received August 8, 2003; accepted August 5, 2004.
LETTER
Communicated by Ad Aertsen
Memory Capacity of Balanced Networks

Yuval Aviel
[email protected] Interdisciplinary Center for Neural Computation, Hebrew University, Jerusalem, Israel
David Horn
[email protected] School of Physics and Astronomy, Tel Aviv University, Tel Aviv, Israel
Moshe Abeles
[email protected] Interdisciplinary Center for Neural Computation, Hebrew University, Jerusalem, Israel
We study the problem of memory capacity in balanced networks of spiking neurons. Associative memories are represented by either synfire chains (SFC) or Hebbian cell assemblies (HCA). Both can be embedded in these balanced networks by a proper choice of the architecture of the network. The size w_E of a pool in an SFC or of an HCA is limited from below and from above by dynamical considerations. Proper scaling of w_E by √K, where K is the total excitatory synaptic connectivity, allows us to obtain a uniform description of our system for any given K. Using combinatorial arguments, we derive an upper limit on memory capacity. The capacity allowed by the dynamics of the system, α_c, is measured by simulations. For HCA, we obtain α_c of order 0.1, and for SFC, we find values of order 0.065. The capacity can be improved by introducing shadow patterns, inhibitory cell assemblies that are fed by the excitatory assemblies, in both memory models. This leads to a doubly balanced network, where, in addition to the usual global balancing of excitation and inhibition, there exists a specific balance between the effects of both types of assemblies on the background activity of the network. For each of the memory models and for each network architecture, we obtain an allowed region (phase space) for w_E/√K in which the model is viable.

1 Introduction

An interesting property of neural networks is their ability to function as memory devices. In this mode, memories are embedded in the synaptic connections of the network, to be recalled later. Recall is typically initiated by external ignition of part of the desired memory. The system, using the

Neural Computation 17, 691–713 (2005)
© 2005 Massachusetts Institute of Technology
given hint, should then settle on an associated memory; hence, it is called an associative memory model. Reading out the network's activity, it is possible to decide whether the input memory exists and, if it does, what its constituents are.

One method of embedding memories is that of attractor neural networks (Amit, 1989), where an initial input may lead to a dynamical flow into an attractor, which serves as a Hebbian cell assembly (HCA) (Hebb, 1949) representing the recalled memory through elevated firing rates of its neurons. Another method is that of storing memories in spatiotemporal patterns of activity, such as synfire chains (SFC) (Abeles, 1991). The memories are encoded in the precise firing patterns produced by the given external input. As this method involves fine temporal structure, it is capable of dynamic binding (Bienenstock, 1995). To apply binding (von der Malsburg & Schneider, 1986) to HCAs, one has to resort to other temporal means (Horn, Sagi, & Usher, 1991).

A network memory model should allow a neuron to participate in more than one memory, that is, have memories stored in a distributed manner. In this case, a neuron may receive input also when it is not supposed to fire. This input is due to the overlap between memories and should be treated as noise. Moreover, background activity in the absence of any memory retrieval should also be allowed and may be regarded as being due to some other noise source. One should keep in mind that the statistics of an active cortical tissue is one of high variability. An irregular spike train is observed at the individual neuron level, together with asynchronous activity at the global level (Shadlen & Newsome, 1994). Among the sources of that variability are the roughly 50% of inputs that arrive from other cortical areas (Braitenberg & Schüz, 1991) and probabilistic synaptic release (Huang & Stevens, 1997). A network model capable of displaying stationary noisy activity of this kind is the balanced network (BN) model.
A network is said to be balanced (Shadlen & Newsome, 1994; van Vreeswijk & Sompolinsky, 1998) if each neuron in the network receives equal amounts of excitation and inhibition. Its membrane potential will then fluctuate around some mean value; the firing process is noise driven and therefore irregular (Abeles, 1982; Gerstein & Mandelbrot, 1964). BNs have been shown (Brunel, 2000) to mimic the in vivo firing statistics of cortical tissue, and it is therefore plausible that cortical neurons receive balanced input. BNs have also been shown (Brunel, 2000; van Vreeswijk & Sompolinsky, 1998) to have a stable asynchronous state (AS). Generally, BNs assume sparse and random connectivity. It is possible to embed memories in the connections of a BN, but this would violate the random connectivity assumption. The main consequence of introducing ordered connectivity in an otherwise random connectivity matrix is the appearance of a new critical point beyond which the AS is unstable. This was studied in detail in Aviel, Mehring, Horn, and Abeles (2003).
Memory Capacity of Balanced Networks
693
Having set the background, we will now turn to our memory models of choice. A milestone in the theory of attractor neural networks is the Hopfield model (Hopfield, 1982), capable of storing and retrieving binary vectors by a clever construction of the connectivity matrix of the neural network. The models of Hopfield, as well as some predecessors (Willshaw, Buneman, & Longuet-Higgins, 1969) and followers (Tsodyks & Feigelman, 1988), are, however, based on binary rather than spiking neurons (but see Treves, 1990, for a Hopfield model of linear threshold neurons). All attractor neural networks serve as some kind of implementation of the idea of Hebbian cell assemblies (HCAs) (Hebb, 1949); that is, they contain neuronal ensembles that represent a memory by elevated firing rates. The cell assembly may be characterized by denser intra-assembly connectivity that facilitates sustained activity on partial assembly excitation (Braitenberg, 1978). The Hopfield model is a step backward from the biological complexity of HCAs, but it allows detailed analysis. Gerstner and van Hemmen (1992) showed how the Hopfield model may be incorporated in a network of spiking neurons. A network that reacts to external inputs with reproducible firing patterns (over neurons and time) is said to code its input by spatiotemporal patterns. The model we use for producing spatiotemporal patterns is the synfire chain (SFC) (Abeles, 1982). (For other interesting models, see Izhikevich, Gally, & Edelman, 2004; Levy, Horn, Meilijson, & Ruppin, 2001; and Miller, 1996.) The SFC dictates a well-defined connectivity pattern among neurons in the form of feedforward connections between pools of neurons. Each of the wE neurons in a pool receives L connections from neurons in the previous pool, thus creating a chain of pools. Additional input connections, as well as outputs, are allowed.
If L is large enough, then a synchronized firing volley of most of the neurons in a pool may form a wave of activity that propagates along the chain (Diesmann, Gewaltig, & Aertsen, 1999). Also if L is large enough and if the igniting volley is synchronized and strong enough, the waves are stable in the presence of background noise (Diesmann et al., 1999). To avoid terminological confusion, the feedforward connectivity schemes are referred to henceforth as chains, and the synchronized volley propagating along a chain as a synfire wave, or simply a wave. A wave can propagate in a synchronized manner along a chain, or it can lose its synchrony, dissolving into the background activity. A wave is said to be stable if it remains as a synchronized volley for more than 100 ms. Having introduced all the ingredients—asynchronous activity at the global level, attractors or spatiotemporal patterns at the assemblies’ level, and irregular spiking at the neuronal level—we may state our mission: we seek a balanced network of spiking neurons that can serve as a high-capacity memory device. To establish the goal, we use BNs of integrate-and-fire (IF)
694
Y. Aviel, D. Horn, and M. Abeles
neurons, similar to the model described in Brunel (2000). As memory models, we use either HCAs or SFCs. HCAs embedded in a recurrent network were suggested as a model (Amit & Brunel, 1997) of the persistent activity observed in delayed match-to-sample experiments performed in the inferotemporal and prefrontal cortex areas (Miyashita & Chang, 1988). Within this model, the network, in the absence of memory stimulation, exhibits sustained asynchronous activity. After learning, elevated activity within an HCA is obtained if a familiar stimulus is presented briefly. Wang (1999) found that in order to realize stable, low-rate, persistent activity coexisting with a stable resting state, recurrent excitation should be primarily mediated by kinetically slow synapses of the NMDA type. Later, Compte, Brunel, Goldman-Rakic, and Wang (2000) examined the synaptic mechanisms of selective persistent activity underlying spatial working memory in the prefrontal cortex. Their model reproduces the phenomenology of the oculomotor delayed-response experiment of Funahashi, Bruce, and Goldman-Rakic (1989). Brunel and Wang (2001) studied the effects of external input and neuromodulation on persistent activity in a working memory model. (Additional reviews on HCAs can be found in Sommer and Wennekers, 2000, 2001.) In Aviel, Mehring et al. (2003), we studied the embedding of SFCs in a BN of IF neurons. The main obstacle that we found is the conflict between the constraints due to embedded memories, on one hand, and the stability of the AS, on the other. These conflicting demands are not easily resolved. This is in agreement with the study of Mehring, Hehl, Kubo, Diesmann, and Aertsen (2003), where a similar setup was used with added topological, cortex-like connectivity. The authors report a fairly small parameter regime where an SFC can be embedded and recalled without destabilizing the AS. None of these publications paid special attention to the question of high capacity.
The focus of this article is capacity. We load the system with memory patterns and look for conditions under which we can (1) maximize the load, (2) keep the asynchronous state stable, and (3) recall every memory in a stable manner. Obtaining maximal capacity is, of course, a desirable goal. Many articles have dealt with this kind of question in networks of binary neurons. While far from biological reality, binary neurons lend themselves to analytical examination. The Hopfield model, for example, is useful because it can be analyzed theoretically. In particular, its behavior with respect to memory load is analyzed in Amit, Gutfreund, and Sompolinsky (1985), where it is shown that the maximal number of uncorrelated memory patterns, Pmax, is linear in N, the number of neurons: Pmax = αc·N, with αc = 0.14. Applying results obtained in a binary neural network to networks of spiking neurons is not straightforward. Spiking neurons involve membrane time constants and nonlinear resetting, which lead to much richer system dynamics. Little is known about the capacity of networks of spiking neurons.
Sommer and Wennekers (2001) obtained a high-capacity limit of HCAs in a symmetrically coupled network of 100 Pinsky-Rinzel neurons. An inhibitory loop provided negative feedback, leading to a dynamic threshold control that can improve capacity. Their network operated in the oscillatory regime and obtained high capacity. Here we limit ourselves to the asynchronous regime within a BN, which introduces other types of constraints. The capacity of SFCs has also been investigated. Bienenstock (1995) used an r-winners-take-all model and applied signal-to-noise analysis to it. Herrmann, Hertz, and Prugel-Bennett (1995) and Hertz (1999) reduced the IF model to a binary model so that statistical mechanics methods could be used. Both arrived at similar conclusions; again Pmax = αc·N, but now αc ≈ 8. Here Pmax is the maximal number of pools. In section 2, our model is presented in detail; in section 3, the scaling of the system is discussed; in section 4, the results are reported; and in section 5, we discuss the results.

2 The Model

We employ three levels of modeling in our system. On the microscopic level, a neuronal model is required, for which we use an IF model. On the intermediate level, we model memory patterns through synfire chains and Hebbian cell assemblies. On the macroscopic level, we use a balanced network and discuss its modification into a doubly balanced network. For simulations, we used the SYNOD environment (Diesmann, Gewaltig, & Aertsen, 1995) with the Paranel kernel (Morrison, Mehring, Geisel, Aertsen, & Diesmann, 2004). The neuronal model was integrated with time steps of 0.1 ms.

2.1 Single Neuron Model. Following Lapique (Tuckwell, 1988), we use an IF model in which the ith neuron's membrane potential, Vi(t), obeys the equation

τ·dVi(t)/dt = −Vi(t) + R·Ii(t),  (2.1)
where Ii(t) is the synaptic current arriving at the soma and R is the membrane resistance. Spikes as well as postsynaptic currents are modeled by delta functions; hence, the input is written as

R·Ii(t) = Σj Σf Jij·δ(t − tj^f − τdelay),  (2.2)

where the first sum is over different neurons, and the second sum represents their spikes, arriving at times t = tj^f + τdelay. Here tj^f is the emission time of the f th spike by neuron j, and τdelay is a transmission delay, which we assume
here to be the same for any pair of neurons. The sum is over all neurons that project their output to neuron i, both local and external afferents. The strength of the synapse that neuron j forms on neuron i is Jij. When Vi(t) reaches the firing threshold θ, an action potential is emitted by neuron i, and after a refractory period τrp, during which the potential is insensitive to stimulation, the depolarization is reset to Vreset. The following parameters were used in all simulations: the transmission delay τdelay = 1.5 ms, the threshold θ = 20 mV, the membrane time constant τ = 10 ms, the refractory period τrp = 2.5 ms, the resetting potential Vreset = 0 mV, and the membrane resistance R = 40 MΩ. The inhibitory and excitatory neurons have identical parameters.
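As an illustration of this single-neuron model, the sketch below (ours, not the SYNOD code used in the article) Euler-integrates equation 2.1 with delta-pulse PSPs at the parameter values just listed; the input rates `nu_rec` and `nu_ext` are illustrative assumptions chosen to place the neuron in the balanced, fluctuation-driven regime.

```python
import numpy as np

rng = np.random.default_rng(0)

# Single-neuron parameters from the text (ms, mV)
tau, theta, v_reset, tau_rp, dt = 10.0, 20.0, 0.0, 2.5, 0.1
# Network parameters from section 2.3
K, gamma, g, J0 = 1500, 0.25, 5.0, 10.0
J = J0 / np.sqrt(K)                    # excitatory PSP jump (eq. 2.3), mV
JI = -g * J0 / np.sqrt(gamma * K)      # inhibitory PSP jump (eq. 2.3), mV
nu_rec, nu_ext = 0.005, 0.010          # assumed rates, spikes/ms per synapse

def simulate(T_ms=1000.0):
    """Firing rate (Hz) of one neuron under Poisson barrages of delta PSPs."""
    v, refractory, spikes = 0.0, 0.0, 0
    for _ in range(int(T_ms / dt)):
        n_e = rng.poisson((nu_rec + nu_ext) * K * dt)  # excitatory arrivals
        n_i = rng.poisson(nu_rec * gamma * K * dt)     # inhibitory arrivals
        if refractory > 0.0:           # potential insensitive during tau_rp
            refractory -= dt
            continue
        v += -v / tau * dt + J * n_e + JI * n_i        # leak + delta PSPs
        if v >= theta:                 # threshold crossing: spike and reset
            spikes += 1
            v, refractory = v_reset, tau_rp
    return spikes * 1000.0 / T_ms

rate = simulate()
```

With these assumed rates, the mean drive is only slightly excitatory, so the membrane potential fluctuates below threshold and spikes are emitted irregularly, as in the balanced regime discussed in section 3.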
2.2 Memory Model. Two memory models are explored, constructed on the basis of either the HCA or the SFC. HCAs form an associative memory model in a biologically plausible network. An assembly is a group of wE randomly chosen excitatory neurons. The assembly is distinguished by a dense intra-assembly connectivity: a neuron in an assembly receives L connections from other neurons in the assembly. The dense connectivity allows for a sustained high firing rate once an assembly is ignited. A neuron will typically participate in more than one assembly; in that case, it will have high connectivity with all these assemblies. If two neurons happen to participate together in two assemblies, we assign two different synapses to the connectivity induced by the two memories. All excitatory (inhibitory) synaptic connections are assumed to be of equal strength, J (JI), and a stronger bond between a pair of neurons is brought about by multiple synapses. Each neuron is assumed to have a fixed synaptic resource, made of K excitatory (KI inhibitory) "synaptic units" J (JI). This constraint on synaptic resources will impose an upper bound on the number of memory patterns one can embed in the network. SFCs (Abeles, 1991) are feedforward connections among neuronal pools, with a fixed number of excitatory neurons, wE, in each pool; wE is also referred to as the chain width. To wire an SFC, we randomly pick pools of wE neurons and connect them in a feedforward manner, thus obtaining a chain of pools. Converging and diverging connections form a link between two consecutive pools. As with HCAs, a neuron will typically participate in more than one pool. In contrast to HCAs, each neuron in a pool receives L specific connections from the previous pool, not from the pool to which it belongs. If wE is large enough, then a synchronized firing volley of most of the neurons in a pool may propagate along the chain, forming a synfire wave, a stable wave of activity (Diesmann et al., 1999).
The wave signals an activity of a specific memory object. The wave can be (de-) synchronized with other waves, allowing for a mechanism of dynamically (un-) binding objects (Hayon, 2002).
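The SFC wiring rule described in this section can be sketched as follows. The function `wire_sfc` and its argument names are hypothetical, and the completion of every neuron's synapse count to exactly K random sources (described above) is omitted.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical wiring sketch (names are ours): P pools of w_E randomly
# chosen excitatory neurons, each neuron of pool p+1 receiving L converging
# connections from pool p.
def wire_sfc(N_E, P, w_E, L):
    """Return pool membership and the list of (pre, post) chain synapses."""
    pools = [rng.choice(N_E, size=w_E, replace=False) for _ in range(P)]
    conn = []
    for p in range(P - 1):
        for post in pools[p + 1]:
            # multiple synapses between the same pair are allowed
            for pre in rng.choice(pools[p], size=L, replace=True):
                conn.append((int(pre), int(post)))
    return pools, conn

# Table 1 gives C_L = C_w for SFCs, i.e., L = w_E; w_E = 136 as in Figure 1
pools, conn = wire_sfc(N_E=12000, P=10, w_E=136, L=136)
```

Each neuron of a non-initial pool ends up with exactly L incoming chain synapses, matching the converging-connection rule of the text.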
In these two memory models, we have three parameters: wE, the size of the memory pattern (either an assembly or a pool); L, the interpattern connectivity; and P, the number of memories (assemblies or pools). Ignition of a memory pattern is performed by increasing the external input to all members of the pattern during 5 ms. An HCA pattern is said to be stable if, after ignition, it manifests a sustained elevated firing rate for at least 100 ms. A synfire wave is said to be stable if it propagates along the chain for at least 100 ms. Both patterns have a minimal value of wE, wmin, below which patterns are unstable. The minimum is due to the condition that the firing of the assembly or pool must guarantee the continuing firing of the same assembly or of the next pool. In the case of SFCs (Aviel, Mehring et al., 2003; Tetzlaff, Buschermöhle, Geisel, & Diesmann, 2003), wE also has a maximal value, above which spontaneous emergence of waves occurs. This is also likely to be the case for HCAs. Once the wiring of all memory patterns is done, more connections are added such that if an excitatory neuron has fewer than K excitatory (KI inhibitory) synapses, random sources are chosen from the network until it reaches exactly K excitatory (KI inhibitory) synapses. The resulting connectivity is a mixture of random connectivity and ordered connectivity.

2.3 Network Model. The excitatory population consists of NE excitatory neurons, and the inhibitory one consists of NI ≡ γ·NE inhibitory neurons. A sparse connectivity is required to adhere as closely as possible to biological values and to induce a source of randomness. In addition to the external input (K excitatory afferents), each neuron in the network receives exactly K excitatory and KI = γ·K inhibitory afferents from the excitatory and inhibitory populations, respectively. In our simulations, we use K = εNE, with ε = 0.1. Our formalism allows other relations as well, for example, K = Const as NE grows.
(This will be elaborated in section 5.) If a neuron of population y (either E or I) innervates a neuron of population x (either E or I), its synaptic strength Jxy is defined as follows:

J ≡ JxE = J0/√K,  JI ≡ JxI = −g·J0/√KI,  (2.3)

where J0 is a constant. Note that JI = −(g/√γ)·J; hence, the constant g/√γ is the relative strength of the inhibitory synapses. A Poisson process with rate vext·K simulates the external input. Here vext ≡ v·vthre, where vthre is the minimal rate needed to emit a spike within τ milliseconds (on average) in a neuron that gets balanced input (Brunel, 2000). The network parameters used in this article are γ = 1/4, ε = 0.1, g = 5, J0 = 10, and v = 1/20. The total input to a neuron is therefore nearly balanced. There is a small bias toward an excess of excitation, which controls the firing rates.
3 Scaling

Scaling is a crucial issue in modeling complex systems. It tells us how parameters should be modified (scaled) as the number of elements (or the size) of a system changes. In our case, the system is the network with its connectivity matrix, the external input, and the single-neuron dynamics. The number of elements is the number of neurons or synapses in the network. Complex systems often show complex behavior that depends on the parameters of the system. Under proper scaling, this behavior should depend only weakly on the size of the system. In this regard, it makes sense to consider the thermodynamic limit, where the number of elements goes to infinity. In this section, we propose a scaling scheme that allows us to capture the essential behavior of the model for all values of K. We start by scaling the synaptic weight, J. In model calculations, the synaptic weight is typically chosen as a function of the number of synapses, K,

J = J0/K^β,  (3.1)
where β takes the value 0, 0.5, or 1. The mean field approach using β = 1 leads to weak coupling in the high-K regime. Examples of such studies are the Hopfield model (Hopfield, 1982) and phase-variable analysis (Hansel, Mato, & Meunier, 1995). Constant synaptic strength, β = 0, has also been used in the literature. Brunel (2000), for example, used fixed synapses together with the assumption that the constant postsynaptic potential (PSP) contribution is small relative to the distance between threshold and resting potential. In this regime too, the synaptic coupling is weak, and one can model the barrage of incoming PSPs as gaussian noise. Since the contribution of each PSP is small, the overall change in the membrane potential is smooth, and a diffusion approximation can be used. Following van Vreeswijk and Sompolinsky (1998), we propose using β = 0.5. The reason underlying this choice comes from requiring the firing rate of each neuron to be independent of the number of synapses, K. The same logic applies to the inhibitory synaptic strength JI. Choosing β = 0.5, we obtain equation 2.3. Our system operates under balanced conditions; the mean input to a neuron is near zero, and fluctuations drive the spiking process. Let hx be the field generated by synapses of population x (x = E or I) felt by some neuron, and let us assume that the system is in an asynchronous state (AS). Under these conditions, all synaptic inputs can be modeled by a Poisson process with rate v. To simplify calculations, we assume that both populations fire at the same rate v. Using equation 2.3, the mean and variance of the input can be written as

µ ≡ hE − hI = v·(J·K − |JI|·KI) = J·v·K·(1 − g·√γ) = J0·v·√K·(1 − g·√γ),  (3.2)

σ² = var(hE) + var(hI) = v·(J²·K + JI²·KI) = J²·v·K·(1 + g²) = J0²·v·(1 + g²).  (3.3)
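A quick numerical check of equations 3.2 and 3.3 (a sketch of ours): with J = J0/K^β, only β = 0.5 leaves both the mean and the variance of the input invariant under changes of K. Here g is set to an illustrative value of 2 (not the g = 5 of the simulations) so that g·√γ = 1 and the mean field vanishes exactly.

```python
# Numerical check of equations 3.2 and 3.3: with J = J0/K**beta, only
# beta = 0.5 leaves the input statistics invariant under changes of K.
# g = 2 is an illustrative value chosen so that g*sqrt(gamma) = 1.
J0, g, gamma, v = 10.0, 2.0, 0.25, 0.05

def field_stats(K, beta):
    J = J0 / K**beta                       # excitatory weight magnitude
    JI = g * J0 / (gamma * K)**beta        # inhibitory weight magnitude
    KI = gamma * K
    mu = v * (J * K - JI * KI)             # mean input field, eq. 3.2
    sigma2 = v * (J**2 * K + JI**2 * KI)   # input variance, eq. 3.3
    return mu, sigma2

stats = {K: field_stats(K, beta=0.5) for K in (500, 1000, 1500)}
# for beta = 0.5: mu = 0 at balance and sigma2 = v*J0**2*(1 + g**2) for all K
```

Repeating the loop with β = 0 or β = 1 makes either the mean or the variance grow or shrink with K, which is the point made in the text.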
From equation 3.2, we see that g·√γ controls the balance between the two populations. The mean of the field is zero if g·√γ = 1, and it is then independent of K. The variance is also independent of K. This would not have been the case for β = 0 or 1. On average, therefore, an input of √K excitatory PSPs produces a spike. The minimal rate of external input that is needed in order to emit a spike within τ ms (on average) in a neuron that does not get other inputs (Brunel, 2000) is defined to be vthre ≡ θ/(τ·J·√K) = θ/(τ·J0). Again, vthre is independent of K. In our simulations, g > 1/√γ, so there is a potential excess of inhibition, but there is also an additional excitatory external input. Even if the total input is not exactly balanced, the feedback between the populations, which is of order √K according to equation 3.2, leads to an appropriate change of the firing rates that stabilizes the AS (van Vreeswijk & Sompolinsky, 1998). Another hint for setting β to 0.5 comes from our previous study (Aviel, Mehring et al., 2003), where we demonstrated that wE has to obey wE = Const·√K. In this case, an input from wE synapses, J·wE, should lead to firing with high probability. This means that J·wE ≈ O(1), that is, J is proportional to 1/√K. As a consequence, we set the pattern size wE = Cw·√K and the intrapattern connectivity L = CL·√K, where Cw and CL are the scaling prefactors. The scaling of the external rate and the synaptic weights is given by vext ≡ v·vthre = v·θ/(τ·J0) and J ≡ J0/√K, respectively, as discussed in section 2. We use the values in Table 1.

Table 1: Parameter Values.

Parameter   HCA       SFC
J0          10        10
Cw          3.3       4
CL          0.75·Cw   Cw
v           0.05      0.05
The usefulness of the scaling of pattern connectivity and external rate will become evident from the discussion that follows. In the next section we discuss properties of the resulting balanced network.
4 Results

4.1 Balanced Network. In Aviel, Mehring et al. (2003), we studied the embedding of SFCs in the synaptic connectivity matrix of a balanced network. There, as we do here, we enforced two constraints: the AS has to be the stable background mode of the system, and the synfire wave has to propagate on top of it in a stable manner. Based on simplified models and simulations, we concluded that these two constraints could be met if K is large enough (K > Const·wmin²). In this article, we take the network load α ≡ P/NE into consideration, using synaptic scaling. This allows further refinement of our statements. Brunel's network (2000) serves here as the starting point, in which we embed HCA or SFC patterns. Apart from introducing ordered patterns in its connectivity matrix, we also use a particular scaling procedure. Synaptic weights, instead of being constants, are now scaled like 1/√K (van Vreeswijk & Sompolinsky, 1998). This in turn leads to vthre = θ/(τ·J0) = Const. We verified by simulations that the mean firing rate (in the AS) is indeed linearly related to vext and only weakly dependent on NE. The analysis in the appendix of Aviel, Mehring et al. (2003) supports square root scaling. We have shown there that wE is limited by two constraints:

wmin < wE < wmax,  (4.1)
where wmax ≡ Cb·√K. The upper bound follows from requiring stability of the AS, and the lower bound is posed by the demand of wave stability. We can estimate wmin as the mean number of excitatory PSPs needed to evoke an action potential with high probability. Assuming a normal distribution of the membrane potential, wmin can be approximated by twice the distance between threshold and V̄ in units of J, where V̄ is the mean membrane potential and J is the synaptic unit:

wmin = 2·(θ − V̄)/J.  (4.2)

Substituting equation 4.2 in equation 4.1, we notice that wE is sandwiched between an upper bound of order √K and a lower bound of order K^β. Unless β ≤ 0.5, wE has no valid value in the limit of large K. By choosing β = 0.5, and accordingly wmin ≡ Ca·√K, equation 4.1 can be rewritten as

Ca < Cw < Cb.  (4.3)
Under these conditions, we expect the SFC to be stable in our system regardless of the number of synapses K.
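The K-independence of the window in equations 4.1 to 4.3 can be checked numerically. In this sketch of ours, V̄ (`Vbar`) and the upper-bound prefactor `Cb` are illustrative assumptions, not values from the article; only the scaling behavior is the point.

```python
import math

# Sketch of the stability window of equations 4.1-4.3. Vbar and Cb are
# illustrative assumptions; the point is that both bounds on w_E scale
# like sqrt(K), so the window on the prefactor C_w is K-independent.
theta, J0 = 20.0, 10.0
Vbar = 10.0     # assumed mean membrane potential, mV
Cb = 4.5        # assumed upper-bound prefactor (estimated by simulation)

def width_window(K):
    J = J0 / math.sqrt(K)
    w_min = 2 * (theta - Vbar) / J        # eq. 4.2, scales like sqrt(K)
    w_max = Cb * math.sqrt(K)             # AS-stability bound
    Ca = w_min / math.sqrt(K)             # = 2*(theta - Vbar)/J0, K-independent
    return Ca, w_min, w_max

Ca, w_min, w_max = width_window(1500)     # a window exists since Ca < Cb
```

For any K, the prefactor Ca comes out as 2·(θ − V̄)/J0, so the admissible range Ca < Cw < Cb does not shrink as the network grows.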
In Figure 1a, we can see a wave propagating along a chain of 750 pools in a BN of 15,000 excitatory neurons. If the chain width exceeds a critical value or if the network load exceeds capacity, global oscillations appear. In Figure 1b, we increase the number of pools embedded in the network. This leads to the emergence of global oscillations after wave ignition. We also embedded HCAs according to the protocol described in section 2. In a lightly loaded system, as in Figure 2a, an evoked assembly sustained its activity for hundreds of milliseconds without provoking strong global oscillations. As in the SFC case, Figure 2b shows that the global oscillations become prominent as the load increases. A surprisingly high capacity is obtained in both cases. In order to quantify the emergence of global oscillations, we consider the values of the population rate, that is, the number of spikes of a population in 1 ms. The average and standard deviation (SD) of these values are taken over a time window of 300 ms in order to compute a population's coefficient of variation (CV). The CV is the SD divided by the mean, and in our case it measures the amount of synchrony of the population activity. A CV near 1 means an SD that is close to the mean, which is in agreement with an AS. A high CV is the result of a high SD, which implies global synchronous (aperiodic) oscillations in our model. A low CV (< 0.5) signifies high firing rates. Low CV will become relevant in the next section, where the mild oscillations in the background activity during wave propagation are addressed. In Figure 3, the CV of the population rate is plotted as a function of the network load, P/NE, for various network sizes. We present statistics for two cases: pre- and postpattern ignition. In the former, the time window is taken prior to pattern ignition (time 200-500 ms), and in the latter, the time window is taken postignition (500-800 ms). All curves exhibit transitions near critical points.
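The synchrony measure just described can be sketched as follows, using synthetic spike times rather than simulation output; the surrogate data and function name are ours.

```python
import numpy as np

rng = np.random.default_rng(2)

def population_cv(spike_times_ms, t0, t1):
    """CV (SD/mean) of the population rate binned in 1 ms windows."""
    counts, _ = np.histogram(spike_times_ms, bins=np.arange(t0, t1 + 1.0, 1.0))
    return counts.std() / counts.mean()

# Synthetic surrogates: an asynchronous (Poisson-like) population and an
# oscillatory one with spikes clustered every 10 ms.
async_spikes = rng.uniform(200.0, 500.0, size=3000)
osc_spikes = rng.integers(20, 50, size=3000) * 10.0 + rng.normal(0.0, 1.0, 3000)

cv_async = population_cv(async_spikes, 200.0, 500.0)
cv_osc = population_cv(osc_spikes, 200.0, 500.0)
# the oscillatory surrogate yields the higher CV
```

The asynchronous surrogate spreads its spike counts evenly over the 300 bins, while the clustered surrogate concentrates them in a few bins, inflating the SD relative to the mean, which is exactly what the high-CV signature of global oscillations captures.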
We denote the critical value of the network load by Pc, and we define the capacity of the network as αc ≡ Pc/NE. The capacities obtained here are 0.1 and 0.065 for HCAs and SFCs, respectively. One should note an important difference between HCAs and SFCs: whereas for embedded HCAs the pre- and postignition curves are indistinguishable, the SFCs show a significantly higher network CV after their ignition.

4.2 Doubly Balanced Network. In the background activity of Figure 1, oscillations can be seen during wave propagation. In this section, we provide a mechanism that eliminates this problem. The solution we arrived at involves architectural modifications to the original SFC structure. Let us attach a pool of randomly chosen inhibitory neurons, a shadow pool, to each excitatory pool in a synfire chain. A neuron in an excitatory pool projects its output not only to the next pool in the chain, but also to all neurons in its shadow pool. A neuron in the shadow pool does not project its output in any ordered manner, but diffuses its output randomly to the rest
[Figure 1: two raster plots of neuron number versus time (ms). Panel A: K=1500, P=750, w=136, d=0.0. Panel B: K=1500, P=1050, w=136, d=0.0.]
Figure 1: Raster plots of an SFC embedded in a BN with K = 1500. (a) α = 0.05, (b) α = 0.07. Neurons that participate in more than one pool may appear more than once on the raster plot, whose y-axis is ordered according to pools and represents every second neuron in each pool. The first 40 pools of the SFC are presented on the y-axis.
[Figure 2: two raster plots of neuron number versus time (ms). Panel A: K=1500, P=750, w=128, d=0.0. Panel B: K=1500, P=1575, w=128, d=0.0.]
Figure 2: Raster plots of HCAs embedded in a BN with K = 1500. (a) α = 0.05. (b) α = 0.105. The twelfth pattern is externally ignited at time t = 500 ms for 5 ms. Neurons that participate in more than one HCA may appear more than once on the raster plot, whose y-axis is ordered according to HCAs, and represents every second neuron in each pattern. Forty HCAs are presented on the y-axis.
[Figure 3: two panels, A and B, plotting population CV against network load P/NE.]
Figure 3: BN population CV as a function of α for various values of K. (a) HCAs are embedded. (b) An SFC is embedded. Post- and preignition curves are indicated by o and + markers, respectively. In both cases, statistics were gathered from the 300 ms pre- and postignition. Curves for three values of K (500, dotted curve; 1000, dashed curve; and 1500, solid line) are superimposed. The transitions to the oscillatory mode occur near the same α value, αc (0.1 for HCAs, 0.07 for SFCs preignition, and 0.065 for SFCs postignition), for all K values.
Figure 4: A modified synfire chain. Excitatory neurons (open circles) construct the traditional excitatory chain, characterized by the full feedforward connections. Inhibitory neurons (filled circles) form the shadow pools. Each shadow pool receives excitation from its associated excitatory pool. A neuron can participate more than once in the chain. Additional excitatory (solid line) and inhibitory (dashed lines) inputs and outputs are allowed (marked with curved lines).
of the network, as in a completely random network. Similar connectivity, but for different reasons, was suggested in Hayon (2002) and Mehring et al. (2003). A sketch of the connectivity scheme of a modified synfire chain is shown in Figure 4. These inhibitory pools do not carry specific information down the chain, as is the case for the excitatory pools, but rather echo a synchronized activity of their attached excitatory pool. The role of the shadow pools is to guarantee a correct amount of inhibition during the propagation of the wave. The excess of converging connections in the network, due to the embedded chain, induces an excess of correlated excitation, which may lead to global excitation (Aviel, Mehring et al., 2003). The sole purpose of the shadow pools is to cancel that excess of excitation. An analogous shadow pattern is associated with every HCA in an HCA model. The size of a shadow pattern is defined as wI ≡ d̃·wE. This leads to the factor d, representing the relative strength of inhibitory to excitatory currents, due to a pattern or pool, affecting a neuron that is connected to both:

d ≡ −JI·wI/(J·wE) = g·J0·√K·d̃/(J0·√KI) = g·d̃/√γ.

In other words, wI = d·(√γ/g)·wE. In the simulations reported below, we use d = 2 for HCAs and d = 1 for SFCs, that is, wI = 0.2·wE and wI = 0.1·wE, respectively. These values were chosen after a search for optimal d values that lead to the smoothest background behavior after pattern ignition.
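A quick check of the shadow-pattern sizing with the network parameters γ = 1/4 and g = 5; the pattern widths are taken from the titles of Figures 1 and 2, and the function name is ours.

```python
import math

# Check of the shadow-pattern sizing w_I = d*(sqrt(gamma)/g)*w_E with the
# network parameters gamma = 1/4 and g = 5.
gamma, g = 0.25, 5.0

def shadow_width(w_E, d):
    return d * math.sqrt(gamma) / g * w_E

w_I_hca = shadow_width(128, d=2)   # HCA width from Figure 2: 0.2 * w_E
w_I_sfc = shadow_width(136, d=1)   # SFC width from Figure 1: 0.1 * w_E
```

With these parameters, d·√γ/g evaluates to 0.2 for d = 2 and 0.1 for d = 1, reproducing the ratios wI = 0.2·wE and wI = 0.1·wE quoted in the text.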
[Figure 5: a single panel plotting population CV against network load P/NE, with the SFC and HCA transitions marked.]
Figure 5: DBN preignition population CV as a function of α, for three values of K (500, dotted curve; 1000, dashed curve; and 1500, solid line). The transition in SFC (+) occurs before that of HCA (the square marks) and is independent of K. Capacity is 0.115 and 0.07 for HCA and SFC, respectively.
We refer to this new type of network as a doubly balanced network (DBN). To avoid confusion, we will refer to the previous case, d = 0, as a singly balanced network (SBN). In Figure 5, we repeat the simulations of Figure 3, only this time we simulate DBNs. Since the pre- and postignition curves overlap in the DBN, we present only the preignition curve. The two cases are plotted on the same scale, so that the difference in capacity can be appreciated. As is evident, the capacity of the DBN is superior to that of the SBN in both cases.

4.3 Capacity. In this section, we show that the maximal number of patterns that can be embedded in the network is limited by combinatorial considerations of synaptic resources and is proportional to N. That such a limit has to exist follows from simple counting of all excitatory synapses. On one hand, we know there are K·N such synapses available. On the other, we know that the P embedded patterns use up P·w² of them (w² per HCA or per pair of consecutive pools in an SFC). Hence, P·w² < K·N, or α = P/N < K/w² = 1/Cw². A more careful analysis can take care of the accounting of synapses in the way they are assigned within our model. We divide the K excitatory synapses of each neuron into chunks of L synapses. A neuron of population x (E or I) can participate in at most m ≡ K/L patterns due to synaptic constraints. The total number of chunks, Nx·m, sets an upper bound on the
number of patterns, since embedding P patterns requires wx·P chunks. Hence we get

wx·Pmax ≤ m·Nx,  (4.4)

where Pmax is the maximal number of patterns or pools. Next, defining αx ≡ Pmax/Nx, we find that

αx ≤ m/wx = (K/(CL·√K))/wx = √K/(CL·wx).  (4.5)

To leading order in NE, this turns into

αx·Nx = (K/(CL·√K))/(Dx·Cw·√K) · NE = (Cw·CL·Dx)⁻¹·NE − O(√NE),  (4.6)

where Dx ≡ d/(g/√γ) if x = I, and Dx = 1 if x = E. Thus, we conclude that synaptic combinatorial considerations lead to a maximal number of patterns Pmax. If DI ≥ 1, then γ·αI < αE, and the inhibitory neurons set the maximum value of P. Otherwise, if DI < 1, the excitatory neurons set Pmax:

Pmax = αmax·NE,  αmax ≡ min{(Cw·CL)⁻¹, (Cw·CL·DI)⁻¹}.  (4.7)
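The combinatorial bound of equation 4.7 is easy to evaluate. With the HCA prefactors of Table 1 and the excitatory case DE = 1, the bound gives αmax ≈ 0.12 (the variable names below are ours):

```python
# Combinatorial capacity of equation 4.7 with the HCA prefactors of
# Table 1 (Cw = 3.3, CL = 0.75*Cw) and D_E = 1: alpha_max = 1/(Cw*CL).
Cw = 3.3
CL = 0.75 * Cw
alpha_max = 1.0 / (Cw * CL)     # excitatory bound

def p_max(N_E):
    """Maximal number of embeddable patterns for a given network size."""
    return alpha_max * N_E
```

Since αmax is a pure prefactor expression, Pmax grows linearly with NE, which is the combinatorial line plotted in Figure 6.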
In our DBN, as well as in the SBN where DI < 1, the excitatory neurons determine the limit to be Pmax = (Cw·CL)⁻¹·NE. Substituting the parameters of our HCA in equation 4.5, we get Pmax ≈ 0.12·NE. In Figure 6, Pc of HCAs is plotted as a function of NE. A linear relation can be observed up to values of NE as high as 45,000. Furthermore, the proximity of the dynamical capacity to the combinatorial one is evident. Results for cases of the SBN are also plotted for comparison. As already noted, their dynamical capacity is lower than that of DBNs.

4.4 Allowed Phase Space. We are now in a position to return to the issue of scaling and inquire about the dependence of Cb of equation 4.3 on the network load. We fix α and progressively increase wE in an SBN until global oscillations appear. As in Figures 3 and 5, the curves display a critical point, wc. This criticality was discussed in Aviel, Mehring et al. (2003), where we found wc = Cb·√K. In Figure 7, we repeat these simulations for various values of α and K. Each simulation (Cw, α, K) is repeated five times to give an indication of the trial error. We also plot a crude estimate, based on simulations, of Ca and the combinatorial capacity upper limit, αmax, according to equation 4.7.
708
Y. Aviel, D. Horn, and M. Abeles
Figure 6: Combinatorial and dynamical capacities of HCAs as a function of NE for DBN (o) and SBN (∗). The solid line is αmax · NE .
Figure 7: The allowed region of Cw , the shaded gray area, is limited by Ca from below and Cb from above. The latter, represented by markers with error bars, is estimated from numerical simulations for three different values of K. The markers are slightly shifted with respect to each other—K = 500 (circles) to the left and 1500 (squares) to the right—to allow easy interpretation. These values lie below the constraint Cw CL α < 1 that led us (see equation 4.7) to the combinatorial upper bound αmax .
In the phase portrait illustrated in Figure 7, values of Cw and α that are above the combinatorial upper bound (dashed-dotted line) are forbidden due to the synaptic constraints of our model. Combinations of Cw and α that are below the combinatorial upper bound but outside the shaded area lead to instability of the AS. Below Ca, pattern activities dissolve into the background; hence, they are unstable. The only regime that allows stability of both the AS and the pattern activity is the shaded region in the figure. As the network load increases, Cb and Ca approach each other, shrinking the allowed regime of Cw, in which embedding is possible. In a DBN, the Cb values are higher, closer to the combinatorial upper bound. This enlarges the stability regime of the assemblies in a BN to the extent that it enables realizing the combinatorial upper bound in some cases (Aviel, Horn, & Abeles, 2003a). When embedding SFCs in the BN, the phase portrait stays qualitatively similar, but Cb values are lower, leading to lower dynamical capacity than in the case of HCAs.

5 Discussion

In this letter, we studied embedding of memory patterns in a balanced network of IF neurons. Two types of memories were used: Hebbian cell assemblies (HCAs; Hebb, 1949) and synfire chains (SFCs; Abeles, 1991). We propose a scaling behavior of synaptic weights and other parameters that renders our model invariant to changes in K, the synaptic size of the network. This square root scaling allows for high capacity of both HCAs and SFCs. We emphasized the scaling of variables with K, but it is usually NE that one varies. Using the relation K = εNE, these two variables are linearly related. Note, however, that K may also be assumed to be constant, or to vary at a different rate from NE, and our results will still be valid as long as K/NE ≪ 1. In particular, the increase of the maximal number of patterns Pmax with NE is valid even if K is constant.
We distinguished between the traditional balanced network, which we termed singly balanced network (SBN), and our new doubly balanced network (DBN). In a DBN, memory patterns embedded in the excitatory-toexcitatory part of the connectivity matrix project also onto their shadow patterns. The shadow patterns are merely random pools of inhibitory neurons that receive inputs from their associated excitatory patterns. Counteracting emerged excitatory correlations by inhibitory ones imposes another type of balance. Hence, it is a doubly balanced network. We have shown in Aviel, Horn, and Abeles (2003b) that an optimal choice of d, the ratio of inhibitory to excitatory correlation currents, achieves the desired effect. If d is too small, background oscillations appear, as was the case of the SBN. If d is too large, the induced inhibition kills the synfire waves. The proper value, around
d = 1, allows for optimal performance. The DBNs have the advantage that their background activity during memory recall stays asynchronous even for high memory loads. Another indication of the stabilizing effect of the shadow patterns is given in Aviel et al. (2003b), where it is shown that if a strong, synchronized input is used to ignite a wave in an SBN, rather than a step current, then global oscillations are inevitable. With a DBN, on the other hand, waves are possible on top of asynchronous background activity. Introducing double balance is only one way of stabilizing the AS. Other ways involve more sophisticated neuronal models (Wang, 1999), or introducing variability through nonhomogeneous neuronal populations or through different connectivity schemes. For example, models such as Amit and Brunel (1997) and van Vreeswijk and Sompolinsky (1998) used a variable number of synapses on each neuron. This leads to a broad distribution of firing rates across the population, which in turn introduces variability that helps to stabilize the AS. In our work, however, each neuron receives exactly the same number of synapses (Brunel, 2000). Here we suggest a stabilizing mechanism that is directly related to the cause of the instability. Clearly, additional stabilizing mechanisms, such as a variable number of synapses, variable transmission delays, or more sophisticated neuronal models, will only help to increase the AS’s stability. In the current-based IF model used here, the synaptic currents are independent of the membrane voltage. While this significantly simplifies the model, it is also different from the more biologically plausible conductance-based IF model. These two models show appreciable differences if operated under balanced conditions (Wang, 1999; Lerchner, Ahmadi, & Hertz, 2004). Under balanced input conditions, the conductance-based IF model has an effective membrane time constant that vanishes for large K.
The current-based IF model does not change its effective membrane time constant. This can be a subject for further investigation. A capacity limit αc = 0.12 of HCAs calls for comparison with analytic bounds obtained for binary models, like αc = 0.14 in the Hopfield model (Amit et al., 1985). The comparison must be done with care, as the two types of neuronal models are qualitatively different. Hertz (1999) has argued that a capacity limit obtained in a network of IF neurons should be multiplied by τ/2 to compare it with a network of binary neurons. Hence, the αc = 0.115 obtained here is equivalent to αc = 0.57 in a binary model. It is not surprising that the last number is higher than 0.14, since our model’s memory patterns are sparse, as, for example, in the Tsodyks and Feigelman (1988) model, where larger capacities were achieved. An example of an IF network with high capacity was demonstrated by Sommer and Wennekers (2001). However, their network was not balanced, and memories were retrieved from oscillatory modes. To the best of our knowledge, the model presented here is the first to obtain such a high capacity in an asynchronous mode of a network of spiking neurons.
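Hertz's conversion is a single multiplication; a sketch (the value τ = 10 below is an assumption chosen only so that τ/2 matches the factor of about 5 implied by the quoted numbers; the letter does not state τ in this passage):

```python
def binary_equivalent(alpha_if, tau):
    """Hertz (1999): multiply a capacity limit obtained for IF neurons
    by tau/2 to compare it with a network of binary neurons."""
    return alpha_if * tau / 2.0

# alpha_c = 0.115 from the text; tau = 10 is an illustrative assumption,
# giving 0.575, close to the quoted 0.57.
print(binary_equivalent(0.115, 10.0))
```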
Finally, let us touch on some of the interesting open issues. The firings of the neurons in the ignited pool are at a much higher rate and higher regularity than those reported by delayed match-to-sample experiments (Miyashita & Chang, 1988). In our model, neurons in the ignited pool receive much more excitatory than inhibitory input and therefore do not operate in the balanced-input regime anymore. Ways to reduce the ignited state’s firing rate have been suggested (Amit & Brunel, 1997; Wang, 1999). It will be interesting to incorporate these studies with the DBN approach in order to better fit experimental data. In this work, we recall only one pattern at a time. We require the ignited pattern to be stable (i.e., sustained high firing rate for HCA or sustained propagation of synfire wave) for at least 100 ms without provoking global oscillations. The next step is to ask how many concurrent patterns can be successfully recalled. This is yet another type of capacity. Preliminary results show that three HCAs can be recalled in a K = 1500 model, one of which will survive the competition and exhibit sustained activity for hundreds of milliseconds. Our model uses binary synapses—a synapse either exists, with the strength of one synaptic unit, or is absent. We strengthen bonds if the two neurons participate together in more memories. This construction leads to the combinatorial maximum capacity. One may speculate that different synaptic arrangements, for example, along the spirit of the Willshaw model (Willshaw et al., 1969), will lead to even higher dynamical capacity.

Acknowledgments

This work was supported in part by grants from GIF and DIP.

References

Abeles, M. (1982). Local cortical circuits—An electrophysiological study. Berlin: Springer-Verlag. Abeles, M. (1991). Corticonics. Cambridge: Cambridge University Press. Amit, D.J. (1989). Modeling brain function: The world of attractor neural networks. Cambridge: Cambridge University Press. Amit, D.J., & Brunel, N.
(1997). Model of global spontaneous activity and local structured activity during delay periods in the cerebral cortex. Cerebral Cortex, 7, 237–252. Amit, D.J., Gutfreund, H., & Sompolinsky, H. (1985). Storing infinite numbers of patterns in a spin-glass model of neural networks. Physical Review Letters, 55, 1530–1533. Aviel, Y., Horn, D., & Abeles, M. (2003a). The doubly balanced network of spiking neurons: A memory model with high capacity. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15. Cambridge, MA: MIT Press.
Aviel, Y., Horn, D., & Abeles, M. (2003b). Synfire waves in small balanced networks. Neurocomputing, 58–60, 123–127. Aviel, Y., Mehring, C., Horn, D., & Abeles, M. (2003). On embedding synfire chains in a balanced network. Neural Computation, 15(6), 1321–1340. Bienenstock, E. (1995). A model of the neocortex. Network: Computation in Neural Systems, 6, 179–224. Braitenberg, V. (1978). Cell assemblies in the cerebral cortex. In R. Heim & G. Palm (Eds.), Theoretical approaches to complex systems. Berlin: Springer. Braitenberg, V., & Schüz, A. (1991). The anatomy of the cortex: Statistics and geometry. Berlin: Springer. Brunel, N. (2000). Dynamics of sparsely connected networks of excitatory and inhibitory spiking neurons. J. Comput. Neurosci., 8(3), 183–208. Brunel, N., & Wang, X.-J. (2001). Effects of neuromodulation in a cortical network model of object working memory dominated by recurrent inhibition. Journal of Computational Neuroscience, 11, 63–85. Compte, A., Brunel, N., Goldman-Rakic, P.S., & Wang, X.-J. (2000). Synaptic mechanisms and network dynamics underlying spatial working memory in a cortical network model. Cerebral Cortex, 10, 910–923. Diesmann, M., Gewaltig, M.O., & Aertsen, A. (1995). SYNOD: An environment for neural systems simulations. Rehovot, Israel: Weizmann Institute of Science. Diesmann, M., Gewaltig, M.O., & Aertsen, A. (1999). Stable propagation of synchronous spiking in cortical neural networks. Nature, 402(6761), 529–533. Funahashi, S., Bruce, C.J., & Goldman-Rakic, P.S. (1989). Mnemonic coding of visual space in the monkey’s dorsolateral prefrontal cortex. Journal of Neurophysiology, 61, 331–349. Gerstein, G.L., & Mandelbrot, B. (1964). Random walk models for the spike activity of a single neuron. Biophysical Journal, 4, 41–68. Gerstner, W., & van Hemmen, L. (1992). Associative memory in a network of “spiking” neurons. Network, 3, 139–164. Hansel, D., Mato, G., & Meunier, C. (1995). Synchronization in excitatory neural networks.
Neural Computation, 7, 307–337. Hayon, G. (2002). Modeling compositionality in biological neural networks by dynamic binding of synfire chains. Unpublished doctoral dissertation, Hebrew University, Jerusalem. Hebb, D.O. (1949). The organization of behavior. New York: Wiley. Herrmann, M., Hertz, J.A., & Prügel-Bennett, A. (1995). Analysis of synfire chains. Network: Comput. Neural Syst., 6, 403–414. Hertz, J.A. (1999). Modeling synfire networks. In G. Burdet, P. Combe, & O. Parodi (Eds.), Neuronal information processing—From biological data to modelling and application. Cargèse, France. Hopfield, J.J. (1982). Neural networks and physical systems with emergent collective computational abilities. PNAS, 79, 2554–2558. Horn, D., Sagi, D., & Usher, M. (1991). Segmentation, binding and illusory conjunctions. Neural Computation, 3, 510–525. Huang, P.E., & Stevens, F.C. (1997). Estimating the distribution of synaptic reliabilities. Journal of Neurophysiology, 78(6), 2870–2880.
Izhikevich, E.M., Gally, J.A., & Edelman, G.M. (2004). Spike-timing dynamics of neuronal groups. Cerebral Cortex, 14, 933–944. Lerchner, A., Ahmadi, M., & Hertz, J. (2004). High conductance states in a mean field cortical network model. Neurocomputing, 58–60, 935–940. Levy, N., Horn, D., Meilijson, I., & Ruppin, E. (2001). Distributed synchrony in cell assembly of spiking neurons. Neural Networks, 14, 815–824. Mehring, C., Hehl, U., Kubo, M., Diesmann, M., & Aertsen, A. (2003). Activity dynamics and propagation of synchronous spiking in locally connected random networks. Biological Cybernetics, 88, 395–408. Miller, R. (1996). Neural assemblies and laminar interactions in the cerebral cortex. Biological Cybernetics, 75(3), 253–261. Miyashita, Y., & Chang, H.S. (1988). Neuronal correlate of pictorial short-term memory in the primate temporal cortex. Nature, 331, 68–70. Morrison, A., Mehring, C., Geisel, T., Aertsen, A., & Diesmann, M. (2004). Advancing the boundaries of high connectivity network simulation with distributed computing. Manuscript submitted for publication. Shadlen, M.N., & Newsome, W.T. (1994). Noise, neural codes and cortical organization. Curr. Opin. Neurobiol., 4(4), 569–579. Sommer, F.T., & Wennekers, T. (2000). Modeling studies on the computational function of fast temporal structure in cortical circuit activity. Journal of Physiology Paris, 94(5–6), 473–488. Sommer, F.T., & Wennekers, T. (2001). Associative memory in networks of spiking neurons. Neural Networks, 14(6–7), 825–834. Tetzlaff, T., Buschermöhle, M., Geisel, T., & Diesmann, M. (2003). The spread of rate and correlation in stationary cortical networks. Neurocomputing, 52–54, 949–954. Treves, A. (1990). Graded-response neurons and information encodings in autoassociative memories. Physical Review A, 42(4), 2418. Tsodyks, M.V., & Feigelman, M.V. (1988). The enhanced storage capacity in neural networks with low activity level. Europhysics Letters, 6(2), 101–105. Tuckwell, H.C. (1988).
Introduction to theoretical neurobiology. Cambridge: Cambridge University Press. van Vreeswijk, C., & Sompolinsky, H. (1998). Chaotic balanced state in a model of cortical circuits. Neural Comput., 10(6), 1321–1371. von der Malsburg, C., & Schneider, W. (1986). A neural cocktail party processor. Biological Cybernetics, 54, 29–40. Wang, X.-J. (1999). Synaptic basis of cortical persistent activity: The importance of NMDA receptors to working memory. Journal of Neuroscience, 19(21), 9587– 9607. Willshaw, D.J., Buneman, O.P., & Longuet-Higgins, H.C. (1969). Nonholographic associative memory. Nature, 222, 960–962. Received October 6, 2003; accepted August 3, 2004.
LETTER
Communicated by Irwin Sandberg
On the Capabilities of Higher-Order Neurons: A Radial Basis Function Approach Michael Schmitt
[email protected] Lehrstuhl Mathematik und Informatik, Fakultät für Mathematik, Ruhr-Universität Bochum, D–44780 Bochum, Germany
Higher-order neurons with k monomials in n variables are shown to have Vapnik-Chervonenkis (VC) dimension at least nk + 1. This result supersedes the previously known lower bound obtained via k-term monotone disjunctive normal form (DNF) formulas. Moreover, it implies that the VC dimension of higher-order neurons with k monomials is strictly larger than the VC dimension of k-term monotone DNF. The result is achieved by introducing an exponential approach that employs gaussian radial basis function neural networks for obtaining classifications of points in terms of higher-order neurons.
1 Introduction Higher-order neurons are simple but powerful extensions of linear neuron models. They introduce the concept of nonlinearity by incorporating monomials, that is, products of input variables, as an intermediate step— effectively a hidden layer—of computation. A common way of specifying a higher-order neuron is to determine the number of input variables and the number of monomials, but leaving the order of the monomials open. Thus, one obtains a powerful neuron model that is capable of computing with unlimited degrees of nonlinearity. The computational capabilities that arise from this model, however, are far from being completely understood. One of the well-studied theoretical notions used to characterize the computational richness of neural networks is the Vapnik-Chervonenkis (VC) dimension. More generally, the VC dimension of a function class quantifies its classification capabilities (Vapnik & Chervonenkis, 1971). It indicates the cardinality of the largest set for which all possible binary-valued classifications can be obtained using functions from the class. The VC dimension is also well founded as a measure for the sample complexity of learning (see, e.g., Anthony & Bartlett, 1999). It yields estimates for the number of examples needed by a learning algorithm to output functions that have low generalization errors. Neural Computation 17, 715–729 (2005)
© 2005 Massachusetts Institute of Technology
716
M. Schmitt
We establish here a new lower bound on the VC dimension of higher-order neurons. We show that the higher-order neuron with k monomials in n variables has VC dimension at least nk + 1. The largest lower bound that has been known previously is derived from the lower bound for boolean formulas in k-term monotone disjunctive normal form (DNF), that is, disjunctions of at most k monomials without negations. This bound has been obtained by Littlestone (1988). In particular, Littlestone has shown that the class of k-term monotone l-DNF formulas (i.e., with monomials containing at most l variables) has VC dimension at least lk log(n/m), where l ≤ m ≤ n and k ≤ (m choose l). Using l = n/4 and m = n/2, for instance, this yields the lower bound nk/4 for the VC dimension of k-term monotone DNF and, hence, of higher-order neurons with k monomials, where k has to satisfy the given constraints. The new bound that we provide here for higher-order neurons supersedes this previous bound in a threefold way. First, it improves the bound from k-term monotone DNF formulas in value. Second, it releases k from the constraints through n in that the new bound holds for every n and k—in particular, for values of k that are larger than the number of monotone monomials. Finally, nk + 1 is even larger than the VC dimension of the class of k-term monotone DNF formulas itself. We show that the difference between both dimensions is larger than k log(k/e) + 1. So far, a considerable number of results and techniques for VC dimension bounds have been provided in the context of real-valued function classes (see, e.g., Bartlett & Maass, 2003). For specific subclasses that can be computed by higher-order neurons, tight bounds have been calculated.
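The relation between the two lower bounds can be checked numerically; a sketch (assuming, as in the text, l = n/4, m = n/2, and a base-2 logarithm, so that log(n/m) = 1):

```python
import math

def new_bound(n, k):
    # Lower bound established in this letter: nk + 1.
    return n * k + 1

def littlestone_bound(n, k):
    # Littlestone's bound l*k*log(n/m) instantiated with l = n/4, m = n/2,
    # which simplifies to nk/4.
    l, m = n // 4, n // 2
    return l * k * math.log2(n / m)

# For n = 16, k = 3: nk + 1 = 49 versus nk/4 = 12.
print(new_bound(16, 3), littlestone_bound(16, 3))
```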
Karpinski and Werther (1993) have shown that univariate polynomials with at most k terms have a VC dimension proportional to k.1 Further, the VC dimension of the class of monomials over the reals is equal to n (see Ehrenfeucht, Haussler, Kearns, & Valiant, 1989; Schmitt, 2002c, for lower and upper bound, respectively). There is also a VC dimension result known for n-variate d-degree polynomials (see, e.g., Ben-David & Lindenbaum, 1998). This class has VC dimension equal to (n+d choose d). However, as the class contains polynomials consisting of (n+d choose d) terms and the bound k on the number of monomials restricts the number of variables in terms of k, this result entails for higher-order neurons (without a constraint on the degree) a lower bound not better than the bound due to Littlestone (1988). Previous work has established techniques for deriving lower bounds on the VC dimension for quite general types of real-valued function classes.
1 Strictly speaking, Karpinski and Werther (1993) studied a related notion—the so-called pseudo-dimension. Following their methods, it is not hard to obtain this result for the VC dimension (see also Schmitt, 2002a).
On the Capabilities of Higher-Order Neurons
717
Building on results by Lee, Bartlett, & Williamson (1995), Erlich, Chazan, Petrack, & Levy (1997) provide powerful means for obtaining lower bounds for parameterized function classes.2 An essential requirement for using these techniques, however, is that the function class is “smoothly” parameterized, a fact that does not apply to the exponents of polynomials. The lower bound method of Koiran and Sontag (1997) for various types of neural networks, generalized by Bartlett, Maiorov, and Meir (1998) to neural networks with a given number of layers, cannot be employed for higher-order neurons either. This technique is constrained to networks where each neuron computes a function with finite limits at infinity, a property monomials do not have. Further, Koiran and Sontag (1997) designed a lower-bound method for networks consisting of linear and multiplication gates. However, the way these networks are constructed, with layers consisting of products of linear terms,3 does not give rise to higher-order neurons even when the number of layers is restricted. We provide a completely new approach to the derivation of lower bounds on the VC dimension of higher-order neurons. First, we establish the lower bound nk + 1 on the VC dimension of a specific type of radial basis function (RBF) neural network (see, e.g., Haykin, 1999). The networks considered here have k gaussian units as computational elements and satisfy certain assumptions with respect to the input domain and the values taken by the parameters. The bound for these networks improves a result of Erlich et al. (1997) in combination with Lee et al. (1995), who established the lower bound n(k − 1) for RBF networks4 with restrictions on neither inputs nor parameters. Then we use our result for RBF networks to obtain the lower bound on the VC dimension of higher-order neurons. Thus, RBF networks open a new way to assess the classification capabilities of higher-order neurons.
This gaussian RBF approach has also proven to be helpful in a different context dealing with the roots of univariate polynomials (Schmitt, 2004). Higher-order neurons are a special case of a particular type of neural networks, the so-called product unit neural networks (Durbin & Rumelhart,

2 A parameterized function class is given in terms of a function having two types of variables: input variables and parameter variables. The function class is obtained by instantiating the parameter variables with, in general, real numbers. Neural networks are prominent examples for parameterized function classes.
3 Precisely, such a layer uses products of the form (x − a1) · · · (x − al), where it is crucial that there is no bound on l.
4 These results and the one presented here concern RBF networks with uniform width, that is, where all units have equal width. This constraint is not a fundamental restriction since, as shown by Park and Sandberg (1991), these networks still have universal approximation capabilities. As far as the VC dimension is concerned, however, better lower bounds are known for more general types of RBF networks (Schmitt, 2002b).
1989). It immediately follows from the bound for higher-order neurons established here that the VC dimension of product unit neural networks with n input nodes and one layer of k hidden nodes (that is, nodes that are neither input nor output nodes) is at least nk + 1. Concerning upper bounds for the VC dimension of higher-order neurons, there are two relevant results. The bound O(n²k⁴) due to Karpinski and Macintyre (1997) is the smallest bound known for higher-order neurons with unlimited degree (see also Schmitt, 2002c). The higher-order neuron with n variables, k monomials, and degree no larger than d has VC dimension no more than 2nk log(9d) (Schmitt, 2002c). The derivation of the new lower bound not only narrows the gap between upper and lower bounds, but also gives rise to subclasses of degree-restricted higher-order neurons for which the bound is optimal up to the factor 2 log(9d). We introduce definitions and notation in section 2. Section 3 provides geometric constructions that are required for the derivations of the main results presented in section 4. Finally, in section 5, we compare the new bound with the upper bound for k-term monotone DNF formulas and show that the new bound exceeds the VC dimension of this class.

2 Definitions

A higher-order neuron (or sigma-pi unit) with k monomials in n variables computes the functions

a0 + a1 x1^(b1,1) · · · xn^(b1,n) + · · · + ak x1^(bk,1) · · · xn^(bk,n)

(2.1)
with real coefficients a0, . . . , ak (the output weights of the neuron with bias a0) and nonnegative integer exponents b1,1, . . . , bk,n. The coefficients and the exponents are the adjustable parameters of the neuron. For given n and k, the function class constituted by all functions in equation 2.1 is also known as the class of k-sparse n-variate polynomials. If the exponents b1,1, . . . , bk,n are allowed to be arbitrary real numbers, each monomial becomes a product unit, and we obtain a product unit neural network with one hidden layer of k product units. We use bold symbols to indicate vectors, such as x and ci; nonbold symbols are reserved for scalar variables. Let ‖·‖ denote the Euclidean norm. A radial basis function neural network (RBF network, for short) with n input nodes computes functions that map from Rn to R and can be written as

w0 + w1 exp(−‖x − c1‖²/σ²) + · · · + wk exp(−‖x − ck‖²/σ²),

where k is the number of RBF units. This particular type of network is also known as a gaussian RBF network. Each exponential term corresponds to the function computed by a gaussian RBF unit with center ci ∈ Rn, where
n is the number of variables, and width σ ∈ R \ {0}. The width is a network parameter that we assume here to be equal for all units, that is, we consider RBF networks with uniform width. Further, w0, . . . , wk are the output weights, and w0 is also referred to as the bias of the network. The Vapnik-Chervonenkis (VC) dimension of a class F of real-valued functions is defined via the notion of shattering. A set S ⊆ Rn is said to be shattered by F if every dichotomy of S is induced by F, that is, if for every pair (S−, S+), where S− ∩ S+ = ∅ and S− ∪ S+ = S, there is some function f ∈ F such that sgn ◦ f(S−) ⊆ {0} and sgn ◦ f(S+) ⊆ {1}. Here sgn: R → {0, 1} denotes the sign function, satisfying sgn(x) = 1 if x ≥ 0, and sgn(x) = 0 otherwise. The VC dimension of F is then defined as the cardinality of the largest set shattered by F. (It is said to be infinite if there is no such set.) The VC dimension of a neuron or a neural network is equated with the VC dimension of the class of functions computed by the neuron or the neural network, respectively. Finally, we make use of the geometric notions of a ball and a hypersphere. A ball in Rn is given in terms of a center c ∈ Rn and a radius ρ ∈ R as the set B(c, ρ) = {x ∈ Rn : ‖x − c‖ ≤ ρ}. A hypersphere is the set of points on the surface of a ball, that is, S(c, ρ) = {x ∈ Rn : ‖x − c‖ = ρ}.

3 Ancillary Constructions

In the following, we provide the geometric constructions that are the basis for the main result in section 4. The idea pursued here is to represent classifications of sets using unions of balls, where a point is classified as positive if and only if it is contained in some ball. In order to be shattered, the sets are chosen to satisfy a certain condition of independence with respect to the positions of their elements. The points are required to lie on hyperspheres such that each hypersphere is maximally determined by the set of points.
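For concreteness, both function classes defined in section 2 can be evaluated in a few lines; a minimal numpy sketch (the array shapes and example values are illustrative assumptions):

```python
import numpy as np

def higher_order_neuron(x, a0, a, B):
    """Equation 2.1: a0 + sum_i a_i * prod_j x_j**b_{i,j}.
    x: (n,) input; a: (k,) output weights; B: (k, n) integer exponents."""
    monomials = np.prod(x ** B, axis=1)   # value of each of the k monomials
    return a0 + a @ monomials

def rbf_network(x, w0, w, C, sigma):
    """Gaussian RBF network: w0 + sum_i w_i * exp(-||x - c_i||**2 / sigma**2).
    C: (k, n) centers; sigma: uniform width shared by all k units."""
    d2 = np.sum((x - C) ** 2, axis=1)     # squared distances to the centers
    return w0 + w @ np.exp(-d2 / sigma ** 2)

x = np.array([2.0, 3.0])
# One monomial x1**2 * x2: 1 + 2 * (4 * 3) = 25.
print(higher_order_neuron(x, 1.0, np.array([2.0]), np.array([[2, 1]])))
```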
In other words, removing any point increases the set of possible hyperspheres that contain the reduced set. The following definition makes this notion of independence precise.

Definition 1. A set Q ⊆ Rn of at most n + 1 points is in general position for hyperspheres if the system of equalities

‖p − c‖ = η,

for all p ∈ Q,

(3.1)
in the variables c = (c1, . . . , cn) and η has a solution and, for every q ∈ Q, the solution set is a proper subset of the solution set of the system

‖p − c‖ = η,
for all p ∈ Q \ {q}.
(3.2)
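Definition 1 can be probed numerically: substituting ϑ = η² − ‖c‖² turns the sphere equations into the linear system ‖p‖² − 2p · c = ϑ (the linearization also used in the proof of lemma 1 below), and general position then corresponds to the coefficient rows of this system being linearly independent. A sketch (assuming the points are drawn from a hypersphere, so that a real solution exists):

```python
import numpy as np

def in_general_position(Q):
    """Rank test for definition 1 via the linearized system
    ||p||**2 - 2 p.c = theta in the unknowns (c, theta): removing any
    point strictly enlarges the solution set iff the rows are independent."""
    Q = np.asarray(Q, dtype=float)
    m, n = Q.shape
    assert m <= n + 1, "definition 1 concerns at most n + 1 points"
    A = np.hstack([2.0 * Q, np.ones((m, 1))])  # one row [2p, 1] per point
    return np.linalg.matrix_rank(A) == m

# Two points of the unit circle are in general position; a duplicated
# point is not, since removing the copy changes the solution set not at all.
print(in_general_position([[1.0, 0.0], [0.0, 1.0]]),
      in_general_position([[1.0, 0.0], [1.0, 0.0]]))
```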
Given a set of points that satisfies this definition and lies on a hypersphere, we next want to find a ball such that one of the points lies outside the ball while the other points are on its surface. We show that this can be done provided that the set is in general position for hyperspheres. Moreover, the ball can be chosen with the center and the radius as close as possible to the center and the radius of the hypersphere that contains all points.

Lemma 1. Suppose that Q ⊆ Rn is a set of at most n + 1 points in general position for hyperspheres, and let q ∈ Q. Further, let c ∈ Rn, η ∈ R be a solution of the system

‖p − c‖ = η,
for all p ∈ Q.
(3.3)
Then, for every ε > 0, there exists a solution c(ε) ∈ Rn, η(ε) ∈ R of the system

‖p − c(ε)‖ = η(ε),
for all p ∈ Q \ {q},
‖q − c(ε)‖ > η(ε),

satisfying ‖c − c(ε)‖ < ε and |η − η(ε)| < ε.

Proof. Without loss of generality, we may assume that η > 0. (If η = 0, then we have |Q| = 1, and the statement is trivial.) Since c and η solve the system 3.3, c and ϑ = η² − ‖c‖² are a solution of the system

‖p‖² − 2p · c = ϑ,
for all p ∈ Q.
(3.4)
Because Q is in general position for hyperspheres, the solution set of the system 3.4 is a proper subset of the solution set of the system

‖p‖² − 2p · c = ϑ,
for all p ∈ Q \ {q}.
(3.5)
According to facts from linear algebra, there exist a ∈ Rn and α ∈ R such that for every λ ≠ 0, c + λa and ϑ + λα constitute a solution of the system 3.5 that does not solve the system 3.4. For a given ε > 0, choose λ(ε) ∈ R \ {0} such that |λ(ε)| is sufficiently small to satisfy ‖λ(ε)a‖ < ε and

|√(ϑ + ‖c‖²) − √(ϑ + λ(ε)α + ‖c + λ(ε)a‖²)| < ε. (3.6)
It is obvious that the second inequality can be met due to the fact that the equation √(ϑ + ‖c‖²) = η holds, which we get from the definition of ϑ and the assumption η > 0. Since c + λ(ε)a and ϑ + λ(ε)α solve equation 3.5 but not 3.4, it follows that ‖q‖² − 2q · (c + λ(ε)a) ≠ ϑ + λ(ε)α, which, using ‖q‖² − 2q · c = ϑ from equation 3.4, is equivalent to −2λ(ε)q · a ≠ λ(ε)α. Due to this inequality, we can choose the (not yet specified) sign of λ(ε) such that −2λ(ε)q · a > λ(ε)α. Again with ‖q‖² − 2q · c = ϑ, it follows that ‖q‖² − 2q · (c + λ(ε)a) > ϑ + λ(ε)α, and, therefore, ‖q − (c + λ(ε)a)‖² > ϑ + λ(ε)α + ‖c + λ(ε)a‖². Hence, defining c(ε) = c + λ(ε)a and

η(ε) = √(ϑ + λ(ε)α + ‖c + λ(ε)a‖²),

we obtain ‖q − c(ε)‖ > η(ε). Furthermore, the inequalities 3.6 imply that ‖c − c(ε)‖ < ε and |η − η(ε)| < ε hold as claimed.

We now apply the previous result to show that any dichotomy of a given set of points can be obtained using balls. As the set may generally be a subset of some larger set, we also ensure that the balls do not enclose any additional point. Further, we guarantee that this can be done with all centers remaining positive, a condition that will turn out to be useful in the following section. We say here that a vector is positive if all of its components are larger than zero.

Lemma 2. Let Q ⊆ Rn be a set of n points in general position for hyperspheres, and let P ⊆ Rn be a finite set with Q ⊆ P. Assume further that there exists a positive center c ∈ Rn and a radius η ∈ R such that Q ⊆ S(c, η), P ∩ B(c, η) = Q.
Then for every R ⊆ Q, there exists a positive center d ∈ Rn and a radius ζ ∈ R such that R ⊆ S(d, ζ), P ∩ B(d, ζ) = R.

Proof. Clearly, it is sufficient to consider sets R that are proper subsets of Q. Without loss of generality, we may assume that |R| = |Q| − 1. The general case then follows inductively. Suppose that q ∈ Q, and let R = Q \ {q}. According to lemma 1, for every ε > 0 there exist c(ε), η(ε) satisfying

‖p − c(ε)‖ = η(ε),
for all p ∈ Q \ {q},
(3.7)
q − c(ε) > η(ε),
(3.8)
c − c(ε) < ε,
(3.9)
|η − η(ε)| < ε.
(3.10)
Obviously, property (3.7) implies that R ⊆ S(c(ε), η(ε)). Property (3.8) states that q ∉ B(c(ε), η(ε)). Since the assumption P ∩ B(c, η) = Q implies that every p ∈ P \ Q satisfies the constraint ‖p − c‖ > η, properties (3.9) and (3.10) entail the condition ‖p − c(ε)‖ > η(ε) for all sufficiently small ε. Thus, for any such ε, we get the assertion P ∩ B(c(ε), η(ε)) = R. Further, as c is positive, property (3.9) ensures that c(ε) is positive for some sufficiently small ε. Hence, the claim follows for d = c(ε), ζ = η(ε).

4 VC Dimension Bound for Higher-Order Neurons

Before getting to the main result, we derive the lower bound nk + 1 for the VC dimension of a restricted type of RBF network. For more general RBF networks, results of Erlich et al. (1997) and Lee et al. (1995) yield the lower bound n(k − 1). The following theorem is stronger not only in the value of the bound but also in the assumptions that hold: the points of the shattered set all have the same distance from the origin, the centers of the RBF units are rational numbers, and the width can be chosen arbitrarily small.

Theorem 1. Let n ≥ 2, k ≥ 1, and ρ > 0 be given. There exists a set P ⊆ S(0, ρ) ⊆ Rⁿ of nk + 1 points and a real number σ₀ > 0 such that P is shattered by the RBF network with k hidden units, positive rational centers, and any width 0 < σ ≤ σ₀.
Figure 1: The points of the shattered set are chosen from the intersections of the hypersphere S(0, ρ) with the surfaces of pairwise disjoint balls B(ci , ηi ). All balls have their centers in the positive orthant. There is one additional point s not contained in any of the balls.
Proof. Suppose that B(c1, η1), . . . , B(ck, ηk) are pairwise disjoint balls with positive centers c1, . . . , ck ∈ Rⁿ such that for i = 1, . . . , k, the intersection S(ci, ηi) ∩ S(0, ρ) is nonempty and not a single point. (An example for n = 2 and k = 3 is shown in Figure 1.) For i = 1, . . . , k, let Pi ⊆ S(ci, ηi) ∩ S(0, ρ) be a set of n points in general position for hyperspheres. (Note that Pi is constrained to lie on two different hyperspheres. This still allows choosing Pi in such a way that it is in general position, since Pi contains n (and not n + 1) points, so that the set of possible centers for Pi yields a line.) Further, let s ∈ S(0, ρ) be some point such that s ∉ B(ci, ηi) for i = 1, . . . , k. We claim that the set P = {s} ∪ P1 ∪ · · · ∪ Pk, which has nk + 1 points, is shattered by the RBF network with the postulated restrictions on the parameters. Assume that (P⁻, P⁺) is some arbitrary dichotomy of P where s ∈ P⁻. (We will argue at the end of the proof that the complementary case can be treated by reversing signs.) Let (Pi⁻, Pi⁺) denote the dichotomy induced on Pi. By construction, every Pi satisfies
Pi ⊆ S(ci , ηi ) and P ∩ B(ci , ηi ) = Pi .
Hence, by lemma 2, instantiating the set Q with Pi and the set R with Pi⁺, it follows that there exist positive centers di and radii ζi such that

Pi⁺ ⊆ S(di, ζi) and P ∩ B(di, ζi) = Pi⁺,
for i = 1, . . . , k. Moreover, the centers di can be replaced by rational centers d̃i that are sufficiently close to di, such that every point of P lying outside the ball B(di, ζi) is outside the ball B(d̃i, ζ̃i) for some ζ̃i ∈ R close to ζi, and every point of P lying on the hypersphere S(di, ζi) is contained in the ball B(d̃i, ζ̃i). Thus, every p ∈ P satisfies

p ∈ B(d̃i, ζ̃i) if and only if p ∈ Pi⁺,
(4.1)
for i = 1, . . . , k. Clearly, since the centers di are positive, the rational centers d̃i can be chosen to be positive as well. The parameters of the RBF network are specified as follows. The ith unit is associated with the ball B(d̃i, ζ̃i). Assigned to it is d̃i as the center and, as output weight, the value exp(ζ̃i²/σ²) (where σ will be determined below), so that the unit contributes the term

exp(ζ̃i²/σ²) · exp(−‖x − d̃i‖²/σ²)

to the computation of the network. From assertion 4.1, we obtain that every p ∈ P \ Pi⁺ satisfies the constraint ‖p − d̃i‖ > ζ̃i. Thus, for every sufficiently small σ > 0 and every p ∈ P \ Pi⁺, we achieve that
exp(−(‖p − d̃i‖² − ζ̃i²)/σ²) < 1/k   (4.2)

is valid for i = 1, . . . , k. On the other hand, for every p ∈ Pi⁺, condition 4.1 implies ‖p − d̃i‖ ≤ ζ̃i, which entails

exp(−(‖p − d̃i‖² − ζ̃i²)/σ²) ≥ 1   (4.3)
for every σ > 0. Finally, we set the bias term equal to −1. It is now easy to see that the dichotomy (P⁻, P⁺) is induced by these parameter settings. If p ∈ P⁻, then, according to inequality 4.2, the weighted output values of the units and the bias sum up to a negative value. In the case p ∈ P⁺, we have p ∈ Pi⁺ for some i, and by inequality 4.3, unit i contributes a weighted value of at least 1, while the other units contribute positive values, so that the total network output is positive. The construction for the case that classifies s as positive works similarly: we invoke lemma 2 substituting Pi⁻ for R and derive the analogous version of assertion 4.1 with Pi⁺ replaced by Pi⁻. Then it is obvious that if the weights defined above are equipped with negative signs and 1 is used as the bias, the network induces the dichotomy as claimed. We observe that σ may have been chosen such that it depends on the particular dichotomy. To complete the proof, we require σ₀ to be small enough that inequality 4.2 holds for σ ≤ σ₀ on all points and dichotomies of P.

One assumption of the theorem can be slightly weakened: it is not necessary to require that s ∈ S(0, ρ). Instead, every point not contained in any of the balls B(ci, ηi) can be selected for s. However, the restriction is required for the application of the theorem in the following result, which is the main contribution of this article. For its proof, we recall the definition of a product unit neural network in section 2.

Theorem 2. For every n, k ≥ 1, the higher-order neuron with k monomials in n variables has VC dimension at least nk + 1.

Proof. We first consider the case n ≥ 2. By theorem 1, for ρ > 0, let P ⊆ Rⁿ, P ⊆ S(0, ρ), be the set of cardinality nk + 1 that is shattered by the RBF network with k hidden units and the stated parameter settings. We show that P can be transformed into a set P′ that is shattered by the higher-order neuron with k monomials.
The weighted output computed by unit i in the RBF network on input p ∈ P can be written as

wi · exp(−‖p − ci‖²/σ²)
    = wi · exp(−(‖p‖² − 2p·ci + ‖ci‖²)/σ²)
    = wi · exp(−(‖p‖² + ‖ci‖²)/σ²) · exp(2p·ci/σ²)
    = wi · exp(−(ρ² + ‖ci‖²)/σ²) · exp(2p1 ci,1/σ²) · · · exp(2pn ci,n/σ²),
where we have used the assumption P ⊆ S(0, ρ) for the last equation and pj, ci,j to denote the jth components of the vectors p, ci, respectively. Consider a product unit network with one hidden layer, where unit i has output weight

wi′ = wi · exp(−(ρ² + ‖ci‖²)/σ²)

and exponents 2ci,j/σ² for j = 1, . . . , n. On inputs from the set P′ = {(e^p1, . . . , e^pn) : (p1, . . . , pn) ∈ P}, this product unit network computes the same values as the RBF network on P. Moreover, the exponents of the product units are positive rationals. According to theorem 1, for some σ₀, any width 0 < σ ≤ σ₀ can be used. Therefore, we may choose σ² = 1/l for some natural number l that is sufficiently large and a common multiple of all denominators occurring in any ci,j, so that the exponents become integers. With these parameter settings, we have a higher-order neuron with k monomials that computes on P′ the same output values as the RBF network on P. As this can be done for every dichotomy of P, it follows that P′ is shattered by the higher-order neuron with k monomials.

For the case n = 1, we again use the RBF technique and ideas from Schmitt (2002a, 2004). Clearly, the set M = {0, . . . , k} can be shattered by an RBF network with k + 1 hidden units and zero bias. For each i ∈ M, we employ an RBF unit with center i. Given a dichotomy (M⁻, M⁺), we let the output weight for unit i be −1 if i ∈ M⁻ and 1 if i ∈ M⁺. If the width σ is small enough, the output value of the network has the requested sign on every input i ∈ M. Now, let σ be the smallest width sufficient for all dichotomies of M. Then

w0 exp(−x²/σ²) + w1 exp(−(x − 1)²/σ²) + · · · + wk exp(−(x − k)²/σ²) ≥ 0

is, by multiplication with exp(x²/σ²), equivalent to

w0 + w1 exp((2x − 1)/σ²) + · · · + wk exp((2kx − k²)/σ²) ≥ 0.

The latter can be written as

w0 + w1 exp(−1/σ²) exp(2x/σ²) + · · · + wk exp(−k²/σ²) exp(2kx/σ²) ≥ 0.
Substituting y = exp(2x/σ²), this holds if and only if

w0 + w1 exp(−1/σ²) y + · · · + wk exp(−k²/σ²) yᵏ ≥ 0.

Thus, for every dichotomy of M, we obtain a dichotomy of the set

M′ = {e^(2i/σ²) : i = 0, . . . , k}

induced by a higher-order neuron with k monomials. In other words, M′ is shattered by this neuron.

5 Comparison with k-Term Monotone DNF

A boolean formula that is a disjunction of up to k monomials without negations can be considered as a polynomial restricted to boolean inputs. The previously best-known lower bound for the VC dimension of higher-order neurons with k monomials was the bound for k-term monotone DNF due to Littlestone (1988). By deriving an upper bound for the latter class and applying theorem 2, we show that the VC dimension for higher-order neurons with k monomials is strictly larger than for k-term monotone DNF. We use “log” to denote the logarithm to base 2.

Corollary 1. Let n ≥ 1 and 3 ≤ k ≤ 2ⁿ. The VC dimension of the higher-order neuron with k monomials in n variables exceeds the VC dimension of the class of k-term monotone DNF in n variables by more than k log(k/e) + 1.

Proof. A k-term monotone DNF formula corresponds to a collection of up to k subsets of the set of variables. For n variables, there are no more than Σ_{i=0}^{k} C(2ⁿ, i) such collections, where C(m, i) denotes the binomial coefficient. The known inequality Σ_{i=0}^{d} C(m, i) < (em/d)^d, where 1 ≤ d ≤ m (see, e.g., Anthony & Bartlett, 1999, theorem 3.7), yields

Σ_{i=0}^{k} C(2ⁿ, i) < (e2ⁿ/k)ᵏ.
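Stepping back to the transformation used in the proof of theorem 2, the rewriting of an RBF unit's weighted output as a product unit can be checked numerically. The point, center, weight, and width below are arbitrary choices of ours, not values from the article; the point is normalized so that ‖p‖ = ρ, as the proof assumes.

```python
import math
import random

random.seed(0)
n, rho, sigma = 3, 2.0, 0.7

def normalize(v, r):
    """Scale vector v to have Euclidean norm r."""
    s = math.sqrt(sum(x * x for x in v))
    return [x * r / s for x in v]

# An arbitrary point p with ||p|| = rho, and an arbitrary RBF unit (c, w).
p = normalize([random.uniform(-1, 1) for _ in range(n)], rho)
c = [random.uniform(0.1, 1.0) for _ in range(n)]
w = 1.3

# RBF form: w * exp(-||p - c||^2 / sigma^2)
rbf = w * math.exp(-sum((pj - cj) ** 2 for pj, cj in zip(p, c)) / sigma**2)

# Product-unit form: w' * prod_j (e^{p_j})^{2 c_j / sigma^2}
# with w' = w * exp(-(rho^2 + ||c||^2) / sigma^2) and input (e^{p_1},...,e^{p_n}).
w_prime = w * math.exp(-(rho**2 + sum(cj * cj for cj in c)) / sigma**2)
prod_unit = w_prime
for pj, cj in zip(p, c):
    prod_unit *= math.exp(pj) ** (2 * cj / sigma**2)

assert rbf > 0
assert abs(rbf - prod_unit) < 1e-9 * rbf   # the two forms agree
```

The proof then makes the exponents 2cj/σ² integers by choosing σ² = 1/l, which is what turns the product unit network into a higher-order neuron.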
H(S) + 1, it follows that there exists a better codebook than codebook B. Note that in both codebooks, each stimulus is assigned a different codeword, so if there existed a channel in which the response to each stimulus was the corresponding codeword, P(c) would be 1.0. Thus, the source coding theorem gives the fewest number of bits required to encode a variable so that it can be decoded without error. A third interpretation of H(S), inspired by the description of H(S) as a measure of uncertainty about S, is that it is a measure of the difficulty an ideal observer would have estimating S. The flaws in this interpretation are discussed in the next section.

4.1.2 Comparing Entropy and Ideal Observers. H(S) is a function only of P(S). Hence, to directly compare H(S) and P(c), we consider the behavior of an ideal observer faced with the task of estimating S given only P(S). Given P(S), the ideal observer picks the most likely stimulus ŝmax as its estimate of S, and P(c) = P(smax) (Duda et al., 2001). H(S) is sometimes interpreted as a measure of the difficulty in estimating S on a single trial: the higher the entropy, the less likely that an ideal observer will correctly estimate S (Alkasab et al., 1999). Formally:

H(S1) ≥ H(S2) ⇔ P(c)1 ≤ P(c)2,
(4.4)
where S1 and S2 are two random variables and P(c)i is ideal observer performance in estimating Si on the basis of P(Si). That equation 4.4 is incorrect can be shown by counterexample. Let P(S1) = {65/100, 18/100, 17/100} and P(S2) = {50/100, 49/100, 1/100}. In this case, H(S1) = 1.28 > 1.07 = H(S2) and P(c)1 = 0.65 > 0.50 = P(c)2. The rest of this section provides a more general analysis of the relationship between P(c) and H(S). We first review previous results that show how to calculate the range of H(S) values that are consistent with a given value of P(c) (Tebbe & Dwyer, 1968; Kovalevsky, 1968). Toward this end, we define M_P(c) as the set of all M-element probability distributions for which ideal observer performance is P(c). Since entropy is invariant under permutations
Quantifying Stimulus Discriminability
Figure 6: Some members of the set 4_0.5 (i.e., M_P(c) with M = 4 and P(c) = 0.5). The empty circle indicates P_min(4_0.5), the distribution in 4_0.5 with the minimum entropy (H = 1 bit). The filled circle indicates P_max(4_0.5), the distribution with the maximum entropy (H = 1.8 bits).
of elements of P(S), we can, without loss of generality, sort the probability values in each distribution in M_P(c) in descending order (Feder & Merhav, 1994). That is, by construction, all distributions in M_P(c) satisfy
(4.5)
Figure 6 shows an example of M_P(c) for the case in which P(c) = 0.5 and M = 4. Each distribution in 4_0.5 has the same maximal element P(s1) = P(smax) = 0.5, but the probability mass in the remaining three elements can vary arbitrarily as long as the constraints provided by equation 4.5 are satisfied. H(S) is sensitive to this variability, but P(c) is not. In fact, the distributions in M_P(c) take on a range of H(S) values. We denote the distribution in M_P(c) that has the minimum entropy P_min(M_P(c)) and the distribution with the maximum entropy P_max(M_P(c)). We denote the corresponding entropy values h_min(M_P(c)) and h_max(M_P(c)), respectively. Previous papers (Tebbe & Dwyer, 1968; Kovalevsky, 1968) show that P_max(M_P(c)) and h_max(M_P(c)) are
P_max(M_P(c)) = {P(c), P(e)/(M − 1), . . . , P(e)/(M − 1)}
(4.6)
h_max(M_P(c)) = H(C) + P(e) log(M − 1),
(4.7)
where C is the random variable with outcomes {c, e} defined in section 3.1. Equation 4.6 holds because once P(c) is fixed, the way to maximize entropy is to distribute the residual probability mass P(e) = 1 − P(c) evenly among the remaining M − 1 elements of the distribution. Equation 4.7 follows when equation 4.6 is substituted into equation 4.2. In Figure 6, P_max(4_0.5) is indicated with a filled circle, and this distribution has entropy h_max(4_0.5) = 1.8 bits. In Figure 7A, h_max(M_P(c)) is plotted as a function of P(c) for M = 2,
E. Thomson and W. Kristan

Figure 7: Comparison of H(S) and P(c). (A) Plot of the upper and lower bounds on H(S) as a function of P(c) for M = 2, M = 4, and M = 64 (panels 1, 2, and 3, respectively). The upper bound, h_max(M_P(c)), is indicated by a dashed line, and the lower bound, h_min(M_P(c)), is indicated by a solid line. The sets of allowable {P(c), H(S)} points are labeled A_M,h. Note that A_2,h (panel 1) is a line. In panel 2, the arch-shaped line segments corresponding to different values of k are indicated, as are the points corresponding to h_max(4_0.5) and h_min(4_0.5) from Figure 6 (filled and empty circles, respectively). (Recall that the integer k describes the minimum number of stimuli that must have probability P(c) when the goal is to minimize H(S); see the text.) In panel 3, the lower (0.14) and upper (0.69) bounds on P(c) when H(S) = 3 bits are indicated with filled circles and connected with a dashed line. As discussed in section 4.2, the graphs also plot the relationship between P(c|rj) and H(S|rj), so these quantities are included on the x- and y-axes, respectively. (B) Plot of PC_range(M, H) as a function of M shows that PC_range(M, H) increases exponentially with M. The filled circles show PC_range(M, 3), and the unfilled circles plot PC_range(M, 0.5). The lines are the best saturating exponential fit to the points (see text). The large filled circle on the PC_range(M, 3) line indicates PC_range(64, 3), the range delineated by the dashed line in panel 3 of Figure 7A.
4, and 64 (dashed lines). It can be seen that at a given value of H(S), the maximum possible P(c) value is specified by h_max(M_P(c)), so h_max(M_P(c)) provides an upper bound on P(c). Note that for a given P(c) value, h_max(M_P(c)) increases with log(M − 1), the only term in equation 4.7 that depends on M. h_max(M_P(c)) increases with M because for a larger M, there exist a greater number of elements across which the residual probability mass P(e) can be spread.

At the other extreme, to minimize H(S), the residual probability mass P(e) must be concentrated as much as possible while satisfying the constraints in equation 4.5. It is as if there are M cups, each of which can hold P(c) liters of water, and the goal is to distribute a single liter of water to the cups while filling as few of the cups as possible. For a given P(c) value, there exists an integer k that indicates the minimum number of cups that must be completely filled when following such a strategy. For the case in which M = 4 and P(c) = 0.5 (see Figure 6), P_min(4_0.5) is {1/2, 1/2, 0, 0}, so k = 2 and no water is distributed to the third or fourth cups. Consider also the case in which M = 4 and P(c) = 0.4. In this case, P_min(4_0.4) = {0.4, 0.4, 0.2, 0}; k is again two, but there remain 0.2 liters of water once the second cup is full, and this water must be distributed to the third cup (i.e., cup k + 1). Mathematically, this procedure of concentrating probability mass leads to the following equations for P_min(M_P(c)) and h_min(M_P(c)) (Tebbe & Dwyer, 1968; Kovalevsky, 1968). For each P(c) value between 1/M and 1, there exists an integer k between 1 and M − 1 such that 1/(k + 1) ≤ P(c) ≤ 1/k, and

P_min(M_P(c)) = {P(c)1, . . . , P(c)k, 1 − kP(c), 0, . . . , 0}
(4.8)
h_min(M_P(c)) = −[kP(c) log(P(c)) + (1 − kP(c)) log(1 − kP(c))].
(4.9)
As can be seen in equation 4.8, k corresponds to the case in which k out of the M stimuli are assigned a probability of P(c). The remaining 1 − kP(c) probability mass is assigned to stimulus k + 1, and stimuli k + 2 through M are assigned probabilities of zero. Figure 7A plots h_min(M_P(c)) as a function of P(c) for M = 2, 4, and 64 (solid lines). Each arch-shaped line segment in Figure 7A corresponds to a different value of k (in Figure 7A.2 the line segments are labeled with their corresponding k values). It can be seen in the figure that h_min(M_P(c)) provides a lower bound on P(c). That is, for a given value of H(S), the lowest possible value of P(c) is given by equation 4.9. Note that h_min(M_P(c)) does not depend on M. Increasing M, the number of possible stimuli, does not affect the outcome when the goal is to concentrate the probability mass on as few stimuli as possible. For instance, if P(c) is 0.5, then no matter how large M is, only the first two stimuli are assigned nonzero probabilities (e.g., P_min(6_0.5) = {1/2, 1/2, 0, 0, 0, 0}). The upper and lower bounds h_min(M_P(c)) and h_max(M_P(c)) circumscribe the set A_M,h of all possible points {P(c), H(S)} (see Figure 7). We note four general features of A_M,h. First, its upper and lower bounds, h_max(M_P(c)) and
h_min(M_P(c)), are tight (Feder & Merhav, 1994). That is, for a given M and P(c), there exist distributions P(S) whose entropies actually attain h_max(M_P(c)) and h_min(M_P(c)). These distributions are given by equations 4.6 and 4.8, respectively. Second, for a given value of P(c), the range of possible H(S) values increases with log(M − 1). For the case in which there are only two stimuli (M = 2), H(S) is uniquely specified by P(c). This is because once P(s1) is fixed, there is no freedom to vary the remaining probability mass: P(s2) must equal 1 − P(s1). As M increases, h_max(M_P(c)) increases with log(M − 1) without bound; log(M − 1) is the only term in equations 4.7 and 4.9 that depends on M, and it appears in the equation for h_max(M_P(c)). Third, while equation 4.4 is incorrect (i.e., higher entropy does not imply a decrement in ideal observer performance), it is possible to infer the range of P(c) values consistent with a given value of H(S). That is, given H(S), it is possible to calculate upper and lower bounds on P(c). This is because both h_max(M_P(c)) and h_min(M_P(c)) are one-to-one, strictly monotonic functions of P(c) (Feder & Merhav, 1994). Hence, both h_max(M_P(c)) and h_min(M_P(c)) have inverses that provide upper and lower bounds on P(c), respectively. While closed-form analytical solutions for the inverses do not exist, the bounds can be numerically estimated with accuracy limited only by machine precision.⁷ For example, numerical methods applied with M = 64 and H(S) = 3 bits show that P(c) can vary between 0.14 and 0.69, bounds marked with filled circles in Figure 7A, panel 3.⁸ In this case the range of P(c) values, defined as the difference between the maximum and minimum P(c) value, is 0.55. Fourth, the range of P(c) values consistent with a given H(S) is a saturating exponential function of M.
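Equations 4.7 and 4.9, and their numerical inversion, can be sketched in code. This is a rough illustration of ours: a simple grid search stands in for the spline method described in note 8, and the function names are our own.

```python
import math

def h_max(p_c, M):
    """Equation 4.7: maximum entropy of an M-element distribution with max element p_c."""
    p_e = 1.0 - p_c
    h_c = -sum(x * math.log2(x) for x in (p_c, p_e) if x > 0)  # H(C)
    return h_c + p_e * math.log2(M - 1)

def h_min(p_c, M):
    """Equation 4.9: minimum entropy, concentrating mass on k stimuli of probability p_c."""
    k = min(int(1.0 / p_c), M - 1)   # the integer with 1/(k+1) <= p_c <= 1/k
    rest = 1.0 - k * p_c             # the leftover mass, assigned to stimulus k+1
    h = -k * p_c * math.log2(p_c)
    if rest > 1e-12:
        h -= rest * math.log2(rest)
    return h

def pc_range(M, H, steps=20000):
    """Smallest and largest P(c) consistent with entropy H, by grid search over [1/M, 1]."""
    lo = hi = None
    for i in range(steps + 1):
        p = 1.0 / M + i * (1.0 - 1.0 / M) / steps
        if h_min(p, M) <= H <= h_max(p, M):
            if lo is None:
                lo = p
            hi = p
    return lo, hi

# At P(c) = 1/M the maximum-entropy distribution is uniform, so h_max = log2(M).
assert abs(h_max(1.0 / 64, 64) - 6.0) < 1e-9
lo, hi = pc_range(64, 3.0)
assert lo < hi
```

Because of the coarse grid and the approximations involved, the interval recovered for M = 64 and H(S) = 3 bits is close to, though not necessarily identical with, the spline-based values reported above.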
If we define the maximum and minimum P(c) values consistent with a given H(S) as PC_max(M, H) and PC_min(M, H), respectively, then the range of P(c) values at a given entropy is given by PC_range(M, H) = PC_max(M, H) − PC_min(M, H). Figure 7B plots PC_range(M, 0.5) and PC_range(M, 3.0) as functions of M (open and filled circles, respectively). As can be seen in the figure, PC_range(M, H) is a saturating exponential function of M that saturates at 1 − PC_min(M, H). We expected this saturating exponential based on the following two considerations. First, the range of H(S) values at a given P(c) increases logarithmically with M (see above), and this is due to the fact that h_max(M_P(c)) increases logarithmically
⁷ Equations 4.7 and 4.9 are of the form y = x + log(x), for which there exists no analytical solution for the inverse.
⁸ We carried out the calculation of the upper bound on P(c) as follows. A set of h_max(M_P(c)) values was obtained over the range P(c) = 1/4 to 1 using equation 4.7. Then a cubic spline was fit to these numbers using P(c) as the dependent variable (de Boor, 1978). Then, given any value of H(S), P(c)_max can be estimated using the spline. A similar method, based on equation 4.9, was used to calculate the lower bound on P(c).
with M. Because the inverse of the log is an exponential function, and PC_max(M, H) grows with the inverse of h_max(M_P(c)), PC_range(M, H) should grow exponentially with M. However, since P(c) can never be greater than 1.0, PC_range(M, H) must also be bounded from above by 1 − PC_min(M, H), so we also expected the curve to saturate at that value. While there do not exist analytical solutions for PC_range(M, H) (see the previous paragraph), the points are indeed well fit by a saturating exponential function constrained to have a maximum of 1 − PC_min(M, H) (see Figure 7B).⁹

4.1.3 Multiple Question Ideal Observers and Entropy. If an ideal observer makes an error on the first guess and is given another chance to guess S, then it will pick the stimulus with the maximum probability in the conditional distribution P(S|rj, S ≠ smax(j)). In that case, the analysis from the previous section applies, though with the number of stimuli effectively reduced to M − 1. More generally, after G guesses, the stimulus set is effectively reduced to M − G stimuli, where G can vary between zero and M − 1. In contrast, if the goal is to guess the value of a random variable using the fewest number of yes-or-no questions (i.e., the game of 20 Questions; Cover & Thomas, 1991), then the questions that receive yes and no answers can be represented by ones and zeros, respectively. Then, by the source coding theorem, H(S) describes the best possible performance L_min (Cover & Thomas, 1991). In 20 Questions, the optimal question sequence will cut down the range of possible outcomes to quickly specify the stimulus. For instance, if there are eight equiprobable stimuli, then the initial question should be of the form, “Is it above four?” The ideal observer, on the other hand, must always guess a single outcome; that is, the ideal observer must always ask questions of the form, “Was it an 8?”

4.2 Response-Conditional Entropy and the Ideal Observer. The entropy of S given that rj is observed is known as the response-conditional entropy H(S|rj), which is defined as (Cover & Thomas, 1991)

H(S|rj) = −Σ_{i=1}^{M} P(si|rj) log(P(si|rj)).   (4.10)
Note that equation 4.10 is simply equation 4.2 with the conditional distribution P(S|rj) substituted for P(S). Hence, all the properties of H(S) (see section 4.1.1) extend to H(S|rj). For instance, in addition to measuring how evenly spread the conditional distribution P(S|rj) is, H(S|rj) provides bounds on the minimum average number of binary digits required to encode S once rj is known.

⁹ The lines are the least-squares fits to saturating exponentials of the form (1 − PC_min(M))(1 − e^(−((M−T)/α)^β)) + PC_range(T), where T is the smallest value of M for which the given H(S) is defined and α and β are free parameters.
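Since H(S|rj) is an ordinary entropy of an M-element distribution, the counterexample of section 4.1.2 carries over directly. Treating the two distributions from that section as conditional distributions P(S|r1) and P(S|r2) (a relabeling of our own), a quick check:

```python
import math

def entropy(p):
    """H in bits (equation 4.2; equation 4.10 when p is a conditional distribution)."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def ideal_observer(p):
    """P(c): the ideal observer always guesses the most likely stimulus."""
    return max(p)

# The distributions from section 4.1.2, reinterpreted as posteriors P(S|r1), P(S|r2).
p_r1 = [0.65, 0.18, 0.17]
p_r2 = [0.50, 0.49, 0.01]

# Higher conditional entropy, yet better ideal-observer performance:
assert entropy(p_r1) > entropy(p_r2)                  # 1.28 > 1.07 bits
assert ideal_observer(p_r1) > ideal_observer(p_r2)    # 0.65 > 0.50
```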
Since H(S|rj) is a function of P(S|rj), we compare H(S|rj) to the performance of an ideal observer of rj (i.e., P(c|rj)), which is also a function of P(S|rj). Recall that an ideal observer of rj will pick the most likely stimulus from the conditional distribution P(S|rj) (Rule 1). Because H(S|rj) is mathematically equivalent to H(S) (i.e., H(S|rj) is a function of an ordinary M-element probability distribution, and the ideal observer picks the stimulus with the maximum probability from this distribution), H(S|rj) has the exact same properties with respect to P(c|rj) that H(S) has with respect to P(c) (see section 4.1.2). For example, for a given P(c|rj), there is a range of possible H(S|rj) values bounded by

h_max(M_P(c|rj)) = H(C|rj) + P(e|rj) log(M − 1)   (4.7′)

h_min(M_P(c|rj)) = −[kP(c|rj) log(P(c|rj)) + (1 − kP(c|rj)) log(1 − kP(c|rj))].   (4.9′)
We label these equations 4.7′ and 4.9′ because they are simply equations 4.7 and 4.9 with P(c) replaced by P(c|rj) and M_P(c) replaced by M_P(c|rj) (see Figure 7). Also note that the set of possible {P(c|rj), H(S|rj)} points is identical to the set A_M,h of possible points {P(c), H(S)} discussed in section 4.1 (see Figure 7). We use the label A_M,h for both sets of points, letting the context make clear whether we are referring to entropy or response-conditional entropy.

4.3 Specific Information and the Ideal Observer. How much information does a specific response rj provide about the stimulus set S? In information theory, this is quantified by the specific information between S and rj (DeWeese & Meister, 1999). Specific information is formally defined as

I(S, rj) = H(S) − H(S|rj) = −Σ_{i=1}^{M} P(si) log(P(si)) + Σ_{i=1}^{M} P(si|rj) log(P(si|rj)).   (4.11)
Since H(S) is the original entropy of S, and H(S |r j ) is the entropy of S remaining after r j is observed, it follows that I (S, r j ) is a measure of how much entropy about S is removed upon observation of response r j . (For examples, see Figure 8A). Interestingly, I (S, r j ) can be negative (DeWeese & Meister, 1999), which occurs when P(S |r j ) is more evenly spread than P(S) (e.g., rows 1 and 2 in Figure 8A). Conversely, I (S, r j ) is positive when P(S |r j ) is more concentrated than P(S) (e.g., rows 3 and 4 in Figure 8A). Interpreted in terms of coding theory (see section 4.1.1), I (S, r j ) indicates how many fewer binary digits, on average, are needed to encode S once r j is known. This number is negative when more digits are required to encode S once r j is observed.
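The sign behavior of I(S, rj) can be checked with a small sketch. The prior and posteriors below are our own illustrative choices, not the distributions of Figure 8A:

```python
import math

def entropy(p):
    """H in bits of a discrete distribution."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def specific_information(p_s, p_s_given_r):
    """Equation 4.11: I(S, r_j) = H(S) - H(S|r_j)."""
    return entropy(p_s) - entropy(p_s_given_r)

p_s = [0.5, 0.25, 0.25]          # prior over stimuli; H(S) = 1.5 bits

# A response that flattens the posterior carries negative information...
i_neg = specific_information(p_s, [1/3, 1/3, 1/3])   # H(S|r) = log2(3) ≈ 1.585
# ...while one that concentrates the posterior carries positive information.
i_pos = specific_information(p_s, [0.9, 0.05, 0.05])

assert i_neg < 0 < i_pos
```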
Figure 8: Examples that compare P(c) and specific information. (A) Each row of the table contains a conditional distribution P(S |r j ). Columns 2–5 contain the corresponding ideal observer performance P(c |r j ), response-conditional entropy H(S |r j ), and specific information I (S, r j ), respectively. The calculation of I (S, r j ) assumes that P(S) is as given with an entropy H(S) of 1.25 bits. (B) Scatter plot of the four {P(c |r j ), I (S, r j )} pairs from the rows in the table in A. The point from row j of the table in A is denoted P I j . The absolute upper and lower bounds on I (S, r j ) as a function of P(c |r j ) are superimposed for reference (dashed and solid lines, respectively).
While specific information has not received much attention from neuroscientists (but see DeWeese & Meister, 1999), it is a potentially useful measure that can be used to determine whether certain classes of neuronal responses transmit more information than others. For instance, is I (S, r j ) greater for spike trains with higher firing rates? How is I (S, r j ) related to ideal observer performance? While this question has not been discussed in the literature, we extend the previous results to address it. Since I (S, r j ) measures the amount of information carried by a particular response r j about the stimulus set S, we compare I (S, r j ) to the performance of an ideal observer that has observed a particular response r j , that is, P(c |r j ). Before examining the general relationship between I (S, r j ) and P(c |r j ), we use the examples in Figure 8A to highlight three features of their relationship. First, surprisingly, a response can carry negative information about S and be a better predictor of S than a response that carries positive information about S (e.g., compare rows 1 and 3). Second, P(c |r j ) does not
uniquely specify I(S, rj) (compare rows 2 and 3). The converse also holds (rows 3 and 4). Third, associated with each conditional distribution P(S|rj) is a point {P(c|rj), I(S, rj)} that indicates the ideal observer performance and specific information for that distribution. Figure 8B plots the points that correspond to the four conditional distributions in Figure 8A. The bounds on the set of allowable such points, A_M,i, are superimposed on Figure 8B for comparison. The derivation of A_M,i, the set of allowable {P(c|rj), I(S, rj)} points, is a natural extension of the results from sections 4.1 and 4.2. Assume that we know P(S) (a reasonable assumption, as in most discrimination tasks P(S) is controlled by the experimenter). Then H(S), the first term in equation 4.11, is fixed. Hence, I(S, rj) varies only with the response-conditional entropy H(S|rj) (see section 4.2), the second term in equation 4.11. As shown in section 4.2, {P(c|rj), H(S|rj)} must lie in the set A_M,h, circumscribed by h_max(M_P(c|rj)) and h_min(M_P(c|rj)), bounds that are defined in equations 4.7′ and 4.9′, respectively. It follows from basic properties of inequalities that for a given P(c|rj), the lower and upper bounds on specific information are

i_min(P(S)_P(c|rj)) = H(S) − h_max(M_P(c|rj))   (4.12)

i_max(P(S)_P(c|rj)) = H(S) − h_min(M_P(c|rj)).   (4.13)
Figure 9 shows examples of i_min(P(S)_P(c|rj)) and i_max(P(S)_P(c|rj)) for different values of M and H(S). Four consequences of equations 4.12 and 4.13 deserve mention. First, if M (the number of stimuli) is fixed and H(S) varies, the shape of A_M,i does not change; it is merely shifted vertically with H(S) (see Figure 9). Second, the greater H(S) is, the fewer possible negative I(S, rj) values there are. If H(S) is maximized and equals log(M), then I(S, rj) is always greater than or equal to zero (see Figure 9). This is because in such a case, P(S|rj) cannot be more evenly spread than P(S), which is the necessary condition for the existence of a negative I(S, rj). At the other extreme, as H(S) → 0, most distributions P(S|rj) have greater entropy than the highly concentrated P(S), so there exists a greater number of possible negative values of I(S, rj). Third, the range of permissible I(S, rj) values at a particular P(c|rj) increases with log(M − 1). This is because i_max(P(S)_P(c|rj)) does not depend on M, and for a given P(c|rj), i_min(P(S)_P(c|rj)) decreases with log(M − 1). As with H(S), at a given value of H(S|rj), the range of permissible P(c|rj) values is a saturating exponential function of M (see section 4.1.2). Fourth, because i_min(P(S)_P(c|rj)) and i_max(P(S)_P(c|rj)) are invertible functions, it is possible to use numerical methods to calculate the range of P(c|rj) values consistent with a given I(S, rj) (see section 4.1.2).

4.4 Equivocation and the Ideal Observer. If we average the response-conditional entropy H(S|rj) (see section 4.2) over all R, we obtain the
Quantifying Stimulus Discriminability
Figure 9: Comparison of I (S, r j ) and P(c). (A) Plot of upper and lower bounds on I (S, r j ) as a function of P(c) (dashed and solid lines, respectively). In this example, M = 2, so the upper and lower bounds are the same and the dashed lines are not visible. The three sets of bounds shown correspond to cases in which stimulus distributions with different H(S) values are chosen, and these entropy values are indicated next to the corresponding three lines. δ stands for an arbitrarily small real number. (B) Same as A, but with M = 4.
equivocation (Cover & Thomas, 1991):

$$H(S \mid R) = \sum_{j=1}^{N} P(r_j)\, H(S \mid r_j) = -\sum_{j=1}^{N}\sum_{i=1}^{M} P(s_i, r_j)\,\log P(s_i \mid r_j). \quad (4.14)$$
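Equation 4.14 translates directly into code. A minimal numpy sketch (the joint distribution below is our own toy example, not one of the rows of Figure 10A):

```python
import numpy as np

def equivocation(P_sr):
    """H(S | R) in bits (eq. 4.14). P_sr[i, j] = P(s_i, r_j): rows are
    stimuli (M of them), columns are responses (N of them)."""
    P_r = P_sr.sum(axis=0)                  # response marginal P(r_j)
    with np.errstate(divide='ignore', invalid='ignore'):
        terms = P_sr * np.log2(P_sr / P_r)  # P(s,r) * log2 P(s | r)
    return -np.nansum(terms)                # 0 * log(0) terms drop out

# M = 2 stimuli, N = 2 responses; every response leaves a (0.8, 0.2) posterior.
P_sr = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
H_S_given_R = equivocation(P_sr)            # binary entropy of 0.2, ~0.722 bits
```

A noiseless channel (a permutation-like joint distribution) gives H(S | R) = 0, the other extreme of the average remaining uncertainty.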
H(S | R) is often described as the average uncertainty remaining in S once R is given (Ash, 1965). It describes the average minimum number of binary digits required to encode the value of S once R is specified. Before describing the general relationship between H(S | R) and P(c), we consider the examples in Figure 10A, which show four different joint
E. Thomson and W. Kristan
Figure 10: Examples comparing P(c) with H(S | R) and I (S,R). (A) The second column of the table contains four joint probability distributions. Columns 3–5 contain the corresponding values of P(c) (see equation 3.3), H(S | R) (see equation 4.14), and I (S,R) (see equation 4.17), respectively. (B) Scatter plot of the points {P(c), H(S | R)} corresponding to the rows from the table: the point labeled PHi corresponds to row i. The absolute upper and lower bounds on H(S | R) are overlaid for comparison.
distributions (M = 2, N = 3). For each distribution, P(S) = {1/2, 1/2}, so H(S) = 1 bit. Given the joint distribution P(S,R), H(S | R) and P(c) can be calculated by substituting the appropriate terms into equations 4.14 and 3.3, respectively. The examples highlight two features of the relationship between H(S | R) and P(c). First, H(S | R) is not uniquely specified by P(c) (rows 1 and 2). The converse also holds (rows 1 and 4). These examples illustrate that H(S | R) and P(c) measure quite different features of P(S,R). P(c) remains unchanged as long as the sum of the maximal elements in the columns remains the same, and the nonmaximal elements of P(S,R) can vary arbitrarily within this constraint. Equivocation, on the other hand, increases as the entropy of this nonmaximal probability mass is increased. For example, the gray area in
Figure 4 shows the probability mass that contributes to P(e). Provided that the maximal elements remain unchanged, H(S | R) will increase as this gray area is spread out and decrease as the gray area becomes more concentrated. Second, higher equivocation does not imply lower P(c) (rows 1 and 3). That is, just because a variable R1 removes more uncertainty about S than another variable R2 (i.e., H(S | R1) < H(S | R2)), it does not follow that an ideal observer can better estimate S on the basis of R1. More generally, just as with H(S), there exist upper and lower bounds on H(S | R) as a function of P(c). For a given M and P(c), we denote the upper and lower bounds on H(S | R) as H_max(M_P(c)) and H_min(M_P(c)), respectively. Previous papers (Tebbe & Dwyer, 1968; Kovalevsky, 1968) show that

$$H_{\max}(M_{P(c)}) = h_{\max}(M_{P(c)}). \quad (4.15)$$
Equation 4.15 shows that the upper bound on equivocation is the same as the upper bound on entropy. Figure 11A plots H_max(M_P(c)) for the cases M = 2, 4, and 64 (dashed lines). H_max(M_P(c)) therefore has exactly the same properties as h_max(M_P(c)), discussed in section 4.1.2. Equation 4.15 is equivalent to Fano's inequality (Cover & Thomas, 1991) and provides an upper bound on P(c): it delineates the best P(c) consistent with a given equivocation. On the other hand, the actual P(c) of a distribution with a given value of H(S | R) may be much lower than the upper bound provided by equation 4.15. Previous papers derive H_min(M_P(c)) (Tebbe & Dwyer, 1968; Kovalevsky, 1968), which provides the lowest P(c) consistent with a given equivocation. We present their result without proof. For P(c) between 1/M and 1, there exists an integer k (the same k introduced in equation 4.9) such that 1/(k + 1) ≤ P(c) ≤ 1/k, and

$$H_{\min}(M_{P(c)}) = \log(k) + k(k+1)\,\log\!\left(\frac{k+1}{k}\right)\left[P(e) + \frac{1-k}{k}\right]. \quad (4.16)$$

Just as was the case with h_min(M_P(c)), H_min(M_P(c)) is broken up into M − 1 line segments. Each segment is the straight line that connects the end points of the M − 1 arch-shaped segments that delineate h_min(M_P(c)), the lower bound on entropy (see section 4.1.2). The lower bounds on equivocation are shown in Figure 11A for the M = 2, 4, and 64 cases (solid lines), and we include the lower bounds on entropy [h_min(M_P(c))] in the same figure for comparison (light gray lines). Note that H_min(M_P(c)) is not a function of M, so at a given value of P(c), increasing M does not change the lower bound on equivocation. H_max(M_P(c)) and H_min(M_P(c)) circumscribe the set AM,H of allowable {P(c), H(S | R)} points, indicated in Figure 11A for M = 2, 4, and 64.10 We highlight four features of AM,H. First, Feder and Merhav (1994) give
10 Formally, AM,H is the convex hull of AM,h (Tebbe & Dwyer, 1968; Kovalevsky, 1968).
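Both equivocation bounds can be evaluated directly. A sketch, assuming the Fano form of h_max from equation 4.7 (defined earlier in the paper, not shown in this excerpt) for equation 4.15, and implementing the piecewise-linear bound of equation 4.16:

```python
import numpy as np

def H_max(pc, M):
    """Eq. 4.15 (Fano): upper bound on H(S | R) in bits at a given P(c)."""
    pe = 1.0 - pc
    if pe <= 0.0 or pe >= 1.0:
        return 0.0
    return -pc*np.log2(pc) - pe*np.log2(pe) + pe*np.log2(M - 1)

def H_min(pc):
    """Eq. 4.16 (Tebbe & Dwyer; Kovalevsky): lower bound on H(S | R),
    piecewise linear in P(c) with k = floor(1/P(c)); independent of M."""
    k = int(np.floor(1.0/pc + 1e-12))
    pe = 1.0 - pc
    return np.log2(k) + k*(k + 1)*np.log2((k + 1)/k)*(pe + (1.0 - k)/k)
```

As described above, H_min connects the points (1/k, log2(k)) by straight segments, so at P(c) = 1/k it coincides with the entropy of a uniform distribution over k outcomes.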
Figure 11: Comparison of P(c) and H(S | R). (A) Plot of the upper and lower bounds on H(S | R) as a function of P(c) for M = 2, M = 4, and M = 64 (panels 1, 2, and 3, respectively). The upper bound (Hmax (MP(c) )) is indicated by a dashed line and the lower bound (Hmin (MP(c) )) by a solid line. The corresponding lower bounds on entropy (h min (MP(c) )) are overlaid in gray for comparison. The sets of allowable {P(c), H(S | R)} pairs are labeled AM,H . (B) Example of how P(S) further constrains what values of P(c) are consistent with a given H(S | R) (M = 3 in this example). In this case, we assume that P(smax ) = 0.7, so P(c) cannot be less than 0.7 (see section 3.1). The bounds on H(S | R) with this additional constraint included are shown in black, and the bounds provided by equations 4.15 and 4.16 alone are shown in gray for comparison.
algorithms for generating joint distributions that actually attain the bounds, so the bounds are tight.11

11 As a caveat, note that for a fixed input distribution P(S), these bounds are not necessarily tight (see section 4.5.2).
Second, just as was the case with H(S), for a given P(c) the range of possible H(S | R) values increases with log(M − 1). This is because the only term in equations 4.15 and 4.16 that varies with M is the log(M − 1) term that equation 4.15 inherits from equation 4.7. As a corollary, the range of P(c) values consistent with a given H(S | R) increases exponentially with M. Third, although H(S | R) does not uniquely map onto P(c), the fact that H_max(M_P(c)) and H_min(M_P(c)) are invertible (Feder & Merhav, 1994) implies that it is possible to calculate upper and lower bounds on P(c) for a given H(S | R) (see section 4.1 for details). Also, if P(S) is known, an additional constraint can be used to further narrow the range of P(c) values consistent with a given equivocation. Namely, since P(c) must be greater than or equal to P(s_max) (see section 3.1), we can eliminate all P(c) values below P(s_max). As illustrated in Figure 11B, this constraint can considerably tighten the range of P(c) values consistent with a given H(S | R). Fourth, AM,H does not depend on the response distribution P(R) or the number of possible responses N. If the goal is to make inferences between H(S | R) and P(c), this is a very useful property, as it allows us to avoid two practical problems. First, it is in general very difficult to obtain unbiased estimates of N and P(R). This is partly because when R is a variable describing a neuronal response, N is usually quite large and most outcomes have a very low probability of occurring. In such cases, estimates of N and P(R) suffer from biases due to undersampling (Paninski, 2004; Orlitsky, Santhanam, & Zhang, 2003). Second, if H_max(M_P(c)) and H_min(M_P(c)) depended on N, then for each representation of the neural response we would have to recalculate the upper and lower bounds on H(S | R) in order to calculate the corresponding bounds on P(c). The independence of AM,H from N and P(R) circumvents both problems.
4.5 Mutual Information and the Ideal Observer. Typically, researchers calculate equivocation as an intermediate step in the estimation of mutual information, the information-theoretic quantity most often used by neuroscientists. The mutual information (also called transinformation and transmitted information) between random variables S and R is the average (over R) specific information:
$$I(S,R) = \sum_{j=1}^{N} P(r_j)\, I(S,r_j) = H(S) - H(S \mid R) = \sum_{j=1}^{N}\sum_{i=1}^{M} P(s_i,r_j)\,\log \frac{P(s_i,r_j)}{P(s_i)\,P(r_j)}. \quad (4.17)$$
I (S,R) is the standard measure of how much information a variable R transmits about variable S (Ash, 1965; Cover & Thomas, 1991).
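The three expressions in equation 4.17 can be checked against one another numerically; a sketch with an arbitrary toy joint distribution (our own, not from the paper):

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# P(s_i, r_j): M = 2 stimuli (rows), N = 3 responses (columns).
P_sr = np.array([[0.30, 0.15, 0.05],
                 [0.05, 0.15, 0.30]])
P_s, P_r = P_sr.sum(axis=1), P_sr.sum(axis=0)
M, N = P_sr.shape

# (a) average specific information, sum_j P(r_j) I(S, r_j)
I_a = sum(P_r[j] * (entropy(P_s) - entropy(P_sr[:, j] / P_r[j]))
          for j in range(N))
# (b) H(S) - H(S | R)
H_SgR = -sum(P_sr[i, j] * np.log2(P_sr[i, j] / P_r[j])
             for i in range(M) for j in range(N) if P_sr[i, j] > 0)
I_b = entropy(P_s) - H_SgR
# (c) the double sum over P(s,r) * log2[P(s,r) / (P(s) P(r))]
I_c = sum(P_sr[i, j] * np.log2(P_sr[i, j] / (P_s[i] * P_r[j]))
          for i in range(M) for j in range(N) if P_sr[i, j] > 0)
```

All three evaluate to the same number of bits, as equation 4.17 requires.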
4.5.1 Three Interpretations of I(S,R). Almost universally, I(S,R) is interpreted as a measure of the average amount of uncertainty about S removed by R (Ash, 1965; Cover & Thomas, 1991). Operationally, what does this mean? In this section, we discuss three interpretations of I(S,R) that often motivate its use by neuroscientists.

First, I(S,R) indicates how many fewer binary digits are required, on average, to encode S once R is specified. This follows from the fact that the specific information I(S, r_j) quantifies how many fewer digits are required to encode S once r_j is known (see section 4.3), and I(S,R) is the average of I(S, r_j) over all R.

Second, I(S,R) is a direct measure of the degree of statistical dependence between S and R (Schneidman, Bialek, & Berry, 2003). If S and R are independent variables, then for all stimulus-response pairs, P(s_i, r_j) = P(s_i)P(r_j) (Yates & Goodman, 1999). This equality implies that all arguments of the log in equation 4.17 are one, so I(S,R) is zero. Interestingly, if observed frequencies are used to estimate the probabilities in equation 4.17, then I(S,R) is one-half the log-likelihood test statistic (G²) when the null hypothesis is that S and R are independent (Forbes, 1995).

A third interpretation of I(S,R) is that it measures stimulus discriminability, or how well the stimulus can be predicted given the neuronal response (Alkasab et al., 1999; Arabzadeh et al., 2004; Buracas & Albright, 1999; Li et al., 2004; Paz & Vaadia, 2004; Petersen et al., 2002; Pola et al., 2003; Theunissen & Miller, 1991; for exceptions, see Oram, Foldiak, Perrett, & Sengpiel, 1998; Treves, 1997). That is, researchers often assume that I(S,R) can be used as a surrogate for ideal observer performance, P(c). For example, one paper claims, "Mutual information quantifies how well an ideal observer of neuronal responses can discriminate between all the different stimuli based on a single trial" (Pola et al., 2003, p. 37).
Even in time-series analysis, predictive information is defined as the mutual information between past and future events (Bialek, Nemenman, & Tishby, 2001), again suggesting that higher mutual information implies improved predictability. When examined quantitatively, this third interpretation of I (S,R) is shown to be incorrect. Let us assume P(S) is fixed, R1 and R2 are different representations of the neural response (e.g., PSTHs at different bin widths), and P(c)i is ideal observer performance when observing response variable Ri . The third interpretation is equivalent to I (S,R1 ) ≥ I (S,R2 ) ⇔ P(c)1 ≥ P(c)2 .
(4.18)
That is, if R1 carries more information about S than R2 , then an ideal observer would be better at estimating S on the basis of R1 than R2 . That equation 4.18 is false has been known since 1965 when a single counterexample was published (Wagner, 1965). In the next section we examine the general relationship between I (S,R) and P(c).
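A counterexample in the spirit of Wagner (1965) is easy to construct. The two channels below are our own illustration: R2 is an erasure-like code that transmits more information than R1 (0.5 versus about 0.28 bits), yet supports worse single-trial discrimination (P(c) of 0.75 versus 0.8):

```python
import numpy as np

def mutual_info(P_sr):
    P_s, P_r = P_sr.sum(axis=1), P_sr.sum(axis=0)
    return sum(P_sr[i, j] * np.log2(P_sr[i, j] / (P_s[i] * P_r[j]))
               for i in range(P_sr.shape[0]) for j in range(P_sr.shape[1])
               if P_sr[i, j] > 0)

def p_correct(P_sr):
    """Ideal (MAP) observer performance: the sum of the column maxima of P(S,R)."""
    return P_sr.max(axis=0).sum()

# R1: every response leaves the posterior (0.8, 0.2) -> P(c) = 0.8.
R1 = np.array([[0.4, 0.1],
               [0.1, 0.4]])
# R2: half the time S is signaled exactly (columns 2-3); the other half the
# response is an "erasure" with a uniform posterior (column 1) -> P(c) = 0.75.
R2 = np.array([[0.25, 0.25, 0.00],
               [0.25, 0.00, 0.25]])
```

Both joint distributions have equiprobable stimuli (H(S) = 1 bit), so the disagreement between I(S,R) and P(c) is not an artifact of different priors.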
Figure 12: Examples comparing I (S,R) to P(c). Each point labeled P Ii is the {P(c), I (S,R)} point corresponding to row i of the table in Figure 10A. The upper and lower bounds on I (S,R) as a function of P(c) are overlaid in gray for comparison. In this example, M = 2.
4.5.2 Comparing I (S,R) and P(c). We highlight certain features of the relationship between I (S,R) and P(c) using previous examples: the mutual information corresponding to the four joint distributions in Figure 10A is shown in column 5 of the table in Figure 10A. Figure 12 illustrates the {P(c), I (S,R)} points associated with each of these joint distributions. We note two features of the relationship between P(c) and I (S,R) from these examples. First, equation 4.18 is incorrect, as can be seen by comparing rows 1 and 3. Second, P(c) is not associated with a unique I (S,R) (rows 1 and 2). The converse is also true (rows 1 and 4). These examples suggest that there is a range of I (S,R) values consistent with a given P(c), and vice versa. To address their relationship more generally, we derive upper and lower bounds on I (S,R) as a function of P(c).12 Assume P(S), and hence H(S), is fixed by the experimenter. For a given P(c), there exist the following upper and lower bounds on I (S,R):
$$I_{\min}(P(S)_{P(c)}) = H(S) - H_{\max}(M_{P(c)}) \quad (4.19)$$

$$I_{\max}(P(S)_{P(c)}) = H(S) - H_{\min}(M_{P(c)}), \quad (4.20)$$

where H_max(M_P(c)) and H_min(M_P(c)) are as defined in equations 4.15 and 4.16 (see section 4.4).

12 Treves (1997) addresses a similar question under the assumption that M = N, that is, the number of responses equals the number of stimuli. Our results relax that assumption and produce tighter upper bounds.

The proof is as follows. Since the first term in equation 4.17 [H(S)] is constant, I(S,R) varies only with H(S | R). However, as
described in section 4.4, we know that all points {P(c), H(S | R)} must lie in AM,H, the region whose lower and upper bounds are given by equations 4.16 and 4.15, respectively. Equations 4.19 and 4.20 follow from this fact and basic properties of inequalities. Clearly, the bounds imposed on I(S,R) by equations 4.19 and 4.20 can be tightened further. For one, I(S,R) is always nonnegative (Cover & Thomas, 1991). Second, P(c) cannot be less than P(s_max) (see section 3.1). Plots of the bounds on I(S,R) with these two additional constraints added are shown in Figure 13 for different stimulus distributions P(S) (solid bold lines). The bounds provided by equations 4.19 and 4.20 alone are shown in light gray.

We mention three facts about the relationship between P(c) and I(S,R). First, since I_min(P(S)_P(c)) and I_max(P(S)_P(c)) are one-to-one, monotonically increasing functions of P(c), both functions have inverses that can be estimated using the methods discussed in section 4.1. Hence, given an estimate of the mutual information I(S,R), it is possible to calculate the range of P(c) values consistent with that estimate. As with all other quantities discussed so far, the range of P(c) values consistent with a given I(S,R) increases exponentially with M, the number of stimuli. Second, neither I_min(P(S)_P(c)) nor I_max(P(S)_P(c)) depends on P(R) or N. We discussed the benefits of this fact in section 4.4. Finally, using numerical optimization, we have generated many joint distributions P(S,R) that reach the bounds I_min(P(S)_P(c)) and I_max(P(S)_P(c)). Four such distributions are provided in Figure 10A. However, we have not proved in general that the bounds are tight. That is, for an arbitrary P(S) and P(c), it is not guaranteed that there exists a distribution P(R,S) such that I(S,R) = I_min(P(S)_P(c)) or I_max(P(S)_P(c)). We conjecture that the bounds are not generally tight.
Aside from numerical optimization procedures we have implemented in which the bounds were not reached (data not shown), this conjecture is based on the following reasoning. Given P(S) and P(c), there must exist channel matrices P(R | S) and response distributions P(R) such that P(R | S)^T P(S) = P(R). For this equation to hold and the bounds on I(S,R) to be tight, four constraints must be satisfied: the rows of P(R | S) must sum to 1, as must the elements of P(R); the resultant joint distribution P(R,S) must satisfy equation 3.3; and P(S,R) must reach the bound on I(S,R) given in equation 4.19 or 4.20. Even if N > M > 1, we think it is unlikely that these constraints can be satisfied for an arbitrary {P(S), P(c)} pair. An interesting area for future research would be to use these constraints to derive even tighter bounds on I(S,R).

4.6 Channel Capacity and the Ideal Observer. The capacity of a channel P(R | S) is defined as (Cover & Thomas, 1991)

$$C(P(R \mid S)) = \max_{P(S)} I(S,R). \quad (4.21)$$
Figure 13: Comparison of I (S,R) and P(c). (A) Plots of Imax (P(S) P(c) ) and Imin (P(S) P(c) ) as functions of P(c) (dashed and solid gray lines, respectively). In these examples, M = 3. Panels 1 and 2 plot bounds on mutual information under two different assumptions about the value of the stimulus distribution P(S), and these P(S) values are shown in each panel. The corresponding entropy (H(S)) and maximum stimulus probability (P(smax )) values are also shown in each panel. The black lines show the constraints that are added to the bounds due to the nonnegativity of I (S,R) and the fact that P(c) cannot be less than P(smax ) (see the text). (B) Same as in A, except M = 64.
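The gray curves in Figure 13 (equations 4.19 and 4.20) can be reproduced and inverted numerically. In the sketch below, the forms of H_max and H_min are those of equations 4.15 and 4.16, and the grid inversion is our own stand-in for the inverse-function methods of section 4.1:

```python
import numpy as np

def H_max(pc, M):      # eq. 4.15 (Fano upper bound on equivocation)
    pe = 1.0 - pc
    if pe <= 0.0 or pe >= 1.0:
        return 0.0
    return -pc*np.log2(pc) - pe*np.log2(pe) + pe*np.log2(M - 1)

def H_min(pc):         # eq. 4.16 (piecewise-linear lower bound)
    k = int(np.floor(1.0/pc + 1e-12))
    return np.log2(k) + k*(k + 1)*np.log2((k + 1)/k)*((1.0 - pc) + (1.0 - k)/k)

def pc_range(I, H_S, M, grid=20001):
    """Invert eqs. 4.19-4.20 numerically: return the smallest and largest
    P(c) whose [I_min, I_max] interval contains the observed I(S,R)."""
    ok = [pc for pc in np.linspace(1.0/M, 1.0, grid)
          if H_S - H_max(pc, M) <= I <= H_S - H_min(pc)]
    return min(ok), max(ok)

# M = 2 equiprobable stimuli (H(S) = 1 bit): an estimate of I(S,R) = 0.5 bits
# is consistent with any P(c) between 0.75 and about 0.89.
lo, hi = pc_range(0.5, 1.0, 2)
```

The further refinements plotted in black in Figure 13 (I(S,R) ≥ 0 and P(c) ≥ P(s_max)) could be added as extra conditions in the list comprehension.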
That is, the capacity is the maximum possible mutual information between S and R using the specified channel, where the maximum is calculated over the set of all possible input distributions. We use PC (S) to denote the stimulus distribution that solves this maximization problem. In general,
Figure 14: Examples comparing C(P(R | S)) to P(c). As discussed in the text, although C(P2 (R | S)) is greater than C(P1 (R | S)) (i.e., C2 > C1 ), an ideal observer of both channels achieves the same performance (P(c) = 2/3).
calculating the channel capacity is a difficult problem that is often solved using numerical optimization techniques (Cover & Thomas, 1991). As with the other information-theoretic quantities, we would like to know whether the fact that one channel has a higher capacity than another implies that an ideal observer of R would be better able to predict S using that channel. More precisely, it would be interesting to know whether the following is true:

$$C(P_1(R \mid S)) > C(P_2(R \mid S)) \Leftrightarrow P_1(c) > P_2(c), \quad (4.22)$$
where Pi (R | S) is channel i and Pi (c) is ideal observer performance using channel i. Also, we stipulate that the P(c) values are calculated when the input distribution P(S) is set to PCi (S), the stimulus distribution that satisfies equation 4.21 for channel i. To our knowledge, nobody has published analytical results that bear on equation 4.22. Also, one must be cautious in interpreting counterexamples to equation 4.22 because different channels will likely have different input distributions that cause the channel to reach capacity, and it is not clear that a direct comparison of P(c) in such cases is appropriate. The channels shown in Figure 14 provide a counterexample to equation 4.22, and we have constructed the example so that in both channels,
PC(S) is the same. The first channel is a symmetrical channel, so C(P(R | S)) and PC(S) can be calculated with known formulas (Cover & Thomas, 1991).13 By substituting the second channel's values into equation 4.17, the mutual information of the second channel simplifies to I(S,R) = (1/3)H(S), which is maximized when P(S) = {1/2, 1/2}. While the results of the previous sections do not suggest an obvious way to calculate upper and lower bounds on C(P(R | S)) as a function of P(c), the example in Figure 14 suggests that such bounds likely exist.

5 Extension to Continuous Distributions

5.1 Background and Definitions. The results in section 4 depend on the assumption that S and R are discrete sets. In practice, this is not a significant limitation, because continuous distributions are always effectively binned and discretized: we can never represent them with infinite precision. However, for conceptual completeness, we extend the previous analysis to continuous distributions. This extension requires that we modify both the definitions of the information measures and the measure of estimation error. The upshot of the analysis is that if S is continuous, the information measures provide even fewer constraints on the error measure than when S is discrete.

The entropy of a continuous distribution S, also known as the differential entropy, is defined analogously to entropy in the discrete case, with the sum replaced by an integral (Cover & Thomas, 1991):

$$h(S) = -\int_{-\infty}^{\infty} P(s)\,\log(P(s))\, ds. \quad (5.1)$$
The remaining information measures such as equivocation are also defined analogously to the discrete case (Cover & Thomas, 1991). The term differential entropy is used instead of entropy because the differential entropy does not have the same mathematical properties as entropy. For example, the differential entropy changes when S is multiplied by a scaling factor a (i.e., h(a S) = h(S) + log(|a |)) (Shannon & Weaver, 1949). Also, h(S) can be negative (Shannon & Weaver, 1949). For example, the differential entropy of a uniform distribution defined over the interval [−w/2, w/2] is log(w), which is negative for w < 1 (Cover & Thomas, 1991). A second change we make to accommodate continuous variables is in our evaluation of estimation error. This is because when sˆ is a point estimate of the stimulus, it follows from equation 3.2 that P(c|r j ) = 0, so P(c) = 0. Hence,
13 For a symmetrical channel, PC(S) is the uniform distribution and the capacity is log(M) − H(P(R | s_i)), where H(P(R | s_i)) is the entropy of an arbitrary row of the channel (Cover & Thomas, 1991).
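The footnote's formula can be checked against the standard Blahut-Arimoto algorithm (a textbook method for the numerical optimization mentioned above, not part of this paper); a minimal sketch:

```python
import numpy as np

def blahut_arimoto(W, tol=1e-10, max_iter=10000):
    """Capacity max_{P(S)} I(S;R) in bits for channel W[i, j] = P(r_j | s_i),
    via the standard Blahut-Arimoto iteration."""
    M = W.shape[0]
    p = np.full(M, 1.0/M)                        # current input distribution
    for _ in range(max_iter):
        q = p @ W                                # output marginal P(r_j)
        with np.errstate(divide='ignore', invalid='ignore'):
            D = np.nansum(W * np.log2(W / q), axis=1)  # KL(W_i || q) per input
        p_new = p * np.exp2(D)
        p_new /= p_new.sum()
        if np.max(np.abs(p_new - p)) < tol:
            p = p_new
            break
        p = p_new
    q = p @ W
    with np.errstate(divide='ignore', invalid='ignore'):
        C = float(np.nansum(p[:, None] * W * np.log2(W / q)))
    return C, p

# Binary symmetric channel, crossover 0.1: footnote 13 predicts capacity
# log2(2) - H(row) = 1 - H_b(0.1), about 0.531 bits, at the uniform input.
W = np.array([[0.9, 0.1],
              [0.1, 0.9]])
C, p_opt = blahut_arimoto(W)
```

For asymmetric channels (e.g., a Z-channel), the same iteration converges to the capacity-achieving, generally nonuniform, input distribution.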
P(c) is an inappropriate measure of success in estimating the value of a continuous variable S and needs to be replaced by an error term that increases with the distance between s and ŝ. The most common such measure is the squared error between s and ŝ, (s − ŝ)². The average (over S) squared error, or mean squared error, is a useful overall measure of the quality of ŝ as an estimate of S (Yates & Goodman, 1999):

$$\mathrm{MSE}(S) = \int_{-\infty}^{\infty} P(s)\,(s - \hat{s})^2\, ds. \quad (5.2)$$
When using MSE(S) to evaluate an estimate of S, the goal is to choose the estimate ŝ that minimizes MSE(S), the minimum mean squared error estimator of S. The minimum mean squared error estimator of any random variable S (discrete or continuous) is the mean of S, which we denote µ_S (Yates & Goodman, 1999). As a corollary, the minimum mean squared error itself, which we denote E(S), is the variance of S, σ²_S:

$$E(S) = \int_{-\infty}^{\infty} P(s)\,(s - \mu_S)^2\, ds = \sigma_S^2. \quad (5.3)$$
The first equality is just equation 5.2 with the minimum mean squared error estimator µ_S substituted for ŝ, and the second equality is the definition of the variance of S (Yates & Goodman, 1999). While not technically an ideal observer, the minimum mean squared error estimator is a close relative because it minimizes an error function. In sum, to extend the results from section 4 to continuous distributions, we must first replace the entropy-based information measures with the differential-entropy-based measures and then replace P(c) with E(S). In the following section, we briefly show how to calculate the bounds on each information measure as a function of E(S). The idea that such bounds could be derived was suggested, but not carried out, by Feder and Merhav (1994).

5.2 Comparing Information Measures and Minimum Mean Squared Error. The results from section 4 assume that S and R are discrete sets. The extension to the case where R is continuous and S is discrete is trivial, as none of the results depends on the assumption that R is discrete. However, for the reasons described in the previous section, the extension to continuous S requires a significant extension of the analysis, to which we now turn.

5.2.1 Differential Entropy and E(S). For a given value of E(S), there is an upper bound on h(S) given by Cover and Thomas (in press):
$$h_{\max}(S) = \frac{1}{2}\,\log(2\pi e\, E(S)). \quad (5.4)$$
Figure 15A plots this relationship.14 Equation 5.4 places a lower bound on E(S) for a given value of h(S) (Cover & Thomas, in press), as can be seen in Figure 15A. We prove that there exists no upper bound on E(S) for a given value of h(S). Assume that there exists such an upper bound U on E(S) when h(S) takes on some arbitrary value h. To generate a counterexample, let [−w/2, w/2] be the interval of a uniform distribution constructed to satisfy h = log(w), as depicted in Figure 15B.1. We then split this uniform distribution into two segments of width w/2, separated by a distance D (see Figure 15B.2). Because E(S) = w²/12 + (D² + Dw)/4, which can be verified by substituting the split distribution into equation 5.3, there exists a D such that E(S) > U. By indirect proof, we have shown that there exists no upper bound U on the error for a given entropy. The set B of possible {h(S), E(S)} points for continuous distributions comprises the curve delineated by equation 5.4 and all points to the right of the curve (see Figure 15A). It follows that for a given E(S), there is no lower bound on h(S); that is,

$$h_{\min}(S) = -\infty. \quad (5.5)$$
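The split-uniform construction in the proof can be checked numerically; a sketch (grid integration is our own choice of verification):

```python
import numpy as np

def split_uniform_error(w, D, n=200_001):
    """Minimum mean squared error E(S) (= variance) of a uniform density of
    total width w split into two w/2 segments separated by a gap D.
    The differential entropy stays log(w) for every D."""
    half = w/2.0
    s = np.concatenate([np.linspace(-D/2.0 - half, -D/2.0, n),
                        np.linspace(D/2.0, D/2.0 + half, n)])
    ds = half/(n - 1)
    p = 1.0/w                                # density height on the support
    mean = np.sum(p*s)*ds                    # ~0 by symmetry
    return np.sum(p*(s - mean)**2)*ds

w, D = 2.0, 6.0
E_numeric = split_uniform_error(w, D)
E_formula = w**2/12 + (D**2 + D*w)/4         # the expression in the proof
```

E(S) grows without bound in D while h(S) = log(w) is unchanged, which is the heart of the indirect proof above.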
In sum, h(S) can vary between −∞ and h_max(S) at a given E(S).

5.2.2 Response-Conditional Entropy and E(S | r_j). If we know that response r_j occurred and that the resulting conditional distribution of S is P(S | r_j), then the same analysis as in section 5.2.1 applies. That is, the minimum MSE estimator of S given r_j is µ_{S|r_j}, the mean of P(S | r_j), and the error E(S | r_j) in this case is σ²_{S|r_j}, the variance of P(S | r_j) (Yates & Goodman, 1999). Equations 5.4 and 5.5 apply, with S replaced by S | r_j.

5.2.3 Specific Information and E(S | r_j). Following a proof strategy exactly analogous to that in section 4.3, it can be shown that

$$I(S, r_j)_{\max} = \infty \quad (5.6)$$

$$I(S, r_j)_{\min} = -h_{\max}(S), \quad (5.7)$$
bounds, which are illustrated in Figure 15C.

14 Note that equation 5.4 is stated in Cover and Thomas (in press) without proof. Proof of equation 5.4: In the set of all continuous distributions with fixed variance σ², the gaussian has the maximum differential entropy, h_gauss(S) = (1/2) log(2πeσ²) (Cover & Thomas, 1991). Also, since the minimum error estimator has error E(S) = σ² (see section 5.1), the maximum differential entropy consistent with a given value of E(S) is that given by equation 5.4.
Figure 15: Comparison of continuous information measures with minimum mean squared error estimators. (A) Plot of the maximum differential entropy (h_max(S)) as a function of the minimum mean squared error E(S) (see equation 5.4). The set of points to the right of, and including, the curve is the set of permissible {E(S), h(S)} points. The same equation also describes the maximum response-conditional entropy and equivocation as a function of minimum mean squared error (see the text). (B) Geometrical representation of the proof that there is no upper bound on error for a given entropy. Panel 1 shows a uniform distribution with width w. When this probability mass is split into two sections of width w/2, and these sections are separated by distance D (panel 2), h(S) remains the same but the error increases with D² (see the text). (C) Plot of the lower bound on specific information (gray line) and mutual information (black line) as a function of E(S | r) and E(S), respectively. There is no upper bound on the information measures: the points to the right of, and including, the lines are the set of permissible {E(S), I(S,R)} points (see the text).
5.2.4 Equivocation and E(S). The average minimum mean squared error E(S) when the response variable R is available to help estimate S is the mean (over R) of E(S | r_j):

$$E(S) = \sum_{j=1}^{N} P(r_j)\, E(S \mid r_j). \quad (5.8)$$
If R is continuous, the sum can be replaced by the appropriate integral. The bounds on h(S | R) are identical to the bounds on h(S). To prove this, we use a mathematical technique borrowed from the papers that derived the bounds on equivocation in the case of discrete S (Tebbe & Dwyer, 1968; Kovalevsky, 1968).15 First, recall from section 5.2.2 that B, the set of permissible {E(S | r_j), h(S | r_j)} points, is fixed by equations 5.4 and 5.5. Second, note that

$$\{E(S),\, h(S \mid R)\} = \sum_{j=1}^{N} P(r_j)\,\{E(S \mid r_j),\, h(S \mid r_j)\}, \quad (5.9)$$
where the sum can be replaced by an integral if R is continuous. The right-hand side of equation 5.9 is simply a convex combination of points of the form {E(S | r_j), h(S | r_j)}, which from section 5.2.2 we know must be in the set B. Hence, the set B* of permissible {E(S), h(S | R)} pairs must lie in the convex hull of B. However, B is already a convex set, so B = B* (Boyd & Vandenberghe, 2004). In other words, the bounds on h(S | R) as a function of E(S) are identical to the bounds on h(S | r_j) as a function of E(S | r_j); that is,

$$h_{\max}(S \mid R) = h_{\max}(S) \quad (5.10)$$

$$h_{\min}(S \mid R) = h_{\min}(S) = -\infty. \quad (5.11)$$
These bounds are shown in Figure 15A.

5.2.5 Mutual Information and E(S). Using the results from section 5.2.4, it is possible to use an argument analogous to that in section 4.5 to show

$$I_{\max}(S,R) = h(S) - h_{\min}(S) = \infty \quad (5.12)$$

$$I_{\min}(S,R) = h(S) - h_{\max}(S). \quad (5.13)$$
The fact that I (S,R) ≤ h(S) (Cover & Thomas, 1991) provides an additional constraint and the set of permissible {E(S), I (S,R)} points with all of these constraints included is shown in black in Figure 15C.
15 For an introduction to the concepts from convex analysis used in the following proof, see Boyd & Vandenberghe (2004).
774
E. Thomson and W. Kristan
6 Discussion

6.1 Measuring Stimulus Discriminability. An animal that cannot discriminate successfully on single trials will not survive long in natural conditions. A frog, for example, will not get a second chance to catch a fly if it misses on its first try. As discussed in section 3.2, ideal observer performance, P(c), is a useful and natural measure of single-trial stimulus discriminability. Researchers also often use information measures with the goal of quantifying discrimination performance. In section 4, we showed that this is typically not justified. In particular, rather than there being a one-to-one relationship between information-theoretic quantities and P(c), there is typically a range of permissible P(c) values associated with a given information measure. In section 5, we showed that when the analysis is extended to continuous stimulus distributions, the problems with the information measures are only exacerbated, as they provide no upper bounds on estimation error.

If the goal is to make inferences from information measures to P(c), a caveat should be noted: as the number of stimuli (M) increases, the range of P(c) values associated with a given information-theoretic quantity increases exponentially (see section 4). Hence, it would be prudent to pick a stimulus set that is small enough to significantly narrow the range of permissible P(c) values. While the most conservative option is to use only two stimuli (M = 2), in some cases this will not be desirable because the stimulus space will not be adequately sampled.

6.2 Information Theory in Neuroscience. We have shown that to quantify stimulus discriminability, the tools of ideal observer analysis are preferable to those of information theory. Information theory, however, is helpful for answering questions that ideal observer analysis cannot address. We list three.
We do not intend the list to be exhaustive, but are simply mentioning those applications that follow most naturally from the results in section 4. First, as we discussed in section 4.5.1, mutual information I(S, R) measures the degree of statistical dependence between random variables S and R. It is a useful measure because it is completely general: it will detect any deviation from independence, whether it is due to linear correlation or some nonlinear dependency.¹⁶ Ideal observer analysis does not directly quantify such dependencies. But even if two random variables are dependent, the tools of ideal observer analysis are required to determine how well one can be predicted from the other. Second, there clearly exist psychophysical tasks that should be evaluated using information measures rather than P(c). For instance, if the goal is to
¹⁶ Of course, there exist other tests for statistical dependence (e.g., the χ² test), and it is up to the researcher to decide which statistic is appropriate for his or her data.
Quantifying Stimulus Discriminability
evaluate a subject’s performance in the game of 20 Questions, then H(S) indicates the best possible performance L_min (see section 4.1.3). Third, natural selection may have discovered a coding strategy that uses the fewest number of elementary symbols required to encode certain classes of stimuli. That is, natural selection may have solved the optimization problem of reaching L_min (see section 4.1.1). Such a hypothesis would be challenging to test empirically (e.g., it requires determining the set of elementary symbols used in the neural code, a problem discussed extensively in Brenner, Strong, Koberle, Bialek, & de Ruyter van Steveninck, 2000). Regardless of such practical difficulties, information theory contains the quantitative resources to address such questions about neural coding, resources that ideal observer analysis does not provide.

6.3 Open Questions and Future Directions. We finish by discussing five outstanding questions that are theoretically interesting and, to our knowledge, have not been addressed by previous work. First, it would be interesting to extend the discussion in section 4.6 by deriving general bounds on channel capacity as a function of ideal observer performance. Also, an extension of the result to capacity in continuous channels would be useful. Second, we have analyzed quantities that depend on entropy rather than entropy rates (Cover & Thomas, 1991; Strong, Koberle, de Ruyter van Steveninck, & Bialek, 1998). However, Feder and Merhav (1994) show that for stationary processes, the results for H(S) described in section 4.1 also apply to entropy rates, so we expect our results to generalize to information rates (Strong et al., 1998). That is, we expect that a neuron that transmits information at a higher rate than another neuron is not necessarily the better predictor of a time-varying stimulus. However, this idea should be formally developed.
Third, since spike trains are nonstationary (Berry & Meister, 1998), it would be interesting to determine how neurally inspired nonstationarities in P(S) and P(R | S) would affect the results in sections 4 and 5. Fourth, by picking as our point of comparison the ideal observer, we have treated all errors equally.¹⁷ This would be inappropriate in some instances. For example, if the goal is to estimate the spatial location of the stimulus, an estimate close to the actual location is better than an estimate that is completely off. In such cases, instead of judging an estimate as categorically right or wrong, the estimate should be evaluated by an error term that increases with the distance between ŝ and s, such as the mean squared error (see section 5.1). We implicitly addressed this concern in section 5.2, in which we used the minimum mean squared error as a cost
¹⁷ Technically, we have used a 0/1 loss function, also known as the Hamming distance between s and ŝ.
function. We showed that using the mean squared error as the cost function only amplifies the discrepancies between ideal observers and the information measures. However, a more general consideration of the relationship between arbitrary cost functions (Schneidman et al., 2003) and the information measures deserves analysis. Fifth, one of the virtues often stressed of the information-theoretic approach to neural coding is that the information measures are nonparametric (Paz & Vaadia, 2004; Petersen et al., 2002). Ironically, it is partly because we freed the proofs in sections 4 and 5 from assumptions about P(S) and P(R | S) that it was possible to demonstrate the differences between the information measures and ideal observers. On the other hand, under certain assumptions (e.g., the responses to different stimuli are univariate gaussians with identical variances), the relationship between P(c) and the information-theoretic quantities is one-to-one. We would like to know the minimal assumptions required about P(S) and P(R | S) for ideal observer performance to be uniquely specified by the information measures. We hope that the work presented here will act as an impetus for such an analysis so that we can know under what conditions it is possible to make unambiguous inferences between encoding and decoding measures in the nervous system.

Acknowledgments

Thanks to Kevin Briggman, E. J. Chichilnisky, Philip Gill, Dan Kersten, Samar Mehta, Joy Thomas, Jonathan Victor, the Sejnowski lab, Diane Whitmer, and an anonymous reviewer. Thanks especially to Barbara Leary.

References

Alkasab, T. K., Bozza, T. C., Cleland, T. A., Dorries, K. M., Pearce, T. C., White, J., & Kauer, J. S. (1999). Characterizing complex chemosensors: Information-theoretic analysis of olfactory systems. TINS, 22, 102–108.
Arabzadeh, E., Panzeri, S., & Diamond, M. E. (2004). Whisker vibration information carried by rat barrel cortex neurons. J. Neurosci., 24, 6011–6020.
Ash, R. B. (1965).
Information theory. New York: Dover.
Berry, M. J., & Meister, M. (1998). Refractoriness and neural precision. J. Neurosci., 18, 2200–2211.
Bialek, W., Nemenman, I., & Tishby, N. (2001). Predictability, complexity, and learning. Neural Computation, 13, 2409–2463.
Boyd, S., & Vandenberghe, L. (2004). Convex optimization. Cambridge: Cambridge University Press.
Brenner, N., Strong, S. P., Koberle, R., Bialek, W., & de Ruyter van Steveninck, R. R. (2000). Synergy in a neural code. Neural Comp., 12, 1531–1552.
Britten, K., Shadlen, M. N., Newsome, W. T., & Movshon, J. A. (1992). The analysis of visual motion: A comparison of neuronal and psychophysical performance. J. Neurosci., 12, 4745–4767.
Buracas, G. T., & Albright, T. D. (1999). Gauging sensory representations in the brain. TINS, 22, 303–309.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
Cover, T. M., & Thomas, J. A. (in press). Elements of information theory (2nd ed.). New York: Wiley.
de Boor, C. (1978). A practical guide to splines. New York: Springer-Verlag.
DeWeese, M. R., & Meister, M. (1999). How to measure the information gained from one symbol. Network: Comput. Neural Syst., 10, 325–340.
Duda, R. O., Hart, P. E., & Stork, D. G. (2001). Pattern classification (2nd ed.). New York: Wiley.
Feder, M., & Merhav, N. (1994). Relations between entropy and error probability. IEEE Transactions on Information Theory, 40, 259–266.
Forbes, D. A. (1995). Classification-algorithm evaluation: Five performance measures based on confusion matrices. J. Clin. Monit., 11, 189–206.
Geisler, W. S. (1989). Ideal observer theory in psychophysics and physiology. Physica Scripta, 39, 153–160.
Geisler, W. S. (2003). Ideal observer analysis. In L. M. Chalupa & J. S. Werner (Eds.), Visual neurosciences. Cambridge, MA: MIT Press.
Golic, J. (1987). On the relationship between the information measures and the Bayes probability of error. IEEE Transactions on Information Theory, 33, 681–693.
Green, D. M., & Swets, J. A. (1966). Signal detection theory and psychophysics. New York: Wiley.
Knill, D. C., & Kersten, D. (1991). Ideal perceptual observers for computation, psychophysics, and neural networks. In R. J. Watt (Ed.), Pattern recognition by man and machine (pp. 83–97). New York: Macmillan Press.
Kovalevsky, V. A. (1968). Character readers and pattern recognition. New York: Spartan Press.
Lawson, J. L., & Uhlenbeck, G. E. (1950). Threshold signals. New York: McGraw-Hill.
Lettvin, J., Maturana, H., McCulloch, W. S., & Pitts, W. (1959). What the frog’s eye tells the frog’s brain. Proc. IRE, 47, 1940–1951.
Li, W., Piech, V., & Gilbert, C. D. (2004).
Perceptual learning and top-down influences in primary visual cortex. Nat. Neurosci., 7, 651–657.
Liu, Z., Knill, D. C., & Kersten, D. (1995). Object classification for human and ideal observers. Vision Research, 35, 549–568.
MacKay, D. M., & McCulloch, W. S. (1952). The limiting information capacity of a neuronal link. Bull. Math. Biophys., 14, 127–135.
Oram, M. W., Földiák, P., Perrett, D. I., & Sengpiel, F. (1998). The “ideal homunculus”: Decoding neural population signals. TINS, 21, 259–265.
Orlitsky, A., Santhanam, N. P., & Zhang, J. (2003). Always good Turing: Asymptotically optimal probability estimation. Science, 302, 427–431.
Paninski, L. (2004). Estimating entropy on m bins given fewer than m samples. IEEE Transactions on Information Theory, 50, 2200–2203.
Paz, R., & Vaadia, E. (2004). Learning-induced improvement in encoding and decoding of specific movement directions by neurons in the primary motor cortex. PLoS Biol, 2(2), e45. Available online: http://www.plosbiology.org/plosonline/?request=get-document&doi=10.1371%2Fjournal.pbio.0020045.
Perkel, D. H., & Bullock, T. H. (1968). Neural coding. Neurosciences Research Bulletin, 6, 221–348.
Petersen, R. S., Panzeri, S., & Diamond, M. E. (2002). Population coding in somatosensory cortex. Curr. Opin. Neurobiol., 12, 441–447.
Pola, G., Thiele, A., Hoffmann, K. P., & Panzeri, S. (2003). An exact method to quantify the information transmitted by different mechanisms of correlational coding. Network: Comput. Neural Syst., 14, 35–60.
Rieke, F., Warland, D., de Ruyter van Steveninck, R., & Bialek, W. (1997). Spikes: Exploring the neural code. Cambridge, MA: MIT Press.
Schneidman, E., Bialek, W., & Berry, M. J. (2003). Synergy, redundancy, and independence in population codes. J. Neurosci., 23, 11539–11553.
Shannon, C. E., & Weaver, W. (1949). The mathematical theory of communication. Urbana: University of Illinois Press.
Strong, S. P., Koberle, R., de Ruyter van Steveninck, R. R., & Bialek, W. (1998). Entropy and information in neural spike trains. Phys. Rev. Lett., 80, 197–200.
Tebbe, D. L., & Dwyer, S. J. (1968). Uncertainty and the probability of error. IEEE Transactions on Information Theory, 14, 516–518.
Theunissen, F. E., & Miller, J. P. (1991). Representation of sensory information in the cricket cercal sensory system. II. Information theoretic calculation of system accuracy and optimal tuning-curve widths of four primary interneurons. J. Neurophys., 66, 1690–1703.
Treves, A. (1997). On the perceptual structure of face space. BioSystems, 40, 189–196.
Victor, J. D. (1999). Temporal aspects of neural coding in the retina and lateral geniculate: A review. Network, 10, R1–66.
Wagner, T. J. (1965). Some remarks concerning uncertainty and the probability of error. IEEE Transactions on Information Theory, 11, 144–145.
Yates, R. D., & Goodman, D. J. (1999). Probability and stochastic processes: A friendly introduction for electrical and computer engineers. New York: Wiley.
Received May 13, 2004; accepted September 4, 2004.
REVIEW
Communicated by Mark Plumbley
Nonlinear Complex-Valued Extensions of Hebbian Learning: An Essay Simone Fiori
[email protected] Facoltà di Ingegneria dell’Università di Perugia, Polo Scientifico e Didattico del Ternano, I-05100, Terni, Italy
The Hebbian paradigm is perhaps the best-known unsupervised learning theory in connectionism. It has inspired wide research activity in the artificial neural network field because it embodies some interesting properties such as locality and the capability of being applicable to the basic weight-and-sum structure of neuron models. The plain Hebbian principle, however, also presents some inherent theoretical limitations that make it impractical in most cases. Therefore, modifications of the basic Hebbian learning paradigm have been proposed over the past 20 years in order to design profitable signal and data processing algorithms. Such modifications led to the principal component analysis type class of learning rules along with their nonlinear extensions. The aim of this review is primarily to present part of the existing fragmented material in the field of principal component learning within a unified view and contextually to motivate and present extensions of previous works on Hebbian learning to complex-weighted linear neural networks. This work benefits from previous studies on linear signal decomposition by artificial neural networks, nonquadratic component optimization and reconstruction error definition, neural parameters adaptation by constrained optimization of learning criteria of complex-valued arguments, and orthonormality expression via the insertion of topological elements in the networks or by modifying the network learning criterion. In particular, the learning principles considered here and their analysis concern complex-valued principal/minor component/subspace linear/nonlinear rules for complex-weighted neural structures, both feedforward and laterally connected.

1 Introduction

The basic assumptions of connectionism on neural learning are reasonably nonrestrictive and rather appealing from an engineering point of view:
- The locality principle, which states that the modifications in the connections depend on the activities of pre- and postsynaptic neurons and do not depend on the activity levels of the other neurons
Neural Computation 17, 779–838 (2005)
© 2005 Massachusetts Institute of Technology
- The principle stating that the modification of synapses is slow compared with the characteristic time of neuron dynamics
- The forgetting principle, stating that if either the pre- or postsynaptic neurons or both are silent, then no synaptic changes take place except for exponential decay
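These principles can be caricatured in a few lines of code (a toy sketch with a single scalar synapse; the rate constants are arbitrary choices, not part of any model discussed here):

```python
def local_update(w, pre, post, lr=0.1, decay=0.05):
    """One synaptic update obeying the three principles above.

    Locality: the change uses only the pre- and postsynaptic activities.
    Slowness: lr and decay are small relative to the activity dynamics.
    Forgetting: when pre * post == 0, the weight simply decays exponentially.
    """
    return w + lr * pre * post - decay * w

w = 1.0
for _ in range(100):
    w = local_update(w, pre=0.0, post=1.0)  # presynaptic neuron is silent
print(w)  # the unused synapse has decayed close to zero
```

With a silent presynaptic neuron, the Hebbian term vanishes and only the exponential decay acts, illustrating the forgetting principle.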
Classical contributions to this analysis are given in Bechtel and Abrahamsen (1993) and Linsker (1992). Perhaps the most influential work in the history of connectionism is the contribution of the neuropsychologist Donald Hebb. In his famous book, Hebb (1949) presented a theory of behavior based on the physiology of the nervous system. He reduced the types of physiological evidence to two main categories: the existence and properties of continuous cerebral activity and the nature of synaptic transmission in the central nervous system. Hebb combined these principles in order to develop a theory of how learning occurs within an organism. As a matter of fact, Hebbian learning and cell/synapses creation/death are kinds of learning for which there is neural evidence. In Hebbian learning, the weights between neurons are adjusted so that each weight better captures the relationship between the units. The Hebbian learning rule specifies that the weight of a connection between two units should be varied in proportion to the product of their activation. It states that the connections between two neurons might be strengthened if the neurons fire simultaneously. As a result, if a unit fires when presented with a pattern, the weights from the active inputs are strengthened, so that the unit will respond to the same pattern even better in the future. In a network of many units, however, a mechanism is necessary to prevent different neurons from encoding the same features. In particular, a way to encourage sparse representations between units is to promote unit decorrelation through lateral inhibition (Barlow, 1989, 1998, 2001; Földiák, 1990). Lateral inhibition is a distinguishing feature of some artificial unsupervised neural networks. It provides a mechanism that makes a network of neurons hierarchical and with which the neurons compete for the right to generate a response to the incoming stimuli.
This kind of competition enables the abstract neuronal units to form receptive fields that are sensitive to stimuli coming from different regions of the input space (Spratling & Johnson, 2002). Mutual inhibition between units may be achieved by training the lateral connections via an anti-Hebbian rule: whenever two units in a layer are active simultaneously, the connection between them becomes more inhibitory, so that their correlation is decreased, and their joint activity will be discouraged in the future (Földiák, 1990). For the standard weight-and-sum neuron model, Hebbian learning makes the synapses be correlators. In some circumstances, the plain Hebbian learning continually strengthens a unit’s weights without bound (Baldi & Hornik, 1995; Bechtel & Abrahamsen, 1993; Miller & MacKay, 1994). Within networks endowed with the appropriate structure,
however, the unboundedness problem is not present (consider, for instance, the recurrent error correction model discussed in Harpur, 1997). A number of years ago, Amari (1977) investigated a neural theory of association and concept formation, and Oja (1982) studied a stabilized version of the classical Hebb rule. They observed that under fairly mild conditions, the stabilized learning rule makes the neuron capture the dominant eigenvector of the input covariance matrix or, in other terms, the rule makes the neuron able to extract the principal component from the input multivariate random signal. Since Amari and Oja’s pioneering work, several new learning algorithms have been proposed for extending the one-unit principal component neural system to complete neural networks. The classical contribution in the field of principal component networks may be traced back to Sanger (1989), who used an online version of the well-known Gram-Schmidt orthogonalization algorithm. Also, Rubner and Tavan (1989) and Diamantaras and Kung (1996) introduced a linear neural network endowed with lateral inhibitory connections for achieving output decorrelation. Their algorithm is often referred to as the adaptive principal component extractor (APEX). In more recent years, several authors have introduced different principal component analysis (PCA) rules for generalizing classical ones (Abbas & Fahmy, 1994; Bannour & Azimi-Sadjadi, 1995; Costa & Fiori, 2001; Fiori & Piazza, 2000; Fiori, 2001, 2002). A PCA-related topic is principal subspace analysis (PSA), which concerns the computation of the subspace of the input space spanned by the principal eigenvectors (Oja, 1989). Extended discussions on Hebbian, anti-Hebbian, and modified Hebbian learning paradigms and their properties in the context of component and subspace analysis may be found in Atick and Redlich (1993), Fyfe and MacDonald (2002), Harpur (1997), and Weingessel and Hornik (2000).
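A minimal numerical sketch of the stabilized (Oja-type) rule in its discrete-time form, run on synthetic two-dimensional data (the covariance matrix, learning rate, and sample count are arbitrary illustration choices), shows the weight vector converging to the unit-norm principal eigenvector:

```python
import numpy as np

rng = np.random.default_rng(0)
# Zero-mean gaussian data with one dominant direction.
C = np.array([[3.0, 1.0],
              [1.0, 1.0]])
X = rng.multivariate_normal([0.0, 0.0], C, size=20000)

w = np.array([1.0, 0.0])
lr = 0.01
for x in X:
    y = w @ x
    w = w + lr * y * (x - y * w)   # discrete-time stabilized Hebbian update

# Compare against the principal eigenvector of the sample covariance.
evals, evecs = np.linalg.eigh(np.cov(X.T))
v = evecs[:, np.argmax(evals)]
alignment = abs(w @ v) / np.linalg.norm(w)
print(alignment, np.linalg.norm(w))  # alignment near 1, norm near 1
```

The Hebbian term y·x alone would grow w without bound; the stabilizing term −y²·w keeps the norm near 1 while the direction settles on the dominant eigenvector.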
It is likely that one of the reasons for the success of stable Hebbian learning theory is its usefulness for solving many signal processing problems, as illustrated, for instance, in Costa and Fiori (2001), Kung, Diamantaras, and Taur (1994), Luo and Unbehauen (1997), and Palmieri and Zhu (1995), where extracting the first principal components is shown to be of prime importance. Nevertheless, it has been clearly demonstrated that computing the last principal components of a data sequence—those principal components endowed with the smallest (nonzero) power—may be very useful as well (Gao, Ahmed, & Swamy, 1994; Klemm, 1987; Mathew & Reddy, 1994; Schmidt, 1986; Xu, Oja, & Suen, 1992). The extraction of the last principal components is commonly referred to as minor component analysis (MCA). Nonlinear versions of standard Hebbian or anti-Hebbian learning rules, namely, those rules based on the optimization of nonquadratic cost functions, may make the involved linear neural networks capable of performing the independent component analysis (ICA) of incoming signals (for a review of ICA in the neural network field, see, e.g., Bell & Sejnowski, 1995). An early
and persuasive physiological example of such behavior was presented by Barlow and Földiák (1989). They recalled that the information about the basic tastes (sweet, salty, bitter, and sour) is not carried by separate fibers; instead, each fiber carries a mixed signal with different relative sensitivities to the four basic tastes. The operation of separating the four mixed quantities, performed by the nervous system, could not be explained by a simple decorrelation rule. Thus, Barlow and Földiák proposed a modified (anti-Hebbian) rule for synaptic strength tuning based on the assumption of nonnegativity of the taste variables.¹ Classical contributions on nonlinear extensions to PCA may be found in Karhunen and Joutsensalo (1994, 1995) and Oja (1994). The aim of this review is to describe extensions of previous work to complex-weighted linear neural networks in a formal way. The guideline for the presented research is complex-valued gradient-based optimization of nonquadratic learning criteria. The derivations that are presented subsume some contributions found in the scientific literature: We recall interesting contributions from the advanced statistical theory of learning and try to relate sparsely published valuable works on these topics in a unifying view. These works span the following topics:
- Linear signal decomposition by artificial neural networks. This widely known theory allows expressing a multivariate signal as a linear combination of features or components that enjoy some special statistical property (such as uncorrelatedness or independence). The decomposition consists in jointly finding a basis for the signal and the corresponding components. The decomposition is then achieved by component optimization or by reconstruction error minimization.
- Nonquadratic component optimization and reconstruction error definition. Both rely on the proper selection of the involved nonquadratic learning criteria via available suitable interpretations of Hebbian learning, such as the ones suggested by Song, Yilong, and Feng (1998) and by Sudjianto and Hassoun (1995).
- Neural parameters adaptation by constrained optimization of learning criteria of complex-valued arguments. This is the basis for neural learning with physical or structural constraints, or both, when complex-valued input signals or input-output signal pairs are to be dealt with. The constraints considered here are related to the orthonormality of neural connection matrices.
- Orthonormality expression via the insertion of topological elements in the networks or by modifying the network learning criterion.
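The first topic can be made concrete: for a linear decomposition, minimizing the mean-square reconstruction error over all rank-k orthonormal bases yields the principal subspace. A sketch on synthetic data (dimensions, noise level, and sample counts are all hypothetical choices):

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.normal(size=(5, 2))        # hypothetical mixing basis
S = rng.normal(size=(2, 1000))     # latent components
X = B @ S + 0.01 * rng.normal(size=(5, 1000))  # observed multivariate signal

Xc = X - X.mean(axis=1, keepdims=True)
# Best rank-2 basis in the mean-square sense: top-2 left singular vectors.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
W = U[:, :2]                       # orthonormal basis: W.T @ W = I
components = W.T @ Xc              # extracted components
err = np.mean((Xc - W @ components) ** 2)
print(err)  # small: only the additive noise is lost in reconstruction
```

Since the signal truly lives in a two-dimensional subspace, the rank-2 reconstruction discards essentially nothing but the added noise.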
¹ This is what is currently referred to as nonnegative independent component analysis (Plumbley, 2003).
Topological constraints are given, for example, by lateral connections that force the neurons in a network to encode uncorrelated features. Modified learning criteria are created, for example, by the Lagrange multiplier method for equality constraints. The development of the theory of linear complex-weighted neural networks is supported by pervasive engineering and physiological motivations. In the supervised-learning field, for instance, it was reported (see Michaels & Upadhyaya, 1999) that complex-valued backpropagation models train faster than conventional backpropagation networks, are more resistant to local minima, and exhibit better generalization ability. By extending previous work proposed independently by some researchers (Benvenuto & Piazza, 1992; Georgiou & Koutsougeras, 1992), Nitta (1997, 2000, 2004) reported that complex-valued backpropagation architectures exhibit reduced probability of learning standstill and are able to perform transformations of geometric figures in the complex plane, which their real-valued counterparts are unable to effect. Also, following the pioneering work by Widrow (see, e.g., Widrow & Winter, 1988) on merging neural networks and adaptive filters, recently Hanna and Mandic (2003) presented a complex-valued nonlinear gradient-descent learning algorithm for a nonlinear neural adaptive filter with an adaptive complex-valued activation function; such a structure is reported to be beneficial when dealing with signals that have rich dynamical behavior. Among other applications, Miyauchi, Seki, Watanabe, and Miyauchi (1993) proposed an interpretation of optical flow based on complex-weighted neural networks, while Müezzinoğlu, Güzeliş, and Zurada (2003) recently proposed a design method for a complex-weighted multistate Hopfield memory, which appears as a generalization of the conventional Hopfield model that can be an efficient tool to process static integral information.
The new method was shown to outperform the generalized Hebbian rule, which until then had constituted the only learning rule for this model, in associating phase-modulated integral information. Also, from a biological point of view, the current interest in pulse-coded neural networks (Levine, Brown, & Shirey, 1999) has created a need for a mathematically compact way of dealing with phase as well as magnitude of neural signals, which may be easily found in complex-valued representations. In the unsupervised field, a first extension of Sanger’s classical PCA learning theory to the complex plane was presented by De Castro, De Castro, Amaral, and Franco (1998), with application to optimal image compression in the spectral domain. An application of nonlinear MCA extended to the complex domain has been presented recently in Fiori (2003b), with application to robust beam forming. Robust beam forming is a well-known signal processing technique that performs spatial filtering of a signal source in the presence of spatial noise and other disturbing sources by means of an array of electromagnetic antennas or electromechanical
microphones, provided that the direction of arrival of the primary source is known. A beam former may be realized by a complex-weighted neural unit fed with the Fourier transform of the measured signals, hence its complex-valued nature. A bioengineering application of adaptive beam forming is in a hearing aid (Kates, 1993). A way to train the beam-forming neuron is to force it to solve a correlation-matrix minimal-eigenvalue extraction problem, which may be formulated as an MCA problem (Fiori, 2003b). Further applications of complex-valued principal or minor component analysis have been reviewed and discussed in Luo and Unbehauen (1997). The theory of complex-weighted network learning has led to effective independent component analysis algorithms for complex-valued statistically independent signals (see, e.g., Fiori, 2000, 2003a). Typical applications of complex-valued ICA algorithms are to electromagnetic phenomena analysis and modeling (Desodt & Muller, 1990; Fiori, Faba, Albini, Cardelli & Burrascano, 2003). Further examples of applications of complex-valued ICA algorithms are frequency-domain algorithms for separation of sound signals and for biomedical data analysis (Cichocki & Amari, 2002). Also, Michaels and Upadhyaya (1999) provided an architecture for signal and image processing (focused on capacitance and inductance sensors) and for modeling the timing and pattern of neural signals in biological settings. In particular, they introduced a complex-valued associative memory theory and showed that in its iterative form, its learning procedure is a complex-valued Hebbian adaptation algorithm that uses association rather than error correction. Section 2 of this review is devoted to the complex-valued Hebbian learning theories stemming from the optimization of generalized network output power, while section 3 presents those extended Hebbian learning theories stemming from generalized reconstruction error minimization.
In particular, section 2.1 presents a learning theory for a linear feedforward architecture, where the constraints on symmetry and hierarchy are embodied in the learning criterion, while section 2.2 introduces a theory for Rubner-Tavan’s laterally connected network structure (Rubner & Tavan, 1989). Both sections present a comparison with the real-valued networks and rules counterparts. Section 2.3 presents a possible way of choosing the nonlinearity involved in the nonquadratic learning criteria based on the Sudjianto-Hassoun fruitful maximum-mismatch principle (Sudjianto & Hassoun, 1995). Section 2.4 presents some formal results pertaining to the convergence of learning rules in the one-unit case, while section 2.5 deals with the multiunit case. Section 3.1 illustrates a learning theory stemming from generalized reconstruction error minimization for a symmetrical network, and section 3.2 discusses some possible ways to select the involved nonquadratic error measures, while section 3.3 considers the hierarchical case and the corresponding choices of involved nonlinearities. Section 4 concludes the review.
2 Complex-Valued Hebbian Learning by Nonquadratic Output Optimization

This section presents a unified view of, and discusses generalizations of, network architectures and learning rules for principal component and subspace analysis of complex-valued signals. Generalization is achieved by extending the classical architectures to complex-weighted neural networks and by extending the classical learning criteria to nonquadratic optimization objectives through the introduction of nonlinear functions. A possible choice of the involved nonquadratic function is also discussed.

In this section, the Lagrange multiplier method is used extensively in order to construct suitable criteria for learning under orthonormality constraints. It is worth recalling how the Lagrange multiplier method for equality constraints works in practice (see Bertsekas, 1996, for details of the theory and specific computations). Suppose that J(w) and L_j(w), j = 1, . . . , m, with w ∈ R^p, are differentiable functions that map R^p → R, and the goal is to solve

    opt J(w)  such that  L_j(w) = 0,  j = 1, . . . , m,

namely, to find the optimum (maximum or minimum) of the function J(w) subject to the equality constraints L_j(w) = 0. This is equivalent to solving the unconstrained problem

    opt [ J(w) − Σ_{j=1}^{m} λ_j L_j(w) ]

for w and the λ_j. The variables λ_j ∈ R are termed Lagrange multipliers. It is not difficult to recast the above optimization problem in terms of matrix-type variables if necessary. In practice, the optimal multipliers λ_j^opt may be found analytically through a multiplier elimination method (Bertsekas, 1996). Then the optimal parameter vector w may be found by a gradient-based algorithm,

    dw/dt = ± ∂J^opt(w)/∂w,                                                  (2.1)

    ∂J^opt(w)/∂w ≝ ∂J(w)/∂w − Σ_{j=1}^{m} λ_j^opt ∂L_j(w)/∂w,                (2.2)

where the + sign denotes maximization and the − sign denotes minimization.
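As a concrete instance of equations 2.1 and 2.2, take J(w) = wᵀΣw with the single constraint L(w) = wᵀw − 1 = 0. Stationarity of the Lagrangian gives the eliminated multiplier λ^opt = wᵀΣw (a Rayleigh quotient on the constraint surface), and the resulting gradient ascent recovers the unit-norm dominant eigenvector. A sketch (the 2 × 2 matrix, step size, and iteration count are arbitrary choices; the common factor 2 in the gradients is absorbed into the step size):

```python
import numpy as np

Sigma = np.array([[3.0, 1.0],
                  [1.0, 1.0]])
w = np.array([1.0, 0.2])
step = 0.01
for _ in range(2000):
    lam = w @ Sigma @ w                   # eliminated multiplier: λ = wᵀΣw
    w = w + step * (Sigma @ w - lam * w)  # ascent on J − λL (factor 2 absorbed)

top = np.linalg.eigh(Sigma)[1][:, -1]     # dominant eigenvector of Σ
print(abs(w @ top), np.linalg.norm(w))    # both converge toward 1
```

This is the same flow that reappears in the Oja rule below, with Σ replaced by the instantaneous estimate x xᵀ.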
The contents of this section may be summarized as follows:

- Learning principal/minor component/subspace analysis by complex-valued gradient-based optimization of nonquadratic criteria
- Learning in an orthogonally constrained network by complex-valued gradient-based optimization of nonquadratic criteria, with output decorrelation by lateral inhibition
- Choice of the nonlinear criteria via the Sudjianto-Hassoun interpretation of Hebbian learning, illustrated by numerical examples
- Analysis of static and dynamical properties of one-unit complex-valued PCA and MCA neural systems, illustrated by numerical examples
- Multiunit complex-valued component and subspace networks: notes on dynamical properties and on the design of learning algorithms that agree with the orthonormality constraints
2.1 Complex-Valued Gradient-Based Optimization of Nonquadratic Criteria. The origin of modern Hebbian learning may be traced back to Oja's work on PCA learning theory. PCA is appropriate, for example, when measures on a number of observed variables have been obtained, and the goal is to develop a smaller number of artificial variables (termed principal components) that will account for most of the variance in the observed variables. The principal components may then be used as predictor or criterion variables in subsequent analyses. In particular, first PCA deals with the extraction of the principal component that accounts for the largest fraction of total variance. Oja's first principal component learning rule was based on a single neuron, described by y(t) = w^T(t) x(t), where x(t) ∈ R^p represents a stationary multivariate zero-mean random process endowed with finite covariance matrix Σ, w(t) ∈ R^p is the neuron's weight vector, and y(t) ∈ R is the neuron's output signal. The learning task of Oja's neuron is to maximize the variance of its response signal, namely of E[y²] = w^T Σ w, under the constraint w^T w = 1. This learning rule reads

dw(t)/dt = E[x(t)y(t) − w(t)y²(t)].
(2.3)
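A discrete-time counterpart of rule 2.3 (anticipating the discussion below) can be sketched as follows; this is our own demonstration, with illustrative data, dimensions, and step size:

```python
import numpy as np

# Discrete-time counterpart of Oja's rule (2.3):
# w <- w + eta * (x*y - w*y^2), with y = w^T x and a small step size eta.
rng = np.random.default_rng(1)
stds = np.array([np.sqrt(5.0), 1.0, np.sqrt(0.5)])   # dominant direction e_1
X = rng.standard_normal((50000, 3)) * stds           # zero-mean samples

w = rng.standard_normal(3)
w /= np.linalg.norm(w)
eta = 0.001
for x in X:
    y = w @ x
    w += eta * (x * y - w * y * y)    # Hebbian term plus stabilizing term

print(np.round(w, 2))    # approximately +/-[1, 0, 0], with unit norm
```

The stabilizing term −w y² keeps the weight vector near the unit sphere, so no explicit renormalization is needed.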
In the above relations, t denotes continuous time and E[·] denotes statistical expectation (ensemble average).² This expression clearly reveals the

² More formally, the ensemble average should be denoted by the symbol E_x[f|w], which denotes the conditional expectation of the function f(x) with respect to the statistics of x, subject to the hypothesis w. For conciseness, it will hereafter be written in the short notation E[f].
Nonlinear Complex-Valued Extensions of Hebbian Learning
787
presence of the Hebbian term x(t)y(t) and of a stabilizing term; thus, it is also referred to as a stabilized Hebbian learning equation. The Oja rule was initially formulated as an approximated normalized Hebbian learning algorithm by using a normalization step that binds the connection vector w to belong to a unit-radius sphere (Oja, 1982). In the discrete-time counterpart of such Hebbian learning, the learning step size is usually supposed to be very small; therefore, it is possible to expand the normalized Hebbian rule by Taylor series with respect to the learning step size and to truncate the expansion to the first-order term. This procedure leads to the discrete-time version of learning rule 2.3. In first PCA, it is known that the extracted weight vector coincides with the eigenvector of the covariance matrix corresponding to its largest eigenvalue. In general PCA, more than one unit-norm weight vector is sought. As it is known that the eigenvectors of a covariance (hence symmetric) matrix are orthogonal to each other, it is necessary to enforce the orthogonality of the network's weight vectors. Concepts related to PCA are minor component analysis, which consists of extracting the weakest principal components, and principal or minor subspace analysis, which consists in finding (any basis of) the linear subspaces spanned by the principal or minor vectors. In order to generalize Oja's work to the complex domain, a linear neural network may be considered, which is formed by linear complex-weighted units. Let the following learning objective function for the neural net be defined,

J(w_k) ≝ U(w_k) + L(w_k), k = 1, ..., m,
(2.4)
where w_k ∈ C^p represents the weight vector of the k:th neuron. The criterion U(·) contains a nonlinear function of the k:th neuron's output y_k ≝ w_k^H x and is defined as follows,

U(w_k) ≝ E[f(w_k^H x)],  (2.5)

where x ∈ C^p is the input vector of the network, whose values are drawn from a p-dimensional complex-valued input space, and the superscript H denotes Hermitian transpose (i.e., transpose plus complex conjugation). The number of neurons of the network is denoted here by m, where m ≤ p. The function f(·) is a real-valued, positive function of a complex-valued argument and is of the form

f(ζ) ≝ g(|ζ|), ζ ∈ C, g: R₀⁺ → R₀⁺.  (2.6)

The function g(u) is normally required to be differentiable and convex, with a minimum at u = 0, in a convenient right-sided neighborhood of the origin.
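As an illustration (our example, not one prescribed by the paper), the robust choice g(u) = log cosh(u) satisfies these requirements: it is differentiable and convex with a minimum at u = 0 on u ≥ 0, and it yields f(ζ) = g(|ζ|) together with the warping factor g'(u)/u = tanh(u)/u that will appear in the learning rules:

```python
import numpy as np

# An admissible nonquadratic choice (our illustration): g(u) = log cosh(u),
# giving f(zeta) = g(|zeta|) and the warping factor g'(u)/u = tanh(u)/u.
def f(zeta):
    return np.log(np.cosh(np.abs(zeta)))

def G(u):
    # g'(u)/u, with its limiting value 1 at u = 0
    return 1.0 if u == 0 else np.tanh(u) / u

z = 0.3 + 0.4j                   # |z| = 0.5
print(f(z))                      # log cosh 0.5, about 0.1201
print(G(0.5))                    # tanh(0.5)/0.5, about 0.9242
print(G(0.0))                    # 1.0 by continuity
```

For the quadratic choice g(u) = u²/2, the factor G(u) is identically 1, recovering the classical (Oja-type) rules.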
788
S. Fiori
The choice of the function f(ζ) as a real function of |ζ| seems reasonable for the following reasons:

- In the classic (Oja's) case, the objective function contains the neuron's output variance, which depends on the modulus (i.e., the unsigned value) of the output only.
- The leading principle in minor/principal component or subspace analysis is projection power minimization or maximization. In the signal processing literature, the power of a complex-valued signal is defined as the average of its square modulus. This quadratic optimization principle may be extended to robust (nonquadratic) optimization in a natural way, as shown in this and the following sections.
- Ample literature is available on signal processing techniques and algorithms that are based on transformed-output-modulus optimization principles. A recent review of such techniques in blind signal processing may be found, for instance, in Amari and Cichocki (1998), Chow and Fang (2001), Pandey (2001), and Thirion-Moreau and Moreau (2002).
- Some algorithms described in this article have readily found application to engineering-type signal-processing tasks (e.g., adaptive beamforming, blind separation of digital-modulation-type signals, and blind localization of electromagnetic sources), which are based on transformed-output-modulus optimization.
The function L(·) is needed for embodying the necessary orthonormality constraints on the weight vectors in criterion 2.4, namely:

w_j^H w_k = 0 if j ≠ k, w_k^H w_k = 1.  (2.7)
Note that each orthogonality condition may be rewritten more conveniently by observing that w_j^H w_k = 0 if and only if Re{w_j^H w_k} = 0 and Im{w_j^H w_k} = 0. Thus, the function L(·) may be expressed as

L(w_k) ≝ λ_kk (w_k^H w_k − 1) + ∑_{j=1, j≠k}^{K(k)} Re{λ_kj w_k^H w_j}.  (2.8)

A set of Lagrange multipliers {λ_kj} has been introduced in order to take into account the condition ensuring the normality of the weight vectors and the pairs of conditions needed for ensuring the orthogonality principle to be met after learning. In fact, the following identity holds: Re{λ_kj (w_k^H w_j)} = Re{λ_kj} Re{w_k^H w_j} − Im{λ_kj} Im{w_k^H w_j}.
Therefore, the expression Re{λ_kj (w_k^H w_j)} is a compact way to express the equality constraints Re{w_j^H w_k} = 0 and Im{w_j^H w_k} = 0 by the Lagrange multiplier method (the minus sign between the above terms does not matter, because the multipliers are variables to be determined). Note that the multipliers are complex valued, except for the λ_kk, which are real valued; therefore, the cost function L(·) is by construction real valued. The indexing function K(·) has been introduced in order to take into account two distinct cases:
- The symmetric case, where K(k) = m for each neuron, which leads to a generalized principal subspace analyzer
- The hierarchical case, where K(k) = k, actually leading to a generalized principal component analyzer

In order to look for optimal weights w_k^opt maximizing criterion 2.4, a gradient steepest-ascent learning algorithm is employed, following the general scheme shown in equations 2.1 and 2.2. By definition, the gradient of a real-valued function F(w) with respect to a complex-valued vector w is intended as

∂F(w)/∂w ≝ ∂F(u, v)/∂u + i ∂F(u, v)/∂v,  (2.9)

where i ≝ √−1 and u + iv = w; also, F(u, v): R^p × R^p → R denotes the real-valued function of two real-valued arguments derived from F(u + iv). First, it is necessary to evaluate the gradient:
∂U(w_k)/∂w_k = E[ (dg(|y_k|)/d|y_k|) ∂|y_k|/∂w_k ] = E[ (g'(|y_k|)/|y_k|) y_k^* x ].  (2.10)

In fact, from definition 2.9, it follows that |y_k| ∂|y_k|/∂w_k = y_k^* x, where the superscript * stands for complex conjugation. In the same way, the gradient of the function L(·) is to be evaluated. The expression of the gradient of L(w_k) with respect to w_k is found to be
∂L(w_k)/∂w_k = 2λ_kk w_k + ∑_{j=1, j≠k}^{K(k)} λ_kj w_j.  (2.11)

By gathering equations 2.10 and 2.11, we thus obtain

∂J(w_k)/∂w_k = E[ (g'(|y_k|)/|y_k|) y_k^* x ] + 2λ_kk w_k + ∑_{j=1, j≠k}^{K(k)} λ_kj w_j.  (2.12)
The standard elimination process may be performed in order to get rid of the dependence on the Lagrange multipliers. It consists of solving the following set of equations for the λ_kj, after having fixed the variable k,

w_r^H ∂J(w_k)/∂w_k = 0, r = 1, ..., m,  (2.13)

under the constraint of orthonormality of the vectors w_k expressed by equations 2.7. Tedious but straightforward calculations give, after index renaming, the result
λ_kj^opt = −E[ (g'(|y_k|)/|y_k|) y_k^* y_j ] (1 − (1/2) δ_kj),  (2.14)

where δ_kj denotes Kronecker's delta. It is interesting to observe that the multipliers do not enjoy any symmetry property unless g'(u) = u. In this special case, in fact, (λ_kj^opt)^* = λ_jk^opt would hold. By plugging expressions 2.14 into equation 2.12, the formula for the optimal gradient of J is readily found:

(∂J/∂w_k)^opt = E[ (g'(|y_k|)/|y_k|) y_k^* ( x − ∑_{j=1}^{K(k)} y_j w_j ) ].  (2.15)
Now the optimal gradient may be used in a steepest-ascent algorithm to design a neural optimizing system. In short, by defining the quantities

P_k ≝ I_p − ∑_{j=1}^{K(k)} w_j w_j^H and G(u) ≝ (1/u) dg(u)/du,  (2.16)

with u ∈ R⁺, the new learning rule is

dw_k/dt = P_k E[G(|y_k|) y_k^* x], k = 1, 2, ..., m.  (2.17)
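A discrete-time sketch of rule 2.17 in the hierarchical case K(k) = k, with the quadratic choice g(u) = u²/2 so that G ≡ 1, can be written as follows (our own demonstration; sizes, seed, and step size are illustrative). Since y_k = w_k^H x implies E[y_k^* x] = C w_k with C = E[x x^H], we can iterate directly on a known Hermitian covariance:

```python
import numpy as np

# Discrete-time sketch of rule (2.17), hierarchical case K(k) = k, G = 1.
rng = np.random.default_rng(2)
A = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))
C = A @ A.conj().T                        # Hermitian positive definite

m, eta = 2, 0.01
W = rng.standard_normal((4, m)) + 1j * rng.standard_normal((4, m))
W /= np.linalg.norm(W, axis=0)
for _ in range(5000):
    for k in range(m):
        # Deflator P_k = I - sum_{j<=k} w_j w_j^H (includes the k:th vector)
        Pk = np.eye(4) - W[:, :k + 1] @ W[:, :k + 1].conj().T
        W[:, k] += eta * (Pk @ C @ W[:, k])

# Each column approaches an eigenvector of C, up to a phase factor.
evals, Q = np.linalg.eigh(C)              # eigenvalues in ascending order
for k in range(m):
    print(abs(W[:, k].conj() @ Q[:, -1 - k]))    # close to 1
```

The deflator drives each weight vector into a different eigen-direction, in agreement with the hierarchical indexing K(k) = k.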
The factor E[G(|y_k|) y_k^* x] may be interpreted as a complex-valued extended Hebbian term common to each neuron, while the projector P_k is a deflating term that drives each weight vector w_k into a different subspace.³ It is worth noting that, by choosing as g(·) the function g(u) = ½u², which means having g'(u) = u and thus G(u) = 1, the gradient steepest-ascent
³ The term projector is used in a slightly extended sense here: unless the vectors w_k become orthonormal, the operator P_k does not represent a true projection because it is not idempotent, that is, P_k² ≠ P_k.
learning equations for a continuous-time PSA/PCA neural network with m outputs become

dw_k/dt = E[ y_k^* ( x − ∑_{j=1}^{K(k)} y_j w_j ) ], k = 1, 2, ..., m.  (2.18)

When K(k) = k, it coincides with the learning rule proposed by DeCastro et al. (1998), which rewrites compactly as

dW/dt = E[ x x^H W − W · UT(W^H x x^H W) ],

where UT(·) returns the upper-triangular part of the matrix contained within. If, furthermore, x ∈ R^p, then the DeCastro-DeCastro-Amaral-Franco rule coincides with Sanger's learning paradigm (Sanger, 1989). In the generalized case, the function g(·) is assumed different from quadratic, usually with the help of robust statistics theory (Karhunen & Joutsensalo, 1995). An alternative choice of this function will be discussed in section 2.3.

2.2 Output Decorrelation by Lateral Inhibition. Kung and Diamantaras realized a Hebbian learning network with Rubner-Tavan's laterally connected linear neural network (Rubner & Tavan, 1989), endowed with the unsupervised APEX learning rule (Diamantaras & Kung, 1996). The input-output network equations are

y = z + H^H y, z = W^H x,
(2.19)
where x ∈ C^p, y, z ∈ C^m, W ∈ C^{p×m}, and H ∈ C^{m×m}. Note that H is strictly upper triangular; thus, the network is hierarchical (but not recurrent). It is in fact worth noting that the network output signal y_k may be expressed as

y_k = w_k^H x + h_k^H y,  (2.20)

for k = 1, ..., m, with w_k ∈ C^p and h_k ∈ C^m. An exemplary laterally connected network of this kind is depicted in Figure 1. As the network is hierarchical, the vectors h_k enjoy the property h_kj = 0 for j ≥ k; thus, for instance, h_1 = [0 0 ··· 0 0]^T, h_2 = [h_21 0 ··· 0 0]^T, h_3 = [h_31 h_32 0 ··· 0 0]^T, and so forth. This implies that the quantity h_k^H y does not depend on w_k, but only on the w_j with j < k. Kung and Diamantaras proved that, under reasonable conditions, in the real-valued case, the APEX rule makes the column vectors of the direct connection weight matrix W asymptotically converge to the first principal eigenvectors of the covariance matrix of the input signal, and the lateral connection weight matrix H asymptotically vanishes to zero.
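Because H is strictly upper triangular, the implicit equation 2.19 can be solved output by output: y_k depends only on the previously computed y_1, ..., y_{k−1}. A minimal sketch of this forward pass (all names and sizes are illustrative):

```python
import numpy as np

# Forward pass of the laterally connected network, eqs. 2.19-2.20:
# z = W^H x and y = z + H^H y, with H strictly upper triangular.
def apex_forward(W, H, x):
    m = W.shape[1]
    z = W.conj().T @ x
    y = np.zeros(m, dtype=complex)
    for k in range(m):
        y[k] = z[k] + H[:, k].conj() @ y   # h_k^H y uses earlier outputs only
    return y

rng = np.random.default_rng(3)
p, m = 4, 3
W = rng.standard_normal((p, m)) + 1j * rng.standard_normal((p, m))
H = np.triu(rng.standard_normal((m, m)) + 1j * rng.standard_normal((m, m)), 1)
x = rng.standard_normal(p) + 1j * rng.standard_normal(p)

y = apex_forward(W, H, x)
# Consistency check against the implicit equation y = z + H^H y:
print(np.allclose(y, W.conj().T @ x + H.conj().T @ y))   # True
```

The sequential solution is exact, not iterative: the strict triangularity makes the network hierarchical rather than recurrent, as noted above.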
Figure 1: An exemplary network exhibiting lateral connections (direct connection weights and lateral connection weights).
Later, Chen and Hou (1992) extended the APEX algorithm to perform PCA of complex-valued random signals. They proved experimentally that the complex-valued APEX algorithm actually allows extracting a number of principal components from a complex-valued signal. The aim of this section is to give an alternative explanation of the complex-valued APEX learning rules, based on gradient optimization, and to extend this result to extended Hebbian learning. In order to design a PCA-type learning procedure for the laterally connected network illustrated in Figure 1, let us define the following criterion function:

N_k(W, H) ≝ E[f(w_k^H x)] + h_k^H E[y y^H] h_k.  (2.21)
The first term on the right-hand side contains the generalized power of the transformed signal z_k = w_k^H x, where again f(ζ) = g(|ζ|), ζ ∈ C, while the second term contains a linear combination of the cross-correlation values of the outputs. (Generalized power refers to the expectation of a nonquadratic function of the network's output.) By definition of PCA, the first term has to be maximized under the constraint w_k^H w_k = 1, while the second term must be zeroed. Generalized power maximization and decorrelation of output signals may be thought of as separate objectives. These targets may be attained through the following objective function:

P(W, H) ≝ ∑_{k=1}^m [ N_k(W, H) + λ_k (w_k^H w_k − 1) + E[ψ_k](h_k^H h_k) ],
(2.22)
where the functions λ_k and E[ψ_k] are Lagrange multipliers. The choice of denoting by E[ψ_k] the Lagrange multipliers corresponding to the constraints on the lateral connection strengths is motivated by the observation that, in the considered learning equations, the optimal values of the multipliers appear as the statistical expectation of some functions; thus, the adopted notation inherently embodies such features (Fiori & Piazza, 2000). Also, this choice makes the notation easier. It deserves to be noted that the terms ∑_{k=1}^m E[ψ_k](h_k^H h_k) add degrees of freedom to the learning system. Such terms ensure zero lateral forcing at convergence (i.e., h_k^H h_k = 0), and the presence of the free functions ψ_k affects the learning dynamics and can speed up the convergence of the network (Costa & Fiori, 2001; Fiori & Piazza, 2000) without changing the set of stationary points of the algorithm. The adaptation of the direct connections W aims at maximizing the power of the transformed signal by maximizing the objective function, equation 2.22, with respect to W only. In order to adapt each w_k, the gradient steepest-ascent rule shown in equations 2.1 and 2.2 may be used. By reasoning as in the previous section for the computation of the multipliers, the optimal gradient of P with respect to w_k is found to be
(∂P/∂w_k)^opt = E[G(|y_k|)(x y_k^* − z_k y_k^* w_k)], k = 1, 2, ..., m,
(2.23)
with G(·) being defined as in section 2.1. Then the lateral connection weight matrix H is used only in order to minimize the cost function defined in equation 2.22, so as to decorrelate the network's outputs. By defining the real and imaginary parts of y_k and h_k as y_k ≝ y_k^(R) + i y_k^(I) and h_k ≝ h_k^(R) + i h_k^(I), with some mathematical work we find:

∂y_k^(R)/∂h_k^(R) = y_[k]^(R), ∂y_k^(R)/∂h_k^(I) = y_[k]^(I),

∂y_k^(I)/∂h_k^(R) = y_[k]^(I), ∂y_k^(I)/∂h_k^(I) = −y_[k]^(R),

where y_[k]^(R) ≝ [y_1^(R) y_2^(R) ··· y_{k−1}^(R) 0 ··· 0]^T and y_[k]^(I) ≝ [y_1^(I) y_2^(I) ··· y_{k−1}^(I) 0 ··· 0]^T, with k > 1 and y_[1]^(R) = y_[1]^(I) = [0 ··· 0]^T. Hence, we find

∂N_k/∂h_k = 2E[y_[k] y_k^*],  (2.24)

where y_[k] ≝ [y_1 y_2 ··· y_{k−1} 0 ··· 0]^T with k > 1 and y_[1] ≝ [0 0 ··· 0 0]^T. The function P may be iteratively minimized, with respect to the variable
matrix H, by means of a gradient steepest-descent rule, where the gradient of P with respect to h_k assumes the expression

∂P/∂h_k = 2E[y_[k] y_k^* + ψ_k h_k], k = 1, 2, ..., m.

It is interesting to note that, in view of optimization, there are no theoretical reasons to force the functions ψ_k to assume any particular value (Costa & Fiori, 2001; Fiori & Piazza, 2000). In other terms, unless further constraints are established in the optimization of function 2.22, it is impossible to find optimal values for the multipliers E[ψ_k]. On the basis of the previous calculations, the obtained nonlinear complex-valued learning rules for output decorrelation by lateral inhibition read:

dw_k/dt = E[G(|y_k|)(x y_k^* − z_k y_k^* w_k)], k = 1, 2, ..., m,

dh_k/dt = −E[y_[k] y_k^* + ψ_k h_k], k = 1, 2, ..., m.
(2.25)
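Rules 2.25 can be simulated in discrete time; the sketch below (our own, with illustrative sizes, step size, and signal scales) takes the quadratic case G ≡ 1 with ψ_k = 0:

```python
import numpy as np

# Stochastic discrete-time sketch of rules (2.25), quadratic case G = 1,
# with psi_k = 0 and per-sample updates in place of the expectations.
rng = np.random.default_rng(6)
stds = np.array([2.0, 1.0, 0.5])          # C = E[x x^H] = 2*diag(stds^2)
p, m, eta = 3, 2, 0.002
W = (rng.standard_normal((p, m)) + 1j * rng.standard_normal((p, m))) * 0.1
H = np.zeros((m, m), dtype=complex)       # strictly upper triangular laterals

for _ in range(60000):
    x = (rng.standard_normal(p) + 1j * rng.standard_normal(p)) * stds
    z = W.conj().T @ x
    y = np.zeros(m, dtype=complex)
    for k in range(m):
        y[k] = z[k] + H[:, k].conj() @ y            # forward pass, eq. 2.19
    for k in range(m):
        W[:, k] += eta * (x * np.conj(y[k]) - z[k] * np.conj(y[k]) * W[:, k])
        y_prev = np.concatenate([y[:k], np.zeros(m - k)])
        H[:, k] -= eta * y_prev * np.conj(y[k])     # anti-Hebbian laterals

print(np.round(np.abs(W), 1))   # columns near the leading eigen-directions
print(np.round(np.abs(H), 2))   # lateral weights decay toward zero
```

In this run, the direct weight columns approach the two leading eigenvectors of the input covariance (here, the first two coordinate axes) while the lateral forcing vanishes, as the theory predicts.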
It is interesting to discuss two possible choices of the functions ψ_k and to relate the obtained learning rules to learning paradigms known from the scientific literature. The first one consists in assuming the free functions ψ_k to be null. In order to find a correspondence with a related learning rule in the scientific literature, we consider the classic (quadratic optimization) case, in which the function g(u) is assumed equal to ½u² and real-valued signals are dealt with. Then, in the hypothesis that for t large enough the lateral forcing has vanished (h_k ≈ 0), because the neurons have learned to encode uncorrelated features, from the first of equations 2.19 we see that z_k ≈ y_k. Consequently, the learning rules 2.25 can be approximated by

dw_k/dt = E[y_k (x − y_k w_k)], k = 1, 2, ..., m,

dh_k/dt = −E[y_[k] y_k], k = 1, 2, ..., m,  (2.26)

known as Rubner-Tavan's learning equations (Rubner & Tavan, 1989). It is easy to recognize a principal component rule for the direct connections and an anti-Hebbian rule for the lateral connections, in the spirit of the Hebbian and anti-Hebbian learning recalled in section 1. The second choice consists in assuming ψ_k = |y_k|². Again in the quadratic optimization hypothesis and under the asymptotic assumption z_k ≈ y_k, the
model 2.25 closely resembles the learning algorithm by Chen and Hou (1992) (which coincides with the Kung-Diamantaras learning rule in the presence of real-valued signals):

dw_k/dt = E[y_k^* (x − y_k w_k)], k = 1, 2, ..., m,

dh_k/dt = −E[y_k^* (y_[k] + y_k h_k)], k = 1, 2, ..., m.  (2.27)
It deserves to be noted that the learning equations for w_k in equation 2.25 differ from the corresponding Chen-Hou equations because of the presence of the term z_k y_k^* instead of y_k y_k^* = |y_k|². This difference may be nonnegligible: the quantity z_k y_k^* is a complex number; thus, it embodies a phase factor that |y_k|² does not possess. The difference disappears when m = 1, in which case the equations coincide with Oja's first principal component analyzer in the complex domain.

2.3 The Sudjianto-Hassoun Principle and Its Extension to the Complex-Valued Case. In sections 2.1 and 2.2, nonquadratic criteria optimization was dealt with in order to define extended Hebbian learning rules. Here we aim at briefly recalling the Sudjianto-Hassoun interpretation of Hebbian learning and extending this theory to the complex-valued case, because it naturally leads to a possible choice of nonquadratic criterion for learning. Also, an exemplary application to complex-valued independent component analysis will help in clarifying the usefulness of the Sudjianto-Hassoun principle in extended Hebbian learning.

Sudjianto and Hassoun (1995) considered the problem of maximizing a criterion J(w) ≝ E[S²(w^T x)] subject to the restriction w^T w = 1, where y = w^T x is the output of a single-unit neural network and S(·) is a generic saturating sigmoidal function, for instance such that S(y) ∈ [−1, +1]. They noted that maximizing the variance of a saturating function of y leads the model neuron to prefer configurations w that correspond to having the values of S(y) concentrated around the extremes −1 and +1. If the quantity v ≝ S(y) is perceived as a new random variable with probability density function q_V(v|w), q_V(v) in short notation, this corresponds to having this distribution U shaped (Sudjianto & Hassoun, 1995). The gradient steepest-ascent learning rule for this abstract neuronal system is

dw/dt = (I − w w^T) E[φ(y) x],
(2.28)
¯ the probability denwhere (u) = 2S (u)S(u). Let us denote now by q Y (y|w) ¯ and with sity function of the random variable y due to a configuration w
Q_Y(y|w̄) its cumulative distribution function, namely:

Q_Y(y|w̄) ≝ ∫_{−∞}^{y} q_Y(u|w̄) du.

Let us also assume S(y) = 2Q_Y(y|w̄) − 1. In this case, it is well known (Sudjianto & Hassoun, 1995) that the variable v will be uniformly distributed within [−1, +1]. The central idea developed by Sudjianto and Hassoun is that the learning rule, equation 2.28, will converge to a weight vector surely different from w̄, since the rule seeks a U-shaped distribution of v, that is, a distribution that deviates from a uniform one. In order to extend the Sudjianto-Hassoun principle to the complex-valued case, let us consider the cost function
U(w) ≝ E[S²(|y|)],
(2.29)
for a complex-weighted neuron with output y = w^H x, with x and w belonging to C^p. Its gradient steepest-ascent maximization under the constraint w^H w = 1 yields the learning rule

dw/dt = (I − w w^H) E[ φ(|y|) (y^*/|y|) x ],
(2.30)
which closely recalls equation 2.17 for m = 1. The learning rule 2.30 drives the function S(|y|) to take on values around its extremes, for instance +1; therefore, it seeks the configurations w making the probability density function q_V(v) different from a uniform one. Now we may consider for S(|y|) a function like

Q_|Y|(|y|) ≝ ∫_0^{|y|} q_|Y|(u) du.

By equating equation 2.30 to equation 2.17, with m = 1, the relationship between q_|Y| and g is found to be

g'(|y|) = 2 q_|Y|(|y|) ∫_0^{|y|} q_|Y|(u) du.  (2.31)
(2.31)
The concept of Sudjianto-Hassoun interpretation of Hebbian learning has been just touched on here. An extensive analysis of this interesting and fruitful theory has been recently proposed in Fiori (2003c) in the context of complex-valued independent component analysis. In summary, independent component analysis allows extracting statistically independent source signals from their linear combinations where the
mixing operator is unknown and the temporal dynamics of the source signals is also unknown. Another useful interpretation of ICA techniques concerns their feature extraction ability, which offers a mathematically sound way of decomposing signals that exhibit involved dynamics into independent basis signals (Cichocki & Amari, 2002; Hyvärinen, Karhunen, & Oja, 2001). Formally, the simplest source observation model is written as x = As, where s ∈ C^p denotes the vector of source signals, x ∈ C^p the vector of observed signals, and A ∈ C^{p×p} the matrix of mixing coefficients. In the above (noiseless, instantaneous) case, the matrix A may be assumed orthonormal without loss of generality. In fact, it is known that if A is not orthonormal, it is possible to preprocess the observed data via a so-called whitening algorithm that makes it possible to separate out the independent components via a rotation only (Cichocki & Amari, 2002; Hyvärinen et al., 2001). In this case, a linear artificial neural network described by y = W^H x, with W ∈ C^{p×p} orthonormal, may be effective in separating the independent source signals out from the observed mixtures. In the case that the sources are typical base-band digital signals used in telecommunications, such as QAM or PSK signals (QAM stands for quadrature amplitude modulation and PSK for phase shift keying; Haykin, 1989), we may give a quite appealing interpretation of the way the Sudjianto-Hassoun principle works. In fact, in the above-mentioned cases, the probability density function of the generic source signal modulus |s_k| is written as

q_{|s_k|}(|s_k|) = ∑_r q_{|s_k|,r} δ(|s_k| − s̄_{k,r}),  (2.32)
where the constant s̄_{k,r} denotes the r:th possible discrete value assumed by the quantity |s_k|, the symbol δ(·) denotes Dirac's delta, and the quantities q_{|s_k|,r} denote the probability that |s_k| = s̄_{k,r}. Now, the probability density function of each observation modulus is far from peaked (i.e., more similar to a uniform one), so the aim of the separating network is to make the output distributions of the artificial neural network as peaked as possible. This behavior is readily recognized as extending the principle "make the output distribution as U shaped as possible" to "make the same distribution as peaked as possible." These informal considerations offer further justification for the choice of learning rules based on the optimization of criterion functions that depend on the network output moduli only. The above informal considerations have been formally investigated in Bingham and Hyvärinen (2000) and Fiori (2000, 2003c). Some contributions on the use of nonclassical principal component techniques for blind source separation have been recently summarized in Fiori (2003c), and the choice of the involved nonlinear functions has been
discussed, for instance, in Hyvärinen (1999) and Mathis and Douglas (2002). The main weaknesses of these approaches seem to be:

- The nonlinear functions are generally chosen on the basis of existing contributions from robust statistics and on heuristic observations and findings.
- Their choice generally relies on the fact that using nonlinear functions adds high-order statistical features to the system, but gives little insight into how these features are related to the separation problem.
- Because of the strong nonlinearity of the learning equations, it is difficult to give any analytical proof of convergence or detailed studies about the features of the employed learning criteria.
In order to illustrate the usefulness of the Sudjianto-Hassoun principle in extended Hebbian learning for independent component analysis, and to clarify the above calculations with an example, let us suppose that the input x contains a complex-valued orthogonal mixture of statistically independent signals (Desodt & Muller, 1990; Laheld & Cardoso, 1994) and that one of these signals is a gaussian noise of the form ν = ν^(R) + iν^(I), where ν^(R) and ν^(I) are zero-mean gaussian random variables with variance σ². Then it is known that the modulus |ν| follows the Rayleigh distribution,

q_R(u) = (u/σ²) exp(−u²/(2σ²)) 1(u),

where 1(u) is the unit step function. Then, by formula 2.31, we find

g'_R(u) = (2u/σ²) [ exp(−u²/(2σ²)) − exp(−u²/σ²) ], u ≥ 0.
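Relation 2.31 can be checked numerically for this Rayleigh case: integrating the density and forming 2 q_R(u) Q_R(u) reproduces the closed form above (the grid and σ are illustrative choices):

```python
import numpy as np

# Numerical check of relation (2.31) for the Rayleigh case: the quantity
# 2 q_R(u) * integral_0^u q_R(v) dv should match the closed form g'_R(u).
sigma = 1.0
u = np.linspace(0.0, 5.0, 2001)
q = (u / sigma**2) * np.exp(-u**2 / (2 * sigma**2))     # Rayleigh density
Q = np.concatenate([[0.0],
                    np.cumsum((q[1:] + q[:-1]) / 2 * np.diff(u))])  # trapezoid

g_closed = (2 * u / sigma**2) * (np.exp(-u**2 / (2 * sigma**2))
                                 - np.exp(-u**2 / sigma**2))
g_via_231 = 2 * q * Q
print(np.max(np.abs(g_via_231 - g_closed)))   # small discretization error
```

The agreement follows from exp(−u²/(2σ²)) · exp(−u²/(2σ²)) = exp(−u²/σ²), which is exactly how the closed form factors.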
Figure 2 depicts the Rayleigh warping function g'_R(u)/u for a unit-power noise. In this case, it is possible to express the cumulative distribution function as

Q_R(u) = 1 − exp(−u²/(2σ²)),
whose shape is illustrated in Figure 2. It is interesting to note that the obtained shape for the neuron model's nonlinear function Q_R(u) formally differs from the classical hyperbolic-tangent one, even if it resembles the usual saturating sigmoid. Also, in this case, the sigmoid's shape parameter σ has a clear physical meaning.

Figure 2: Rayleigh warping function g'_R(u)/u (solid line) and the corresponding cumulative function Q_R(u) (dashed line) for σ = 1.

In order to illustrate how the above theory works in practice, we consider a simple 4 × 4 complex-valued independent component analysis case. Figure 3 shows the source signals s_1, s_2, s_3, and s_4 (three QAM16 signals and a complex gaussian noise of standard deviation σ = 0.5) and their four linear superpositions x_1, x_2, x_3, and x_4. The same figure also shows the histograms of the moduli |s_k| and |x_k|. It is clear that, as the QAM16 symbols are discrete by definition, their histograms are peaked. This simple rule does not apply, of course, to the gaussian noise. Conversely, the histograms of the observations x_k are spread out. Figure 4 shows the whitened signals, denoted here by v_1, v_2, v_3, and v_4, and the estimated source signals obtained through the neural learning algorithm based on the extended Hebbian learning rule described in this section, denoted by y_1, y_2, y_3, and y_4. The histograms of the moduli of the complex-valued signals y_1, y_2, y_3 look close to peaks, as predicted by the theory.

The main contribution of the Sudjianto-Hassoun principle-based learning theory to ICA is to allow designing the proper structure of the nonlinear part of the neurons, provided that some information is known about the statistical structure of the signals that the source signals differ from. This principle may prove very useful in blind signal processing, where the statistical structure of the involved signals is not known in advance, but the statistical structure
Figure 3: (Second row) Source signals s1 , s2 , s3 , and s4 (three QAM16 signals and a complex gaussian noise). (Third row) Their four linear superpositions x1 , x2 , x3 , and x4 . (First row) Histograms of the moduli |sk |. (Fourth row) Histograms of the moduli |xk |.
of signals that differ from the useful ones, such as noises, might be easily known.

2.4 Analysis of the One-Unit Complex-Valued PCA/MCA. The learning rule in equation 2.17 for a single neuron in the quadratic complex-valued case particularizes to the adapting law:

dw/dt = E[y^* x − |y|² w],  (2.33)
with x and w belonging to C^p. By defining the Hermitian covariance matrix Σ ≝ E[x x^H], this differential equation rewrites as

dw/dt = Σw − (w^H Σ w) w.
(2.34)
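An Euler discretization of flow 2.34 (our own sketch; step size, dimensions, and seed are illustrative) shows w(t) approaching the dominant eigenvector up to a phase, anticipating theorem 1:

```python
import numpy as np

# Euler discretization of the flow (2.34): dw/dt = C w - (w^H C w) w,
# with C Hermitian.
rng = np.random.default_rng(5)
A = rng.standard_normal((6, 6)) + 1j * rng.standard_normal((6, 6))
C = A @ A.conj().T

w = rng.standard_normal(6) + 1j * rng.standard_normal(6)
w /= np.linalg.norm(w)
dt = 0.002
for _ in range(50000):
    w += dt * (C @ w - (w.conj() @ C @ w) * w)

evals, Q = np.linalg.eigh(C)
q1 = Q[:, -1]                     # eigenvector of the largest eigenvalue
print(abs(w.conj() @ q1))         # close to 1: w equals q1 up to a phase
print(np.linalg.norm(w))          # close to 1
```

Note that the inner product w^H q_1 converges in modulus only: the phase factor e^{iα} of the limit depends on the initial condition, exactly as stated in theorem 1 below.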
Figure 4: (Top row) Whitened observations v1 , v2 , v3 , and v4 . (Middle row) Estimated source signals y1 , y2 , y3 , and y4 obtained through the extended Hebbian learning rule enriched by the Sudjianto-Hassoun principle. (Bottom row) Histograms of the moduli |yk |.
This system has a set of stationary points defined by

E* ≝ {w ∈ C^p : Σw − σw = 0}.
(2.35)
With the exception of the trivial solution w = 0, the elements w ∈ E* coincide with the eigenvectors of Σ. The following proposition states the convergence of the system 2.34 to the eigenvector in E* corresponding to the largest eigenvalue σ. The proof of this theorem follows the arguments successfully used in the real-valued case by other authors (Oja, 1992; Sanger, 1989).

Theorem 1. Let us suppose Σ Hermitian in equation 2.34, with eigenpairs (σ_1, q_1), (σ_2, q_2), ..., (σ_p, q_p). Suppose then that the eigenvalues are distinct and
arranged in descending order, and that the eigenvectors are normalized so that q_k^H q_k = 1 and w(0)^H q_1 ≠ 0. Then there holds

lim_{t→+∞} w(t) = q_1 e^{iα},
where α is arbitrary in [0, 2π].

Proof. Let us expand the vector w(t) by means of the system's eigenbasis (Haykin, 1998; Sanger, 1989), which means writing

w(t) = θ_1(t) q_1 + θ_2(t) q_2 + ··· + θ_p(t) q_p,
(2.36)
where the scalar functions θ_k(t) ∈ C are termed principal modes. Plugging equation 2.36 into equation 2.34 yields

∑_{h=1}^p (dθ_h(t)/dt) q_h = ∑_{h=1}^p θ_h(t) Σ q_h − [∑_{k=1}^p θ_k(t) q_k]^H Σ [∑_{r=1}^p θ_r(t) q_r] ∑_{h=1}^p θ_h(t) q_h.

By recalling the property Σ q_k = σ_k q_k, we have

∑_{h=1}^p (dθ_h(t)/dt) q_h = ∑_{h=1}^p θ_h(t) σ_h q_h − ∑_{k=1}^p ∑_{r=1}^p θ_k^*(t) θ_r(t) σ_r q_k^H q_r ∑_{h=1}^p θ_h(t) q_h.

Now, from the fact that q_k^H q_ℓ = δ_kℓ, it follows that the differential equations for the h:th principal modes, with h ≥ 2, read

θ̇_h(t) = θ_h(t) σ_h − ∑_{k=1}^p |θ_k(t)|² σ_k θ_h(t), h = 2, ..., p,  (2.37)
while it is particularly useful to write in a separate equation the differential law pertaining to the first principal mode θ1 (t), that is, θ˙ 1 (t) = θ1 (t)σ1 − |θ1 (t)|2 θ1 (t)σ1 −
p
|θk (t)|2 σk θ1 (t) .
(2.38)
k=2
Principal modes θh (t) are complex-valued quantities, and their real and (R) (I ) imaginary parts are denoted here as θh (t) and θh (t), respectively. Our aim is now to solve differential system 2.37 + 2.38, which is a coupled nonlinear system of differential equations. In order to decouple the nonlinear differential subsystem 2.37, let us define the new functions: (R)
def
φh (t) =
(R)
θh (t) (R)
θ1 (t)
(I )
def
, φh (t) =
(I )
θh (t) (I )
θ1 (t)
, h = 2, . . . , p .
Nonlinear Complex-Valued Extensions of Hebbian Learning
803
For the real part of the h-th principal mode, there holds

$$\frac{d}{dt}\phi_h^{(R)}(t) = \frac{\dot{\theta}_h^{(R)}(t)\,\theta_1^{(R)}(t) - \theta_h^{(R)}(t)\,\dot{\theta}_1^{(R)}(t)}{\theta_1^{(R)}(t)^2}, \tag{2.39}$$

and the system 2.37 + 2.38, particularized for the real parts, reads

$$\dot{\theta}_h^{(R)}(t) = \theta_h^{(R)}(t)\sigma_h - \sum_{k=1}^{p}|\theta_k(t)|^2\sigma_k\,\theta_h^{(R)}(t), \quad h = 2,\ldots,p, \tag{2.40}$$

$$\dot{\theta}_1^{(R)}(t) = \theta_1^{(R)}(t)\sigma_1 - \sum_{k=1}^{p}|\theta_k(t)|^2\sigma_k\,\theta_1^{(R)}(t). \tag{2.41}$$

By using equations 2.40 and 2.41 within equation 2.39, direct calculations show that the dynamics of the state variables φ_h^{(R)}(t) looks like

$$\dot{\phi}_h^{(R)}(t) = (\sigma_h - \sigma_1)\,\phi_h^{(R)}(t), \quad h = 2,\ldots,p.$$
A similar formula holds for φ_h^{(I)}(t). Thus, we have proven that

$$\theta_h^{(R)}(t) = \phi_h^{(R)}(0)\,e^{(\sigma_h-\sigma_1)t}\,\theta_1^{(R)}(t), \tag{2.42}$$

$$\theta_h^{(I)}(t) = \phi_h^{(I)}(0)\,e^{(\sigma_h-\sigma_1)t}\,\theta_1^{(I)}(t). \tag{2.43}$$

The above formulas show that subsystem 2.37 has been successfully decoupled, since the dynamics of each principal mode no longer depends on the other modes, apart from the first one. It is useful to note that from equations 2.42 and 2.43, the expression below follows:

$$|\theta_k(t)|^2 = \theta_k^{(R)}(t)^2 + \theta_k^{(I)}(t)^2 = e^{2(\sigma_k-\sigma_1)t}\left[\phi_k^{(R)}(0)^2\,\theta_1^{(R)}(t)^2 + \phi_k^{(I)}(0)^2\,\theta_1^{(I)}(t)^2\right].$$

Now the aim is to solve the differential equation 2.38 for θ_1(t). It is worth noting that it slightly simplifies if the following variable change is performed: θ_1(t) = c(t)e^{σ_1 t}, with c(t) ∈ C. After this change, we have

$$|\theta_1(t)|^2 = |c(t)|^2 e^{2\sigma_1 t}, \qquad |\theta_h(t)|^2 = e^{2\sigma_h t}\left[\phi_h^{(R)}(0)^2 c^{(R)}(t)^2 + \phi_h^{(I)}(0)^2 c^{(I)}(t)^2\right],$$
where c (R) (t) and c (I ) (t) stand, respectively, for the real part and the imaginary part of the function c(t). Then differential equation 2.38 becomes c(t) ˙ = − |c(t)|2 e 2σ1 t σ1 c(t) −
p
(R)
(I )
e 2σk t [φk (0)2 c (R) (t)2 + φk (0)2 c (I ) (t)2 ]σk c(t) .
(2.44)
k=2
By defining the auxiliary quantities, def
η(R) (t) = −
p
(R)
(2.45)
(I )
(2.46)
e 2σk t σk φk (0)2 − σ1 e 2σ1 t ,
k=2 def
η(I ) (t) = −
p
e 2σk t σk φk (0)2 − σ1 e 2σ1 t ,
k=2
the differential equations for the real and the imaginary parts of c(t) rewrite compactly:
c˙ (R) (t) = η(R) (t)c (R) (t)3 + η(I ) (t)c (I ) (t)2 c (R) (t) , c˙ (I ) (t) = η(I ) (t)c (I ) (t)3 + η(R) (t)c (I ) (t)c (R) (t)2 .
(2.47)
Since η^{(R)}(t) < 0, η^{(I)}(t) < 0, and at least c(0) = w(0)^H q_1 ≠ 0, two cases should be considered.

CASE 1: Only one of the terms c^{(R)}(0) and c^{(I)}(0) differs from zero. In this case, system 2.47 simplifies into ċ(t) = η(t)c(t)^3, where η(t) stands for η^{(R)}(t) or η^{(I)}(t). The above differential equation may be solved by

$$\int_{c(0)}^{c(t)}\frac{dc}{c^3} = \int_0^t \eta(\tau)\,d\tau \;\Longrightarrow\; \frac{1}{c(t)^2} = \frac{1}{c(0)^2} - 2\int_0^t \eta(\tau)\,d\tau.$$

This readily leads to

$$\frac{1}{\theta_1(t)^2} = \frac{e^{-2\sigma_1 t}}{\theta_1(0)^2} - 2e^{-2\sigma_1 t}\int_0^t \eta(\tau)\,d\tau.$$

From definitions 2.45 and 2.46, it can be seen that under the condition σ_1 ∉ {σ_2, ..., σ_p}, there holds

$$\lim_{t\to+\infty} 2e^{-2\sigma_1 t}\int_0^t \eta(\tau)\,d\tau = -1. \tag{2.48}$$
This implies the conclusion

$$\lim_{t\to+\infty}\frac{1}{\theta_1(t)^2} = 1. \tag{2.49}$$
CASE 2: c^{(R)}(0) ≠ 0 and c^{(I)}(0) ≠ 0. The hypotheses imply c^{(R)}(t) ≠ 0 and c^{(I)}(t) ≠ 0. Thus, it is possible to multiply both sides of equations 2.47 by the factors 2/c^{(R)}(t) and 2/c^{(I)}(t), respectively. This way we have

$$\frac{d\log c^{(R)}(t)^2}{dt} = \frac{d\log c^{(I)}(t)^2}{dt} = 2\left[\eta^{(R)}(t)c^{(R)}(t)^2 + \eta^{(I)}(t)c^{(I)}(t)^2\right]. \tag{2.50}$$

The first equality in the chain tells us that log c^{(R)}(t)^2 = log c^{(I)}(t)^2 − log κ, where κ is a positive constant determined by the initial conditions. Equivalently, we have c^{(I)}(t)^2 = κ c^{(R)}(t)^2. This means that equation 2.50 can be easily solved, in that there holds

$$\frac{d\log c^{(R)}(t)^2}{dt} = 2\left[\eta^{(R)}(t) + \kappa\,\eta^{(I)}(t)\right]c^{(R)}(t)^2.$$

The latter equation is equivalent to ċ^{(R)}(t) = [η^{(R)}(t) + κη^{(I)}(t)]c^{(R)}(t)^3. Again we have

$$\frac{1}{\theta_1^{(R)}(t)^2} = \frac{e^{-2\sigma_1 t}}{\theta_1^{(R)}(0)^2} - 2e^{-2\sigma_1 t}\int_0^t\left[\eta^{(R)}(\tau) + \kappa\,\eta^{(I)}(\tau)\right]d\tau, \tag{2.51}$$

$$\frac{1}{|\theta_1(t)|^2} = \frac{1}{1+\kappa}\,\frac{1}{\theta_1^{(R)}(t)^2}, \qquad \kappa = \frac{\theta_1^{(I)}(0)^2}{\theta_1^{(R)}(0)^2}. \tag{2.52}$$
By using the result 2.48 twice, it may be shown that

$$\lim_{t\to+\infty} 2e^{-2\sigma_1 t}\int_0^t\left[\eta^{(R)}(\tau) + \kappa\,\eta^{(I)}(\tau)\right]d\tau = -(1+\kappa).$$

Thus, we arrive at the conclusion

$$\lim_{t\to+\infty}\frac{1}{|\theta_1(t)|^2} = 1. \tag{2.53}$$

The results 2.49 and 2.53 also imply

$$\lim_{t\to+\infty}|\theta_k(t)| = 0, \quad k = 2, 3, \ldots, p.$$
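The convergence stated by theorem 1 is easy to check numerically. The sketch below is illustrative only (it is not from the paper): the covariance matrix, step size, and iteration count are arbitrary choices. It Euler-integrates the flow ẇ = Σw − (w^HΣw)w of equation 2.34 with a quadratic g and verifies that w aligns with the dominant eigenvector q_1 up to a phase factor.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical Hermitian covariance with distinct eigenvalues (sigma_1 = 3
# dominant), built as Q diag(3, 2, 1) Q^H with a random unitary Q.
A = rng.standard_normal((3, 3)) + 1j * rng.standard_normal((3, 3))
Q, _ = np.linalg.qr(A)
Sigma = Q @ np.diag([3.0, 2.0, 1.0]) @ Q.conj().T

# Euler integration of dw/dt = Sigma w - (w^H Sigma w) w.
w = rng.standard_normal(3) + 1j * rng.standard_normal(3)
w /= np.linalg.norm(w)
dt = 0.01
for _ in range(5000):
    w = w + dt * (Sigma @ w - (w.conj() @ Sigma @ w).real * w)

q1 = Q[:, 0]                      # dominant eigenvector
alignment = abs(np.vdot(q1, w))   # phase-invariant overlap |q1^H w|
print(alignment)                  # expected close to 1
```

The phase factor e^{iα} of the theorem is invisible to the modulus |q_1^H w|, which is why the check uses the absolute value of the inner product.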
It is worth mentioning that the assumption that all eigenvalues are distinct, while reasonable, for example, in signal processing tasks, is not crucial. The crucial assumption is that the first eigenvalue differs from the others, as suggested by the fact that the result 2.49 holds under the condition σ_1 ∉ {σ_2, ..., σ_p}. In this case, a proof of a similar consistency result may be based on the convergence proof for the real-valued case provided in theorem 3.3 of Helmke and Moore (1993).

When the function g(·) defined in equation 2.6 is not quadratic, it is likely that the neural unit just studied behaves differently from Oja's neuron. A study of what happens in the general nonquadratic case for a real-weighted neuron has been carried out by Oja (1994). The analysis is extended here to the complex case. To begin, let us define the following quantities,

$$L(w) \stackrel{\mathrm{def}}{=} E\!\left[g'(|y|)\frac{\bar{y}}{|y|}\,x\right] = E\!\left[G(|w^Hx|)(xx^H)\right]w, \tag{2.54}$$

$$\mathcal{G}(w) \stackrel{\mathrm{def}}{=} E\!\left[g'(|y|)|y|\right] = E\!\left[G(|w^Hx|)|w^Hx|^2\right], \tag{2.55}$$
with G(u) defined again as in section 2.1. Note that L(w) ∈ C^p, while 𝒢(w) ∈ C. Then learning rule 2.17 rewrites as

$$\frac{dw}{dt} = L(w) - w\,\mathcal{G}(w), \tag{2.56}$$

as in Oja (1994). The stationary points of system 2.56 are among the elements of the set

$$\mathcal{W}^* \stackrel{\mathrm{def}}{=} \{w \in \mathbb{C}^p : L(w) - w\,\mathcal{G}(w) = 0\}. \tag{2.57}$$

Let us suppose that g'(u) has null even derivatives at the origin, as in Oja (1994). Thus, the MacLaurin expansion of G(u) looks like

$$G(u) = g^{(2)}(0) + \frac{1}{6}u^2 g^{(4)}(0) + \text{higher-order terms (h.o.t.)}.$$

Replacing the expanded function G(u) into definitions 2.54 and 2.55 gives

$$L(w) = \Sigma w\,g^{(2)}(0) + \frac{1}{6}\,E\!\left[(w^Hxx^Hw)(xx^H)\right]w\,g^{(4)}(0) + \text{h.o.t.},$$

$$\mathcal{G}(w) = w^H\Sigma w\,g^{(2)}(0) + \frac{1}{6}\,E\!\left[|w^Hx|^4\right]g^{(4)}(0) + \text{h.o.t.},$$

where Σ again denotes the network input covariance matrix.
Therefore, a neuron model with a sigmoidal activation function G(·) instead of a linear one introduces higher-order statistics into the learning equations. This can be seen by looking at the set of stationary points 𝒲*, which coincides with E* only in a first-order approximation, provided that the |g^{(n)}(0)| are small enough for n > 2. Therefore, in general, the vectors in 𝒲* deviate from the principal directions in E* and thus also depend on statistical structures of higher order than the covariance.

A widely known problem in the scientific literature is that extending PCA theory to minor component analysis theory is not a straightforward task. This is true even in the complex domain. In particular, it is not possible to replace the maximization of criterion 2.4, J(w) = U(w) + L(w), with respect to w, with its minimization, in that the resulting learning rule,

$$\frac{dw}{dt} = -\frac{\partial J}{\partial w} = E\!\left[-\bar{y}x + |y|^2 w\right], \tag{2.58}$$
is unstable. The proof of the following proposition follows from the proof of the corresponding real-valued case presented in Fiori (2003b) and Oja (1992).

Theorem 2. Let Σ ∈ C^{p×p} be a Hermitian matrix with eigenpairs (σ_1, q_1), (σ_2, q_2), ..., (σ_p, q_p). Let us suppose that the eigenvalues are distinct and arranged in descending order, and that the eigenvectors are normalized so that q_k^H q_k = 1. Then the system ẇ = −Σw + (w^HΣw)w with w(0)^H q_1 ≠ 0 becomes unstable in finite time.

Proof. The proof essentially follows that of theorem 1. Let us consider the principal-mode dynamics:

$$\dot{\theta}_h(t) = -\theta_h(t)\sigma_h + \sum_{k=1}^{p}|\theta_k(t)|^2\sigma_k\,\theta_h(t), \quad h = 1,\ldots,p-1,$$

$$\dot{\theta}_p(t) = -\theta_p(t)\sigma_p + \sum_{k=1}^{p-1}|\theta_k(t)|^2\sigma_k\,\theta_p(t) + |\theta_p(t)|^2\theta_p(t)\sigma_p.$$

Let us now define the pseudo-modes,

$$\phi_h^{(R)}(t) \stackrel{\mathrm{def}}{=} \frac{\theta_h^{(R)}(t)}{\theta_p^{(R)}(t)}, \qquad \phi_h^{(I)}(t) \stackrel{\mathrm{def}}{=} \frac{\theta_h^{(I)}(t)}{\theta_p^{(I)}(t)}, \qquad h = 1,\ldots,p-1,$$
where the superscripts (R) and (I) denote, as usual, the real part and the imaginary part of a complex-valued variable. By means of the pseudo-modes, it is possible to show that

$$\theta_h^{(R)}(t) = \phi_h^{(R)}(0)\,e^{-(\sigma_h-\sigma_p)t}\,\theta_p^{(R)}(t), \qquad \theta_h^{(I)}(t) = \phi_h^{(I)}(0)\,e^{-(\sigma_h-\sigma_p)t}\,\theta_p^{(I)}(t).$$

Introducing the complex-valued auxiliary function c(t) so that θ_p(t) = c(t)e^{−σ_p t}, it is possible to find the resolving differential equation:

$$\dot{c}(t) = |c(t)|^2 e^{-2\sigma_p t}\sigma_p c(t) + \sum_{k=1}^{p-1} e^{-2\sigma_k t}\left[\phi_k^{(R)}(0)^2 c^{(R)}(t)^2 + \phi_k^{(I)}(0)^2 c^{(I)}(t)^2\right]\sigma_k c(t).$$

Again, it is useful to define the functions

$$\eta^{(R)}(t) \stackrel{\mathrm{def}}{=} \sum_{k=1}^{p-1} e^{-2\sigma_k t}\sigma_k\phi_k^{(R)}(0)^2 + \sigma_p e^{-2\sigma_p t},$$

$$\eta^{(I)}(t) \stackrel{\mathrm{def}}{=} \sum_{k=1}^{p-1} e^{-2\sigma_k t}\sigma_k\phi_k^{(I)}(0)^2 + \sigma_p e^{-2\sigma_p t}.$$

By means of these functions, it is straightforward to show that the real part and the imaginary part of c(t) satisfy system 2.47. Thus, the solution has the form 2.52. Now the time integrals of η^{(R)} and η^{(I)} are explicitly required. The first one is

$$2e^{2\sigma_p t}\int_0^t \eta^{(R)}(\tau)\,d\tau = e^{2\sigma_p t}\sum_{k=1}^{p-1}\phi_k^{(R)}(0)^2\left(1 - e^{-2\sigma_k t}\right) + e^{2\sigma_p t} - 1.$$

The second integral has a similar form; thus, we may ultimately write

$$\frac{1}{|\theta_p(t)|^2} = B - A(t)\,e^{2\sigma_p t},$$

where B is a positive constant and A(t) is an increasing function of time. Therefore, depending on the initial conditions, a finite time t̄ exists such that

$$\frac{1}{|\theta_p(\bar{t}\,)|^2} = 0.$$

Oja (1994) proposed an algorithm for the extraction of generalized minor components from real-valued signals that overcomes the stability problem. The aim of this part is to formalize the stabilization method within an optimization framework and to extend the theory to the complex case.
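The finite-time escape asserted by theorem 2 can be observed directly. The sketch below uses arbitrary illustrative assumptions (eigenvalues 3, 2, 1 and a starting point slightly outside the unit sphere, where the norm dynamics are expansive): it Euler-integrates the unstabilized rule ẇ = −Σw + (w^HΣw)w and watches the weight norm diverge.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hermitian covariance with eigenvalues 3, 2, 1 (arbitrary example data).
A = rng.standard_normal((3, 3)) + 1j * rng.standard_normal((3, 3))
Q, _ = np.linalg.qr(A)
Sigma = Q @ np.diag([3.0, 2.0, 1.0]) @ Q.conj().T

# Unstabilized anti-Hebbian flow dw/dt = -Sigma w + (w^H Sigma w) w,
# started slightly off the unit sphere (||w(0)|| = 1.2).
w = rng.standard_normal(3) + 1j * rng.standard_normal(3)
w *= 1.2 / np.linalg.norm(w)
dt = 0.001
final_norm = np.linalg.norm(w)
for _ in range(20000):
    w = w + dt * (-Sigma @ w + (w.conj() @ Sigma @ w).real * w)
    final_norm = np.linalg.norm(w)
    if final_norm > 1e6:      # trajectory escaping: stop early
        break

print(final_norm)             # diverges instead of settling near 1
```

Along the flow, d‖w‖²/dt = 2(w^HΣw)(‖w‖² − 1), so any initial condition with ‖w(0)‖ > 1 is driven away, which is the "depending on initial conditions" caveat of the proof.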
Let us consider, once again, the problem of minimizing the criterion

$$C(w) \stackrel{\mathrm{def}}{=} E\!\left[g(|w^Hx|)\right] + \lambda(w^Hw - 1), \tag{2.59}$$

with respect to the weight vector w. The expression for its gradient is

$$\frac{\partial C}{\partial w} = E\!\left[G(|y|)\bar{y}x\right] + \lambda w. \tag{2.60}$$

Thus, the optimal multiplier may be found by setting w^H(∂C/∂w) to zero and by solving the resulting equation under the constraint w^Hw = 1, that is, by solving

$$\left(w^H\frac{\partial C}{\partial w}\right)_{\mathrm{opt}} = E\!\left[G(|y|)|y|^2\right] + \lambda_{\mathrm{opt}}(w^Hw) = 0.$$

The central point is now that, as optimality requires w^Hw − 1 = 0, the latter condition is equivalent to

$$E\!\left[G(|y|)|y|^2\right] + \lambda_{\mathrm{opt}}(w^Hw) - \bar{\sigma}(w^Hw - 1) = 0,$$

with σ̄ being an arbitrary constant. By solving the previous equation for λ_opt and replacing the found function in the expression of (∂C/∂w)_opt, we obtain the stabilizable learning rule:

$$\frac{dw}{dt} = -E\!\left[G(|y|)\left(\bar{y}x - |y|^2 w\right)\right] - \bar{\sigma}(w^Hw - 1)w. \tag{2.61}$$

When G(u) = 1, it is possible to prove that the corresponding minor component analysis learning algorithm converges to the expected solution.

Theorem 3. Let Σ ≐ E[xx^H] be the covariance matrix of the random process x(t), with eigenpairs (σ_1, q_1), ..., (σ_p, q_p), and let G(u) = 1 in equation 2.61. Suppose that the eigenvalues are distinct and arranged in descending order, and that the eigenvectors are normalized so that q_k^H q_k = 1. If σ̄ > σ_1, then the state vector w of system 2.61 with w(0)^H q_p ≠ 0 asymptotically converges toward q_p up to a phase factor.

Proof. System 2.61 with G(u) = 1 can be rewritten as ẇ = −(Σ − σ̄I)w + [w^H(Σ − σ̄I)w]w. Define Σ̄ ≐ −(Σ − σ̄I). The eigenvalues of Σ̄ are σ̄ − σ_p > σ̄ − σ_{p−1} > ··· > σ̄ − σ_1 > 0, while its eigenvectors coincide with the eigenvectors of Σ. Thus, theorem 1 applies to the system ẇ = Σ̄w − (w^HΣ̄w)w, allowing the conclusion that w asymptotically converges to the last eigenvector q_p up to a phase factor.
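Theorem 3 lends itself to a quick numerical check. The following sketch is illustrative only, with arbitrarily chosen eigenvalues 3, 2, 1 and the shift σ̄ = 3.5 > σ_1; it Euler-integrates the stabilized rule 2.61 with G(u) = 1 and verifies convergence to the minor eigenvector up to a phase factor.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hermitian covariance with eigenvalues 3, 2, 1; sigma_bar exceeds sigma_1 = 3.
A = rng.standard_normal((3, 3)) + 1j * rng.standard_normal((3, 3))
Q, _ = np.linalg.qr(A)
Sigma = Q @ np.diag([3.0, 2.0, 1.0]) @ Q.conj().T
sigma_bar = 3.5

# Rule 2.61 with G(u) = 1, written with the origin shift:
# dw/dt = -(Sigma - sigma_bar I) w + [w^H (Sigma - sigma_bar I) w] w.
S = Sigma - sigma_bar * np.eye(3)
w = rng.standard_normal(3) + 1j * rng.standard_normal(3)
w /= np.linalg.norm(w)
dt = 0.01
for _ in range(20000):
    w = w + dt * (-S @ w + (w.conj() @ S @ w).real * w)

# w should approach the minor eigenvector q_3 (eigenvalue 1) up to a phase.
alignment = abs(np.vdot(Q[:, 2], w))
print(alignment)                  # expected close to 1
```

With the shift, −S = σ̄I − Σ is positive definite with dominant eigenvector q_p, so the flow is exactly the Oja-type system covered by theorem 1.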
Figure 5: Input data for the first principal and minor component analysis experiment.
It is important to note that this result gives only a sufficient condition for the stability of rule 2.61. The rule embodies a modification that is known as origin shift in linear algebra and that, in this context, has been shown to arise naturally from the Lagrange multiplier method.

As a numerical example, let us consider an input random process x ∈ C^3 whose covariance matrix has eigenvalues σ_1 = 3, σ_2 = 2, and σ_3 = 1. Such a signal is illustrated in Figure 5. Running the learning rule 2.33, which allows extracting the first principal component from the data, one expects that the first principal mode (θ_1) tends to one (up to a phase factor), while the second and third principal modes tend to zero; the principal modes were defined in the proof of theorem 1. These results are confirmed by the left-hand panel of Figure 6. Furthermore, we ran the learning rule 2.61 on the same data set in order to extract the first minor component. We tried first with σ̄ = 0, which amounts to using the nonstabilized minor component analyzer: simulations show that the rule quickly becomes unstable. Then we tried σ̄ = 1, as in Oja (1994). Even in this case, the rule looks unstable. Finally, we used the sufficient condition provided by theorem 3: since the largest eigenvalue is 3, we chose σ̄ = 3.5. Simulation results are shown in
Figure 6: Evolution of the moduli of the principal modes. (Left) First principal component analysis case. (Right) First minor component analysis case.
the right-hand panel of Figure 6. As expected, the first two principal modes converge to zero, while the third mode approaches the value 1.

2.5 Notes on Multiunit Complex-Valued Component and Subspace Networks. By using the fundamental result found by Oja (1982) about the convergence of the neural algorithm allowing the extraction of the first principal component from a real-valued random process x(t), Sanger (1989) was able to prove the convergence of the generalized Hebbian algorithm (GHA). In particular, Sanger proved that the differential system

$$\frac{dW}{dt} = \Sigma W - W\,\mathrm{UT}(W^T\Sigma W),$$

where Σ ≐ E[xx^T], W ∈ R^{p×m}, and UT(·) denotes the upper-triangular part of its matrix argument, extracts the first m eigenvectors of the covariance matrix Σ up to a sign factor. In the same way, on the basis of theorem 1, it would not be difficult to show that the multiunit learning rule 2.18, in its hierarchical version, allows extracting the first m complex eigenvectors up to a phase factor.
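As an illustration of the multiunit case, the sketch below integrates the complex analogue dW/dt = ΣW − W·UT(W^HΣW). This is the natural counterpart suggested by the text, not a rule quoted verbatim from the paper, and all parameters (dimensions, eigenvalues, step size) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
p, m = 5, 2

# Hermitian covariance with distinct eigenvalues (illustrative choice).
A = rng.standard_normal((p, p)) + 1j * rng.standard_normal((p, p))
Q, _ = np.linalg.qr(A)
Sigma = Q @ np.diag([5.0, 4.0, 3.0, 2.0, 1.0]) @ Q.conj().T

# Euler integration of a complex GHA-type flow:
# dW/dt = Sigma W - W UT(W^H Sigma W), with UT(.) = np.triu (upper triangle).
W = (rng.standard_normal((p, m)) + 1j * rng.standard_normal((p, m))) * 0.1
dt = 0.005
for _ in range(20000):
    M = W.conj().T @ Sigma @ W
    W = W + dt * (Sigma @ W - W @ np.triu(M))

# Each column should align, up to a phase factor, with an eigenvector.
a1 = abs(np.vdot(Q[:, 0], W[:, 0]))
a2 = abs(np.vdot(Q[:, 1], W[:, 1]))
print(a1, a2)                     # both expected close to 1
```

Column k of W·UT(M) is the deflation term Σ_{j≤k} w_j(w_j^HΣw_k), so column 1 runs a plain Oja unit while column 2 sees the data with the first component removed.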
Rule 2.17 was derived via the standard Lagrange multiplier method, which is used to enforce the orthogonality constraints on the weight vectors. However, the multipliers guarantee that the constraints are satisfied only at equilibrium, and in general not during system evolution. (A detailed analysis of the drawbacks inherent in the use of Lagrange multiplier methods for orthogonality constraints has been proposed recently in Douglas, Amari, & Kung, 2000.) In order to illustrate this effect more clearly in the case of nonlinear complex-valued component or subspace learning, let us reformulate the orthogonality constraints by borrowing some concepts from differential geometry. (A good reference book is Helmke & Moore, 1993.) Let us define the complex-valued compact Stiefel manifold as

$$\mathrm{St}(p, m, \mathbb{C}) \stackrel{\mathrm{def}}{=} \{A \in \mathbb{C}^{p\times m} \mid A^H A = I_m\}.$$

Now it is clear that the requirement that the connection pattern W at learning equilibrium be an orthonormal matrix may be equivalently formulated as W ∈ St(p, m, C). As the learning rules of interest here are expressed in terms of differential equations of the kind dw_k/dt = F_k, with F_k being a proper learning vector field for every abstract neuron, it is convenient to express the orthogonality constraints in terms of differential properties. This may be done by invoking the concept of the tangent space of the Stiefel manifold at a point, namely,

$$T_W\mathrm{St}(p, m, \mathbb{C}) = \{V \in \mathbb{C}^{p\times m} \mid V^H W + W^H V = 0_m\}.$$

In terms of single weight vectors, in order for the connection pattern to belong to the Stiefel manifold, they should satisfy
$$\left(\frac{dw_k}{dt}\right)^{H} w_h + w_k^H\,\frac{dw_h}{dt} = 0 \tag{2.62}$$

for every pair (k, h) ∈ [1, ..., m]². In order to verify whether this property is satisfied, it is sufficient to replace the expressions 2.17 in equations 2.62. This would lead to the following equations:

$$E\!\left[G(|y_k|)y_k x^H\right]P_k^H w_h + w_k^H P_h E\!\left[G(|y_h|)\bar{y}_h x\right] = 0.$$

It is now straightforward to find that in the symmetric network case, the projector operator is such that P_k^H w_h = 0 for every pair of indices (k, h), while in the hierarchical case, P_k^H w_h = 0 for k ≥ h and P_k^H w_h = w_h for k < h. As a consequence, when k < h, condition 2.62 becomes

$$E\!\left[G(|y_k|)y_k\bar{y}_h + G(|y_h|)\bar{y}_h y_k\right] = 0. \tag{2.63}$$
From this result, it is necessary to conclude that in the symmetrical (generalized principal or minor subspace analysis) case, it is true that W(t) ∈ St(p, m, C) at any time, while in the hierarchical (generalized principal or minor component analysis) case, in general W(t) ∉ St(p, m, C) whenever the algorithm is out of equilibrium. It is worth noting, however, that in any case it is guaranteed that the optimization of the weight vectors w_k happens on a unit-radius hypersphere. In fact, in either the symmetrical or the hierarchical case, it is obtained from conditions 2.62 that w_k^H(dw_k/dt) = 0, which implies that ||w_k||² is constant (equal to one, if the weight vectors are initialized to have unitary norm). Ultimately, this means that vector orthogonality is progressively achieved by learning and is not guaranteed at all times. This is a drawback of the Lagrange multiplier method. On the basis of general knowledge and of the results obtained in the preceding sections (especially section 2.4), we may summarize these drawbacks as follows:
• Generally, the method of multipliers consists in embedding the constraints under which the optimization of a criterion function should be performed into the criterion function itself, by adding a properly constructed function to the original criterion. From a geometrical point of view, it seems unnatural to modify the objective function instead of restricting it to the set of feasible configurations or variable values.

• As a consequence of this construction, the constraints should be thought of as a sort of subtarget to be achieved in parallel with the optimization of the criterion function. Therefore, provided the employed unconstrained optimization algorithm is able to perform the optimization of the augmented criterion, the constraints are fulfilled only when the optimization is completed and not, in general, at every intermediate time.

• The modification of the criterion function may cause undesired effects, such as the appearance of local extremes that might make the optimization algorithm unable to solve the learning task or might make the learning progress slower than expected.

• The general idea behind constrained optimization by the Lagrange multiplier method is to replace the constrained optimization flow with an unconstrained optimization flow (that takes place, for example, in the Euclidean space R^p) by properly embedding the constraints into an augmented learning criterion. It is worth noting that the space in which the optimum search takes place plays an important role in determining the behavior of the gradient-based learning system. As evidenced in section 2.4, for instance, a learning differential equation on R^p may be unstable even if it was derived from strong constraints such as the normality of the weight vectors. Defining learning differential equations
on compact manifolds (such as the Stiefel one) instead may help overcome this serious difficulty. A detailed treatment of the Lagrange multiplier method may be found in Bertsekas (1996).

In the nonquadratic case, moreover, it is difficult to give general results on learning capabilities. Some recent results for the real-valued case, based on the design of appropriate Lyapunov functions for the system equivalent to equation 2.56, have been introduced in Vegas and Zufiria (2004). These results are valid in the case of homogeneous nonlinearities L(w). It is worth mentioning that for the case in which nonlinear complex-valued PCA is used for blind source separation of telecommunication-type source signals with a Sudjianto-Hassoun nonlinearity, a convergence theorem was given in Fiori (2003c).

Following the preceding discussion of multiunit artificial networks, it also appears appropriate to discuss briefly the important topic of the stability of learning equations in view of computer-based implementation. It is worth recalling the distinction between the related concepts of a learning differential equation and a learning algorithm. A learning differential equation arises when a learning trajectory is considered in the weight space and the learnable variables of an artificial neural network are parameterized through a free variable (normally the time) that locates a network configuration along a learning trajectory. In the general gradient-based learning setting, for instance, the learning trajectory cannot be written in explicit form, and it is specified through a differential equation (often of the first order) whose intuitive meaning is to specify the velocity field associated with learning a specific task. The solution (integration) of such a differential equation (provided an initial configuration is specified) gives rise to the network's learning trajectory, formally termed a gradient flow. A learning differential equation may be solved in closed form only very rarely.
Normally, it is necessary to approximate the gradient flow numerically through an iterative algorithm. This corresponds to sampling the time variable and trying to obtain a close approximation of the exact learning trajectory. The obtained discrete-time learning equation is what is normally referred to as a learning algorithm. The passage from a generic learning differential equation to an associated learning algorithm is not unique, as it may be performed in several different ways, and it is not consequence free: depending on the chosen time-discretization method, the salient features of the differential equation may be retained or lost. In particular, it is worth discussing here the effect of time discretization on two salient features: orthogonality preservation and learning stability:
• Euler discretization. The simplest discretization method consists in replacing the time derivatives with incremental ratios. This method is widely known as the Euler integration technique. For instance, the derivative in equation 2.3 may be approximated as

$$\frac{dw(t)}{dt} \approx \frac{w(t+T) - w(t)}{T},$$

where T denotes the width of the time interval within which the derivative is approximated. The Euler algorithm is well suited to a linear space, such as the Euclidean manifolds C^p or C^{p×m}, while the Stiefel manifold St(p, m, C) is a curved space, in which it is impossible to move from one network configuration to another through vector or matrix addition. This means that the Euler method does not preserve the orthonormality of a network's connection matrix, and it is not guaranteed that the learning system eventually converges to a configuration belonging to St(p, m, C), or that the network connection stays in the vicinity of such a manifold for a sufficiently long time, which may ultimately affect the stability of the learning algorithm. These problems have been widely investigated by Yan, Helmke, and Moore (1994) and Zufiria (2002). Fiori and Piazza (2000) showed that by properly selecting the learning step size, it is possible to obtain a convergent learning system. Chen, Amari, and Lin (1998) and Chen and Amari (2001) also investigated the existence of a Stiefel-manifold attractor for minor component or subspace analysis algorithms. It is worth recalling that, as already noted in section 2.4, the minor component and subspace analysis algorithms appear to be the most problematic with regard to dynamical stability.

• Geometric integration. A good discretization method in the time domain should allow converting the learning differential equations into discrete-time algorithms so as to retain (up to reasonable precision) the geometrical properties that characterize the developed learning rules. From a numerical point of view, the solution of matrix-type differential equations on curved manifolds has been widely investigated in the context of geometrical integration (see, e.g., Hairer, Lubich, & Wanner, 2002, and Iserles, Munthe-Kaas, Nørsett, & Zanna, 2000), a recent branch of numerical analysis. The traditional effort of numerical analysis and computational mathematics has been to render physical phenomena into algorithms that produce sufficiently affordable, precise, and robust numerical approximations. Geometrical integration is also concerned with producing numerical approximations that preserve the qualitative attributes of the solution to the extent possible. In the context of neural learning under orthogonality constraints, some efforts have been devoted to developing learning algorithms over the group of orthogonal matrices and over the Stiefel manifold. These consist of Riemannian gradient-based and non-gradient-based learning algorithms that strongly bind the connection patterns to the Stiefel manifold. (For recent reviews, refer to Celledoni & Fiori, 2004, and Fiori, 2001, 2002.) One of the advantages offered by geometrical integration over compact manifolds (such as the orthogonal group and the Stiefel manifold) is intrinsic stability: whether or not the algorithm converges to the expected solution, the network connection matrix always keeps a bounded norm.

• Fixed-point algorithms. Recently, preliminary results on fast batch-type fixed-point iterations for nonlinear Hebbian learning with orthonormality constraints have been illustrated in Fiori (2003d, 2004, in press). The extension of this class of algorithms to the complex domain, as well as fundamental issues such as convergence and computational-complexity reduction, are still open problems.
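The contrast between plain Euler stepping and a manifold-respecting scheme can be made concrete. In the sketch below (illustrative assumptions throughout: a random Hermitian Σ with preset eigenvalues, and a QR re-orthonormalization standing in for a true geometric integrator), Euler steps of the symmetric subspace flow drift away from St(p, m, C), while the retracted iteration stays on it to machine precision.

```python
import numpy as np

rng = np.random.default_rng(4)
p, m = 6, 3

# Hermitian covariance with preset eigenvalues (illustrative choice).
A = rng.standard_normal((p, p)) + 1j * rng.standard_normal((p, p))
Q, _ = np.linalg.qr(A)
Sigma = Q @ np.diag([3.0, 2.5, 2.0, 1.5, 1.0, 0.5]) @ Q.conj().T

# Start from an orthonormal frame W0 in St(p, m, C).
W0, _ = np.linalg.qr(rng.standard_normal((p, m)) + 1j * rng.standard_normal((p, m)))

def departure(W):
    """Frobenius distance of W^H W from the identity I_m."""
    return np.linalg.norm(W.conj().T @ W - np.eye(m))

def flow(W):
    """Symmetric (subspace) learning field: Sigma W - W (W^H Sigma W)."""
    return Sigma @ W - W @ (W.conj().T @ Sigma @ W)

dt, steps = 0.05, 20

W_euler = W0.copy()
for _ in range(steps):
    W_euler = W_euler + dt * flow(W_euler)             # plain Euler: leaves the manifold

W_geo = W0.copy()
for _ in range(steps):
    W_geo, _ = np.linalg.qr(W_geo + dt * flow(W_geo))  # Euler step + QR retraction

print(departure(W_euler), departure(W_geo))
```

The QR factorization used here is only the simplest retraction; the geometric integrators cited in the text achieve the same invariance by construction rather than by projection.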
3 Complex-Valued Hebbian Learning by Nonquadratic Reconstruction Error Minimization

A second philosophy known in the literature for formulating Hebbian learning involves the concept of reconstruction error minimization. Let a stationary multivariate random signal x(t) of size p, whose covariance matrix has distinct eigenvalues, be expanded by means of a basis of size m < p given by {w_1, w_2, ..., w_m}. This means that a new random signal x̂(t) may be defined so that x̂ = y_1w_1 + y_2w_2 + ··· + y_mw_m. This is termed the reconstruction formula, where the coefficients y_k are defined, in the presence of complex-valued signals, as the projections y_k = w_k^H x; the projection expressions are termed analysis formulas. It is clear that, as m < p, in general x̂(t) ≠ x(t), and the difference between the original and the reconstructed version of x may be conceived as the reconstruction error (Karhunen & Joutsensalo, 1995; Xu, 1992).

In the following sections, we consider nonquadratic complex-valued reconstruction error minimization for a symmetrical network, which leads to a generalized principal subspace analysis theory, and the component-wise nonquadratic reconstruction error minimization principle for a hierarchical network, which leads to a generalized principal component analysis learning theory. We also discuss the problem of nonlinearity selection by recalling interesting statistical interpretations of nonquadratic learning criteria.

The rationale for considering nonquadratic reconstruction error optimization was given by O'Leary (1990) and Liano (1996) in the artificial neural field. Liano's contribution was limited to supervised neural learning for modeling purposes, but it remains valid for the present unsupervised (linear) data modeling treatment. In particular, Liano observed that most neural networks are trained by minimizing the mean squared error pertaining
to the given training set. In the presence of outliers, the resulting neural model can differ significantly from the system that generated the data or from the intended abstract model. The purpose of Liano's research was to introduce a robust approach that can minimize the influence of gross errors on the accuracy of the neural model. Two different approaches were used in Liano (1996) to study the mechanism by which outliers affect the resulting models: the influence function approach and the maximum likelihood approach. The same approaches were followed by Song et al. (1998) and are also partially followed here (in fact, the maximum likelihood approach is replaced here by the maximum entropy method). The contents of this section may be outlined as follows:

• Learning principal and minor subspaces by minimization of a nonquadratic complex-valued reconstruction error

• Choosing the nonlinear criteria via the Song-Yilong-Feng interpretation of Hebbian learning, which is largely based on Liano's research in the context of supervised multilayer perceptron learning

• Learning principal and minor components by component-wise nonquadratic reconstruction error minimization

• Illustrative numerical simulation results on the robustness of the considered learning theories
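As a minimal worked example of the analysis and reconstruction formulas above (synthetic data; the basis is just a random orthonormal frame, not one learned by any rule from the paper), the residual e = x − x̂ is orthogonal to every basis vector w_k:

```python
import numpy as np

rng = np.random.default_rng(5)
p, m = 5, 2

# Orthonormal basis {w_1, w_2} spanning an m-dimensional subspace of C^p.
W, _ = np.linalg.qr(rng.standard_normal((p, m)) + 1j * rng.standard_normal((p, m)))

x = rng.standard_normal(p) + 1j * rng.standard_normal(p)

# Analysis formulas: y_k = w_k^H x; reconstruction: x_hat = sum_k y_k w_k.
y = W.conj().T @ x
x_hat = W @ y

# The reconstruction error e = x - x_hat lies outside the span of the basis.
e = x - x_hat
print(np.linalg.norm(W.conj().T @ e))   # expected close to 0
```

Because m < p, the residual itself is generically nonzero; only its projection onto the basis vanishes, which is exactly why minimizing a function of ‖e‖² is a meaningful learning criterion.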
3.1 Symmetrical Nonquadratic Complex-Valued Reconstruction Error Minimization. In the symmetrical network case, the reconstruction error may be formally defined as

$$e \stackrel{\mathrm{def}}{=} x - \hat{x} = (I - WW^H)x. \tag{3.1}$$

It may be shown that, in the real-valued case as well as for complex-valued signals, minimizing the mean square reconstruction error leads to principal subspace analysis. The real-valued case has been investigated by Xu (1992). In the complex-valued case of interest here, this property has been analytically shown by Yang (1995), who proved the following two results:

Theorem 4. W is a stationary point of Y(W) ≐ E[‖x − WW^Hx‖²] if and only if W = U_mQ, where U_m ∈ C^{p×m} contains any m distinct eigenvectors of Σ ≐ E[xx^H] and Q ∈ C^{m×m} is a unitary matrix.

Theorem 5. All stationary points of Y(W) are saddle points, except when U_m contains the m dominant eigenvectors of Σ. In this case, Y(W) attains the global minimum.

Note the presence of the arbitrary unitary matrix Q, meaning that the network's connection pattern W that minimizes the reconstruction error
818
S. Fiori
corresponds to coefficients $y = Q^H U_m^H x$; but, in general, $E[y_k \bar{y}_j] \neq 0$ for $k \neq j$, which means the network-transformed signals are still correlated. One of the most interesting features of this approach is that the optimization problem of searching for a weight matrix $W$ that minimizes the mean square reconstruction error is unconstrained: no constraints have to be added to the error cost function in order for the minimization problem to be consistent. This is a notable difference with respect to the material presented in section 2.

In order to reduce the effects of noise, disturbances, and outliers affecting the data sets, generalized versions of the mean square reconstruction error minimization-based PSA algorithms have been developed for analyzing real-valued signals. In what follows, we extend this enhanced theory to the complex domain. Instead of minimizing the quantity $E[\|e\|^2]$, here we derive a generalized PSA algorithm by minimizing, through a gradient-based search procedure, the more general cost

$$Z(w_k) \stackrel{\text{def}}{=} E[\phi(\|e\|^2)], \tag{3.2}$$

thought of as a function of any single vector $w_k$ for ease of notation, where $k = 1, 2, \ldots, m$. The function $\phi(u)$, with $u \geq 0$, should be differentiable and convex and should possess a unique minimum at $u = 0$. Its choice is crucial for the behavior of the PSA algorithm and will be discussed below. In order to minimize the function $Z(\cdot)$ with respect to each $w_k$ by means of a gradient steepest-descent recursive algorithm, its gradients are needed. Therefore, we want to evaluate

$$\frac{\partial Z(w_k)}{\partial w_k} = E\left[\frac{d\phi(\|e\|^2)}{d(\|e\|^2)}\,\frac{\partial \|e\|^2}{\partial w_k}\right]. \tag{3.3}$$
Note that, from definition 3.1, there holds $\|e\|^2 = x^H x - 2x^H WW^H x + x^H WW^H WW^H x$. The expression of the gradient of $\|e\|^2$ is

$$\frac{\partial \|e\|^2}{\partial w_k} = 2(Wy - 2x)\bar{y}_k + 2(y^H W^H w_k)\,x.$$

By recalling the identity $Wy = -e + x$, the above expression simplifies into

$$\frac{1}{2}\frac{\partial \|e\|^2}{\partial w_k} = -e\bar{y}_k - x\,(e^H w_k). \tag{3.4}$$
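As a quick numerical check of equation 3.4 (a sketch: the dimensions, random seed, and unit index `k` are arbitrary choices, not taken from the paper), the simplified gradient can be compared with the expanded form it was derived from:

```python
import numpy as np

rng = np.random.default_rng(0)
p, m = 4, 2
W = rng.normal(size=(p, m)) + 1j * rng.normal(size=(p, m))
x = rng.normal(size=p) + 1j * rng.normal(size=p)

y = W.conj().T @ x          # network outputs y = W^H x
e = x - W @ y               # reconstruction error, equation 3.1
k = 0

# expanded gradient of ||e||^2 with respect to w_k
g_full = 2 * (W @ y - 2 * x) * np.conj(y[k]) \
         + 2 * (y.conj() @ (W.conj().T @ W[:, k])) * x
# equation 3.4 rearranged: d||e||^2/dw_k = -2 (e ybar_k + x e^H w_k)
g_simpl = -2 * (e * np.conj(y[k]) + x * (e.conj() @ W[:, k]))

assert np.allclose(g_full, g_simpl)
```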
Let us define $\psi(u) \stackrel{\text{def}}{=} \frac{d\phi(u)}{du}$. This is what Liano (1996) refers to as the influence function. Then the gradient steepest-descent learning equations for generalized principal subspace extraction for complex-valued signals read

$$\frac{dw_k}{dt} = -\frac{\partial Z(w_k)}{\partial w_k} = E[\psi(\|e\|^2)(e\bar{y}_k + x\,e^H w_k)], \quad k = 1, 2, \ldots, m, \tag{3.5}$$

with $e$ being defined by equation 3.1. Choosing the nonlinear function $\psi(\cdot)$ is not generally an easy task. Song et al. (1998) proposed a theory linking the choice of $\psi(\cdot)$ to the statistical features of the reconstruction error for real-valued signals. The aim of the following section is to show how these results may be extended to the complex-valued case.

3.2 The Song-Yilong-Feng Principle and Its Extension to the Complex-Valued Case. The Song-Yilong-Feng theory for a real-valued ergodic input signal $x$ is based on the definition of the random variable
$$\varepsilon(x, W) \stackrel{\text{def}}{=} \|x - WW^T x\|^2. \tag{3.6}$$
Let us suppose that several random input vectors $\{x_n\}_{n=1}^N$ have been collected. For each input sample, we may compute a reconstruction error value $\varepsilon_n = \varepsilon(x_n, W)$. The likelihood of $W$ is then defined as the probability

$$L(E|W) \stackrel{\text{def}}{=} q(\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_N | W), \tag{3.7}$$

where $E \stackrel{\text{def}}{=} [\varepsilon_1\ \varepsilon_2 \cdots \varepsilon_N]$ and $q(\cdot, \ldots, \cdot|\cdot)$ represents the joint probability density function of the $\varepsilon_n$ subject to the hypothesis $W$. Under the hypothesis that the $x_n$ are identically distributed and statistically independent (i.i.d.), the likelihood writes

$$L(E|W) = \prod_{n=1}^N q_\varepsilon(\varepsilon_n|W) = \prod_{n=1}^N q(\varepsilon_n), \tag{3.8}$$

in short notation.⁴ Having defined the likelihood of a network connection matrix under the observed reconstruction error values, we may invoke maximum likelihood estimation theory as a learning tool for the neural network. In order to make this learning principle profitable, we should hypothesize a distribution for the reconstruction error values.
⁴ It might be worth recalling that the successive $x_n$, but not their components, are supposed to be i.i.d.
In the following, we formulate the maximum likelihood theory in a version suitable for treating an infinite number of available samples and discuss some possible choices of the error distributions. We also relate the Song-Yilong-Feng theory to the conceptualization of robust classification proposed by Xu and Yuille (1995). Let us suppose that the probability density function of the square reconstruction error $\varepsilon$ has the negative exponential (one-sided Laplacean) form

$$q_E(u) \stackrel{\text{def}}{=} \kappa e^{-\kappa u}\,\mathbb{1}(u), \tag{3.9}$$

where $\kappa > 0$ denotes the Laplacean dispersion parameter and $\mathbb{1}(u)$ again denotes the unit-step function. Then the log likelihood takes on the expression

$$\log L(E|W) = N\log\kappa - \kappa\sum_{n=1}^N \|x_n - WW^T x_n\|^2.$$

The meaning of this expression is that the maximum likelihood principle and the minimum average square error principle coincide if the conditional probability density function $q_\varepsilon(\varepsilon|W)$ is assumed as in equation 3.9. Note that the above statement is based on the hypothesis that $N$ is finite; otherwise, the log likelihood could be unbounded. On the basis of this observation, Song et al. (1998) proposed extending the idea to a generic probability density function, replacing the average-square-error principle with maximum likelihood estimation of the weight matrix $W$. It consists in looking for a weight matrix maximizing the quantity

$$\log L(E|W) = \sum_{n=1}^N \log q(\varepsilon_n). \tag{3.10}$$
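As an illustration (a minimal sketch on synthetic real-valued data; all sizes and parameter values are arbitrary), the sample log likelihood 3.10 under the Laplacean model 3.9 can be checked against the closed form $N\log\kappa - \kappa\sum_n \varepsilon_n$ derived above:

```python
import numpy as np

rng = np.random.default_rng(1)
p, m, N = 5, 2, 200
X = rng.normal(size=(N, p))                     # real-valued samples x_n (rows)
W, _ = np.linalg.qr(rng.normal(size=(p, m)))    # some candidate weight matrix

eps = np.sum((X - X @ W @ W.T) ** 2, axis=1)    # eps_n = ||x_n - W W^T x_n||^2

kappa = 0.5
loglik = np.sum(np.log(kappa * np.exp(-kappa * eps)))   # equation 3.10 with q = q_E
closed_form = N * np.log(kappa) - kappa * np.sum(eps)
assert np.isclose(loglik, closed_form)
```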
We aim at extending this optimization criterion in two ways. First, the maximum likelihood theory may be made rigorous for an infinite-size data set (i.e., for density-function-based descriptions instead of sample-based ones). The key step for this extension relies on the differential entropy associated with a probability distribution. Second, we want to extend the Song-Yilong-Feng principle to the complex domain. As will shortly be clear, this extension leads us to face again the problem of nonlinear function selection. This problem may be solved directly through the Song-Yilong-Feng principle (an especially interesting formulation comes from choosing the Cauchy-Lorentz distribution for the reconstruction error model) and also from a classification theory by Xu and Yuille (1995) based on statistical physics.
In the presence of infinitely many observations, instead of using the likelihood, the differential conditional entropy may be adopted. To this aim, let us define the function

$$H(W) \stackrel{\text{def}}{=} -E_x[\log q_\varepsilon(\varepsilon|W)\,|\,W] = -E[\log q(\varepsilon)] \tag{3.11}$$

in short. The quantity $H(W)$ plays the role of a "minus mean log likelihood." If we assume again the distribution 3.9 for the square error $\varepsilon$, we find $H(W) = -\log\kappa + \kappa E[\|e\|^2]$. If the reconstruction error is defined for complex-valued signals as in equation 2.64, minimizing the entropy $H$ with respect to $W$ also means minimizing the mean square error; in symbols,

$$\min_{W\in\mathbb{C}^{p\times m}} \{H(W)\} \iff \min_{W\in\mathbb{C}^{p\times m}} \{E[\|e\|^2]\}.$$

Following the idea in Song et al. (1998), the minimum entropy principle may be extended by assuming different probability density functions $q(\cdot)$. In order to find the matrix $W$ minimizing function 3.11, the gradient steepest-descent approach may be considered. Its use requires evaluating the gradient

$$\frac{\partial H(w_k)}{\partial w_k} = -E\left[\frac{q'(\|e\|^2)}{q(\|e\|^2)}\,\frac{\partial \|e\|^2}{\partial w_k}\right].$$
Defining the statistical score function $\varsigma(u) \stackrel{\text{def}}{=} -\frac{q'(u)}{q(u)}$, the gradient steepest-descent algorithm pertaining to the minimization of the entropy 3.11 reads

$$\frac{dw_k}{dt} = E[\varsigma(\|e\|^2)(e\bar{y}_k + x\,e^H w_k)]. \tag{3.12}$$
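As a concrete illustration, rule 3.12 can be simulated by replacing the expectation with a per-sample stochastic update. The sketch below uses the Cauchy-Lorentz score function introduced shortly; the data model, the score width `theta`, the learning rate `mu`, and the iteration count are assumed values, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(3)
p, m = 6, 2
theta, mu = 5.0, 0.02                              # assumed score width and learning rate
score = lambda u: 2 * u / (theta**2 + u**2)        # Cauchy-Lorentz score (see below)

# complex-valued data concentrated on an m-dimensional subspace, plus small noise
U = np.linalg.qr(rng.normal(size=(p, m)) + 1j * rng.normal(size=(p, m)))[0]
def sample():
    return U @ (3 * (rng.normal(size=m) + 1j * rng.normal(size=m))) \
           + 0.1 * (rng.normal(size=p) + 1j * rng.normal(size=p))

batch = [sample() for _ in range(200)]             # held-out set for monitoring
def mean_sq_err(W):
    return np.mean([np.linalg.norm(x - W @ (W.conj().T @ x))**2 for x in batch])

W = 0.1 * (rng.normal(size=(p, m)) + 1j * rng.normal(size=(p, m)))
err_start = mean_sq_err(W)
for _ in range(3000):
    x = sample()
    y = W.conj().T @ x
    e = x - W @ y                                  # equation 3.1
    s = score(np.linalg.norm(e)**2)
    # per-sample version of rule 3.12: dw_k ∝ s * (e * conj(y_k) + x * (e^H w_k))
    W += mu * s * (np.outer(e, y.conj()) + np.outer(x, e.conj() @ W))
err_end = mean_sq_err(W)

assert err_end < err_start       # reconstruction improves during learning
```

With this score, the per-sample weight $s$ shrinks for very large square errors, which is what makes the robust version studied in section 3.4 relatively insensitive to outliers.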
Note that equations 2.68 and 3.12 are identical except for the form of the functions $\psi(\cdot)$ and $\varsigma(\cdot)$. The choice of $\varsigma(\cdot)$ is driven by the hypotheses about the statistics of the scalar quadratic error $\|e\|^2$. In order to illustrate the usefulness of the differential-entropy-based approach, let us consider in detail the probability density function $q_E$ defined by equation 3.9 and the modified (one-sided) Cauchy-Lorentz distribution recommended in Liano (1996) and Song et al. (1998):

$$q_C(u) \stackrel{\text{def}}{=} \frac{2}{\pi}\,\frac{\theta}{\theta^2 + u^2}\,\mathbb{1}(u), \quad \theta > 0.$$
The most interesting difference between the Laplacean and Cauchy probability density functions is that the first distribution is characterized by a mean value and a variance, while the second has neither a mean value nor a variance.⁵ This special feature of the Cauchy distribution deserves some comment:

- If we assume a Cauchy distribution for the quadratic error $\|e\|^2$ instead of the Laplacean one, we do not need to embody any a priori hypotheses about the variability of the error in the algorithm. In other words, the risk of biasing the error distribution with inaccurate information is less serious than with the Laplacean distribution.
- Another interesting implication of the nonexistence of moments of the Cauchy distribution is that the effects of Cauchy-based modeling are less tied to the number of available samples. In fact, the practical meaning of the nonexistence of moments is that collecting several data points gives no more accurate an estimate of the moments than a single data point does.
- However, the Cauchy density does impose a scale limitation: the choice of the width parameter $\theta$ affects the width of the receptive fields of the neurons in the component and subspace analysis network.

The score function corresponding to the Cauchy distribution is

$$\varsigma_C(u) = \frac{2u}{\theta^2 + u^2}\,\mathbb{1}(u).$$
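The contrast between the two models shows up directly in the score functions. For the Laplacean density, $\varsigma_E(u) = -q_E'(u)/q_E(u) = \kappa$ is constant, so every sample is weighted equally, whereas the Cauchy score peaks at $u = \theta$ and then decays, so outlier-sized square errors receive vanishing weight (a sketch; the parameter values are arbitrary):

```python
import numpy as np

theta, kappa = 5.0, 0.5
score_E = lambda u: kappa                      # Laplacean score: constant weight
score_C = lambda u: 2 * u / (theta**2 + u**2)  # Cauchy-Lorentz score

grid = [0.1, 1.0, theta, 25.0, 1e3]
assert score_C(theta) == max(score_C(u) for u in grid)   # maximum weight at u = theta
assert score_C(1e3) < 0.05 * score_C(theta)              # gross errors are down-weighted
assert score_E(1e3) == score_E(0.1)                      # Laplacean weight never decays
```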
Several expressions for the function $\phi(\cdot)$ in equation 3.2 have been proposed in the literature (Karhunen & Joutsensalo, 1994, 1995; Xu & Yuille, 1995), usually justified by invoking the theory of robust statistics. These different choices may be discussed using the Song-Yilong-Feng interpretation. In order to accomplish this, it is useful to equate the functions $\psi(u)$ and $\varsigma(u)$. It is worth recalling that

$$\psi(u) = \frac{d\phi(u)}{du}, \qquad \varsigma(u) = -\frac{1}{q(u)}\frac{dq(u)}{du},$$

where the function $\phi(u)$ was introduced in formula 3.2 and the probability density function $q(u)$ takes part in formula 3.8. From these relationships, the following formula is readily found:

$$\phi(u) = -\log q(u) + c, \tag{3.13}$$
5 In fact, the Cauchy distribution does not admit any statistical moment because its characteristic function is not differentiable at the origin.
where $c \in \mathbb{R}$ is a constant. The above formula may be used in two directions. First, if the probability distribution model for the reconstruction noise $q(\cdot)$ has been selected, then the corresponding warping function $\phi(\cdot)$ may be computed by formula 3.13. Care should be taken that the resulting function $\phi(\cdot)$ fulfill the regularity and convexity requirements in its definition; otherwise, the pair $(q, \phi)$ must be rejected as not valid in the present framework. If the pair $(q, \phi)$ is acceptable, the constant $c$, whose value does not influence the shape of the warping function $\phi(\cdot)$ but only its vertical placement, may be chosen arbitrarily (e.g., in order to simplify the expression of $\phi(\cdot)$). Alternatively, formula 3.13 may help in discussing some choices of $\phi(\cdot)$ in terms of probability distributions for the square reconstruction error. In this case, care should be taken that the resulting function $q(\cdot)$ be a probability density function, that is, a positive integrable function with unitary area; the constant $c$ may then be determined by the formula $c = -\log\int_0^{+\infty} e^{-\phi(\tau)}\,d\tau$.

As two examples of the first case, we may find the functions $\phi(\cdot)$ corresponding to the Laplacean and Cauchy-Lorentz distributions. Straightforward computations show that they write

$$\phi_E(u) = -\log\kappa + c + \kappa u,$$
$$\phi_C(u) = -\log\frac{2\theta}{\pi} + c + \log(\theta^2 + u^2),$$

respectively. In these cases, the constant $c$ may be chosen to be zero or, for example in the Cauchy distribution case, as $c = \log\frac{2\theta}{\pi}$, which simplifies the expression of $\phi_C(u)$. The obtained functions may easily be verified to fulfill the required restrictions, such as regularity and convexity, at least in a properly chosen right-sided neighborhood of the origin. As an example of the second case, let us consider the warping function $\phi_K(u) \stackrel{\text{def}}{=} \log\cosh(\beta u)$ found in Karhunen and Joutsensalo (1994, 1995). The corresponding $q(\cdot)$ function is computed to be

$$q_K(u) = \frac{2\beta}{\pi}\,\frac{2}{e^{-\beta u} + e^{\beta u}}\,\mathbb{1}(u),$$

which may easily be proven to be a probability density and shall hereafter be referred to as the K density model. The shapes of the above distributions $q_E$, $q_C$, and $q_K$, along with the corresponding warping functions $\phi_E$, $\phi_C$, and $\phi_K$, are depicted in Figure 7 for arbitrary parameter values. In both the negative exponential distribution model and the K model, as the square reconstruction error $\varepsilon$ increases, the density function value decreases rapidly. This means that in those distributions a large reconstruction error (usually associated with disturbances) has a small probability of occurring; the model thus does not properly take into account the presence of outliers. Conversely,
Figure 7: (Left) Reconstruction error distribution functions $q_E$ ($\kappa = 0.5$; Song et al., 1998), $q_C$ ($\theta = 5/\pi$; Song et al., 1998), and $q_K$ ($\beta = 1$; Karhunen & Joutsensalo, 1995). (Right) Warping functions $\phi_E$, $\phi_C$, and $\phi_K$.
the use of the modified Cauchy distribution improves the robustness of the PSA algorithm, since it decreases more slowly than the former two.

To end, it is interesting to discuss the relationship between the Song-Yilong-Feng theory and the conceptualization proposed by Xu and Yuille (1995) based on statistical physics. Let us consider the reconstruction error $\varepsilon(x, w)$ due to a one-unit artificial neural network,

$$\varepsilon(x, w) \stackrel{\text{def}}{=} \|x - (w^H x)w\|^2, \quad x \in \{x_n\}_{n=1}^N.$$

Xu and Yuille (1995) considered a neuron's energy function to be minimized for statistical learning,

$$K(f, w) \stackrel{\text{def}}{=} \sum_{n=1}^N f_n\,\varepsilon(x_n, w) + \eta\sum_{n=1}^N (1 - f_n), \tag{3.14}$$
where $\eta$ is a constant, the $f_k$'s are random binary variables acting as decision flags indicating whether $x_k$ is an outlier ($f_k = 0$) or not ($f_k = 1$), and $f \stackrel{\text{def}}{=} [f_1\ f_2 \cdots f_N]^T$. The aim is to minimize $K(f, w)$ with respect to $f$
and $w$ simultaneously. To solve this difficult problem, Xu and Yuille defined a Gibbs distribution,

$$\rho(f, w) \stackrel{\text{def}}{=} \frac{1}{Z}\,e^{-\beta K(f, w)},$$

where $Z$ is a normalization constant and $1/\beta$ stands for a "temperature" (for a discussion of the meaning of the parameters $\eta$ and $\beta$, see Xu & Yuille, 1995). Then they proposed to approximate $\rho(f, w)$ by computing the marginal probability distribution $\rho_m(w)$ obtained by averaging $\rho(f, w)$ over all possible configurations $f$ (which may be thought of as a kind of mean-field approximation). This gives (Xu & Yuille, 1995) $\rho_m(w) = \frac{1}{Z_m}\,e^{-\beta K_{\text{eff}}(w)}$, with $Z_m \stackrel{\text{def}}{=} Z\,e^{N\beta\eta}$ and

$$-K_{\text{eff}}(w) = \frac{1}{\beta}\sum_{n=1}^N \log\{1 + e^{-\beta[\varepsilon(x_n, w) - \eta]}\}. \tag{3.15}$$
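The gating behavior of equation 3.15 can be made concrete: each sample contributes $t(\varepsilon) = \frac{1}{\beta}\log\{1 + e^{-\beta[\varepsilon - \eta]}\}$ to $-K_{\text{eff}}$, a quantity that decreases monotonically with the reconstruction error and is essentially zero for gross outliers. A sketch, using the parameter values that appear in Figure 8:

```python
import numpy as np

beta, eta = 0.5, 4 / np.pi

def t(eps):
    """Per-sample contribution to -K_eff in equation 3.15."""
    return np.log1p(np.exp(-beta * (eps - eta))) / beta

assert t(0.0) > t(eta) > t(50.0) > 0.0   # monotonically decreasing in the error
assert t(50.0) < 1e-8                    # gross outliers are effectively ignored
```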
The optimal connection configuration $w_{\text{opt}}$ is found by maximizing $-K_{\text{eff}}(w)$. It is now interesting to note that $-K_{\text{eff}}(w)$ is very similar to $\log L(E|w)$ in equation 3.10; thus, we may define a probability density function

$$q_X(u) \stackrel{\text{def}}{=} A\,[1 + e^{-\beta(u-\eta)}]^{\frac{1}{\beta}}\,(1 - \mathbb{1}(u - \Delta)), \tag{3.16}$$

where $A$ is a normalization constant and the factor $1 - \mathbb{1}(u - \Delta)$ makes $q_X(u)$ vanish outside a compact domain $[0, \Delta]$. The normalization constant can then be found by $\frac{1}{A} = \frac{1}{\beta}\int_D^1 u^{-1}(1 + Cu)^{\frac{1}{\beta}}\,du$, where $D \stackrel{\text{def}}{=} \exp(-\beta\Delta)$ and $C \stackrel{\text{def}}{=} \exp(\beta\eta)$. Note that $\Delta$ may be arbitrarily large, albeit bounded. A distribution $q_X$ is shown in Figure 8. The distribution $q_X(u)$ represents an equivalent reconstruction-error distribution that would make it possible to select a proper nonlinear warping function in reconstruction-error-based learning via the Xu-Yuille statistical physics-based theory.

3.3 Component-Wise Nonquadratic Reconstruction Error Minimization. The learning theory discussed in section 3.1 allows extracting only a principal subspace from a set of complex-valued random signals. A neural network trained by means of that algorithm is not able to effect PCA, substantially because of its inherent symmetry. With the aim of breaking the symmetry and making the network hierarchical, it is possible to define a reconstruction error vector for each neuron (Karhunen & Joutsensalo, 1995) and to design a learning rule for each neuron so that it tries to minimize a local error. Again, only linear reconstruction is dealt with. A possible definition is

$$e_k = x - \sum_{j=1}^{K(k)} y_j w_j, \tag{3.17}$$
Figure 8: Reconstruction error distribution function $q_X$ ($\beta = 0.5$, $\eta = 4/\pi$ as in Xu & Yuille, 1995, and $\Delta = 3$).
where $k$ ranges from 1 to $m$, the number of neurons. The indexing function $K(k)$ determines whether the network is symmetrical. If $K(k) = m$, we come back to the symmetrical case: all the neurons learn to correct the total error $e = x - \sum_{j=1}^m y_j w_j$, while $K(k) = k$ makes the network hierarchical.⁶ Following Karhunen and Joutsensalo (1995), here we discuss the optimization of the cost function

$$R(e_k) \stackrel{\text{def}}{=} \sum_{r=1}^p E[f(e_{kr})], \tag{3.18}$$

where, contrary to the case discussed in section 3.1, it is the $r$th entry of the $k$th error vector $e_k$, namely $e_{kr}$, that is warped by the function $f(\cdot)$. Again, $f(\zeta) = g(|\zeta|)$ for $\zeta \in \mathbb{C}$, with $g(u)$ differentiable and convex, with a unique minimum at $u = 0$, in a convenient right-sided neighborhood of the origin.

⁶ Note that in the hierarchical case, the reconstruction error may be defined recursively as $e_1 = x - y_1 w_1$, while for $k \geq 2$: $e_k = e_{k-1} - y_k w_k$.

Evaluating the gradient of $R(e_k)$ with respect to the weight vector $w_k$ involves rather
difficult operations that may be considered interesting from a methodological point of view; hence, we provide some details about its computation. In particular, we need the Hadamard product operator, defined as

$$a \circ c \stackrel{\text{def}}{=} \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_p \end{bmatrix} \circ \begin{bmatrix} c_1 \\ c_2 \\ \vdots \\ c_p \end{bmatrix} = \begin{bmatrix} a_1 c_1 \\ a_2 c_2 \\ \vdots \\ a_p c_p \end{bmatrix},$$

with $a, c$ being $p$-dimensional real- or complex-valued vectors. Note that the Hadamard product is commutative and associative. Furthermore, we need a set of vectors $b_r$: each $b_r \in \mathbb{R}^p$ is defined so that its entries are 0 except for the $r$th one, equal to 1 (i.e., $\{b_r\}$ is the canonical basis of $\mathbb{R}^p$). To begin the calculation of $\frac{\partial R(e_k)}{\partial w_k}$, let us consider the quantity $|e_{kr}|^2$, which has the structure

$$|e_{kr}|^2 = \bar{x}_r x_r - 2P(e_{kr}) + Q(e_{kr}), \tag{3.19}$$
where the following quantities have been defined:

$$P(e_{kr}) \stackrel{\text{def}}{=} \sum_{j=1}^{K(k)} \mathrm{Re}\{y_j \bar{x}_r w_{jr}\},$$

$$Q(e_{kr}) \stackrel{\text{def}}{=} \sum_{j=1}^{K(k)}\sum_{s=1}^{K(k)} \mathrm{Re}\{\bar{y}_j y_s \bar{w}_{jr} w_{sr}\}.$$
The gradient of $P(e_{kr})$ computed with respect to the complex-valued weight vector $w_k$ is found to be

$$\frac{\partial P(e_{kr})}{\partial w_k} = \bar{y}_k\,(x \circ b_r) + (\bar{x}_r w_{kr})\,x. \tag{3.20}$$

About the gradient of $Q(e_{kr})$, after some mathematical work, we find

$$\frac{1}{2}\frac{\partial Q(e_{kr})}{\partial w_k} = \sum_{j=1}^{K(k)} \left[\bar{y}_k y_j\,(w_j \circ b_r) + \bar{y}_j \bar{w}_{jr} w_{kr}\,x\right]. \tag{3.21}$$
Now the effects of the nonlinearity $f(\cdot)$ have to be taken into account. For notational compactness, let us define the functions

$$b(u) \stackrel{\text{def}}{=} \frac{dg(u)}{d(u^2)}, \quad u \geq 0, \qquad B(\zeta) \stackrel{\text{def}}{=} \frac{b(|\zeta|)}{|\zeta|}, \quad \zeta \in \mathbb{C}. \tag{3.22}$$
Let us evaluate the gradient of $R(e_k)$ with respect to the weight vector $w_k$:

$$\frac{\partial R(e_k)}{\partial w_k} = \frac{1}{2}\,E\left[\sum_{r=1}^p \frac{b(|e_{kr}|)}{|e_{kr}|}\,\frac{\partial |e_{kr}|^2}{\partial w_k}\right]. \tag{3.23}$$
From equations 3.19 to 3.23, rearranging terms, recalling the use of the Hadamard operator $\circ$, and allowing the function $B(\cdot)$ to operate component-wise, we have

$$\frac{1}{2}\frac{\partial R(e_k)}{\partial w_k} = E\left[\Big(\sum_{j=1}^{K(k)} y_j w_j\Big)^{\!H} [B(e_k) \circ w_k]\,x\right] + E\left[\bar{y}_k\, B(e_k) \circ \sum_{j=1}^{K(k)} y_j w_j\right] - E\left[\bar{y}_k\,(x \circ B(e_k)) + x^H [B(e_k) \circ w_k]\,x\right].$$

By replacing the term $\sum_{j=1}^{K(k)} y_j w_j$ with $x - e_k$, we obtain the gradient steepest-descent learning algorithm:

$$\frac{dw_k}{dt} = -\frac{1}{2}\frac{\partial R(e_k)}{\partial w_k} = E\left[(e_k^H [B(e_k) \circ w_k])\,x + \bar{y}_k\,[B(e_k) \circ e_k]\right]. \tag{3.24}$$
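A single stochastic step of rule 3.24 can be sketched as follows (hierarchical case $K(k) = k$; the data, the Cauchy-based choice of $B(\cdot)$, and the step size `mu` are illustrative assumptions, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(4)
p, m, mu, theta = 5, 3, 0.01, 1.5
W = 0.1 * (rng.normal(size=(p, m)) + 1j * rng.normal(size=(p, m)))

def B(z):                                     # Cauchy-based choice of B(.), cf. eq. 3.28
    return 2 * np.abs(z) / (theta**2 + np.abs(z)**4)

x = rng.normal(size=p) + 1j * rng.normal(size=p)
y = W.conj().T @ x
for k in range(m):                            # hierarchical indexing: K(k) = k
    e_k = x - W[:, :k + 1] @ y[:k + 1]        # local reconstruction error, eq. 3.17
    Bk = B(e_k)                               # B(.) operating component-wise
    # one-sample step of eq. 3.24; elementwise '*' realizes the Hadamard product
    W[:, k] += mu * ((e_k.conj() @ (Bk * W[:, k])) * x + np.conj(y[k]) * (Bk * e_k))

assert np.all(np.isfinite(W))
```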
Let us consider the nonquadratic warping function $g(u) \stackrel{\text{def}}{=} \frac{2}{3}u^3$ for $u \in \mathbb{R}_0^+$. This choice yields $B(z) = 1$ and hence the learning algorithm

$$\frac{dw_k}{dt} = E[(e_k^H w_k)\,x + \bar{y}_k\,e_k], \quad k = 1, 2, \ldots, m, \tag{3.25}$$

which coincides with the learning rule 3.5 when $\psi(\cdot) = 1$ and $K(k) = m$. Also, in the presence of real-valued input signals, the quantities involved in equation 3.24 are real valued; thus, it may be rewritten as

$$-\frac{1}{2}\frac{\partial R(e_k)}{\partial w_k} = E\left[(e_k^T [B(e_k) \circ w_k])\,x + y_k\,[B(e_k) \circ e_k]\right]. \tag{3.26}$$

Now, since $B(u) = b(u)/u$ for $u \in \mathbb{R}$, if we allow $b(\cdot)$ to operate component-wise, the following identities hold: $B(e_k) \circ e_k = b(e_k)$ and $e_k^T[B(e_k) \circ w_k] = w_k^T b(e_k)$. Therefore, the learning rule corresponding to the gradient 3.26 for a real-valued PCA neural network reads

$$\frac{dw_k}{dt} = E[(w_k^T b(e_k))\,x + (w_k^T x)\,b(e_k)], \quad k = 1, 2, \ldots, m. \tag{3.27}$$
This learning rule coincides with the nonclassic PCA extraction rule discussed, for instance, in Karhunen and Joutsensalo (1995). With reference to the choice of the nonlinear function $g(\cdot)$, it seems reasonable to extend the entropy version of the Song-Yilong-Feng theory in order to model the probability density function of any single reconstruction error $e_k$. Formally, we can then particularize the cost function 3.18 as $-\sum_{j=1}^p E[\log q(|e_{kj}|^2)]$, which ultimately means identifying $g(u) = -\log q(u^2)$. By choosing as the conditional probability density function the one-sided Cauchy distribution, plugging this $g(u)$ into the definition of $B(\cdot)$ leads to

$$B_C(z) = \frac{2|z|}{\theta^2 + |z|^4}. \tag{3.28}$$
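Equation 3.28 can be verified numerically by differentiating $g(u) = -\log q_C(u^2)$ and forming $B(u) = b(u)/u$ with $b(u) = dg/d(u^2)$ (a sketch; the value of $\theta$ and the test point are arbitrary):

```python
import numpy as np

theta = 1.5

def g(u):                                    # g(u) = -log q_C(u^2), one-sided Cauchy model
    return -np.log((2 / np.pi) * theta / (theta**2 + u**4))

def B_C(z):                                  # closed form, equation 3.28
    return 2 * abs(z) / (theta**2 + abs(z)**4)

u, h = 0.8, 1e-6
dg_du = (g(u + h) - g(u - h)) / (2 * h)      # numerical derivative dg/du
b = dg_du / (2 * u)                          # chain rule: dg/d(u^2) = (dg/du) / (2u)
assert np.isclose(b / u, B_C(u), rtol=1e-5)  # B(u) = b(u)/u matches equation 3.28
```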
This result shows that the approach in Song et al. (1998) may be successfully extended to the component-wise reconstruction error minimization method.

The extended Hebbian learning theory described in this section relates to a series of contributions. One of them is Xu's interpretation of extended Hebbian learning known as maximum uncertainty theory (a detailed analysis was recently presented in Fiori, 2001). Other interesting contributions to this topic are the learning theories by Corchado and Fyfe (cited in Fiori, 2003b), Miao and Hua (1998), Plumbley (1992), Shirazi, Peper, and Sawai (1999), and Higuchi and Eguchi (2004).

3.4 Illustrative Numerical Experiments. In order to illustrate by examples the concept of principal component and subspace analysis in the complex domain by nonquadratic reconstruction error minimization, the results of some numerical simulations are reported below. The experiments aimed at illustrating the convergence of the learning algorithms 3.5 and 3.24 and at comparing the performances of the robust and nonrobust versions in the presence of complex-valued data corrupted by gross outliers. Robustness should emerge from a proper choice of the nonlinear warping functions $\psi(\cdot)$ and $\varsigma(\cdot)$. The experiments performed retrace the ones suggested in Song et al. (1998) and Xu and Yuille (1995). A network with two inputs $x_1$ and $x_2$ and one output is trained with a set of uncorrupted samples and with the same data set corrupted by 3% of outliers. The two data sets are shown in Figure 9. Note that for a single-unit neural network, the distinction between PCA and PSA disappears. The network weight vector is denoted here by $w$. Furthermore, the quantity $q_1$ denotes the first normalized eigenvector of the covariance matrix of the uncorrupted data (i.e., the true eigenvector). As the component and subspace analysis performance index, we consider the principal angle $\gamma$, here defined in the following way:

$$\gamma \stackrel{\text{def}}{=} \cos^{-1}\frac{|w^H q_1|}{\sqrt{w^H w}}.$$
Figure 9: Experiments on reconstruction error minimization-based learning: Uncorrupted data set (top) and data set with outliers (bottom).
The outliers are such that they produce a deviation between the true first eigenvector and the noisy first eigenvector of about 30 degrees. First, algorithms 3.5 and 3.24 were trained on the uncorrupted data, and both were able to detect the principal eigenvector, corresponding to a principal angle value close to zero. Second, algorithm 3.5 with $\psi(\cdot) = 1$ was trained on the noisy data; the results are shown in Figure 10. The same algorithm, trained on the same data set, was made robust by introducing the Cauchy-Lorentz nonlinearity $\psi(\cdot) = \varsigma_C(\cdot)$ and gave instead the result shown in Figure 11, which pertains to the value of the Cauchy parameter $\theta = 5$. Figures 10 and 11 show the behavior of the principal angle $\gamma$ as well as the value of the norm of the reconstruction error $\|e\|$ during learning. The net effect of the introduction of the Cauchy nonlinearity is apparent, as it allows almost completely neglecting the effects of outliers on learning. The results in both the robust and nonrobust cases pertaining to algorithm 3.24 are completely equivalent and are therefore not shown.
Figure 10: Numerical result obtained with algorithm 3.5, in the nonrobust version ($\psi(\cdot) = \varsigma(\cdot) = 1$).
4 Conclusion

We initially conceived of this review around 1997 and have been researching it since then. It has led to a great deal of research, and new material has been added through the years as soon as it became available in the scientific literature. As the principal component analysis literature is very large and fragmented, this contribution aimed at bringing together a number of mathematical results for principal component analysis and its generalizations, such as principal subspace and minor component analysis, in the complex domain. The emphasis of the article was on extensions to complex-valued neural networks and on relating these to a number of previous results known from the scientific literature. In particular, extensions of previous research work and relationships were sought on Hebbian learning for linear, complex-weighted, feedforward, and laterally connected neural networks. Learning principles have been presented, and analyses have been carried out on complex-valued principal component, minor component, and subspace rules, both quadratic and nonquadratic.
Figure 11: Numerical result obtained with algorithm 3.5, in the robust version ($\psi(\cdot) = \varsigma(\cdot) = \varsigma_C(\cdot)$).
The starting point of the analysis was an optimization approach for complex-valued networks that aimed at producing a purposeful contribution. Many existing results could then be subsumed into this framework (although some may not). Neural network architectures for principal subspace analysis were considered in section 2. After reviewing previous work on the complex-valued linear and real-valued nonlinear cases, the equations for the complex-valued nonlinear case were derived using gradient optimization with Lagrange multipliers. Then neural principal component analyzers for the real-valued linear and complex-valued linear cases were reviewed and extended to the complex-valued nonlinear case, again using gradient-based optimization. The problem of the choice of the nonlinearity was mentioned, and an attempt was made to generalize observations of Sudjianto and Hassoun to the complex case. The analysis proceeded by extending some results from real-valued (linear) minor component analysis algorithms to the complex linear case.

Minimization of the reconstruction error was considered in section 3. Existing principal subspace and principal component approaches for the
real-valued nonlinear and complex-valued linear cases were reviewed, and gradient equations for the complex-valued nonlinear case were derived. The problem of choosing the nonlinearity was discussed based on a maximum likelihood approach by Song et al. (1998) for the real-valued nonlinear case. As a crucial part of the presented learning rules is the selection of a cost function, on which the derivation of the involved nonlinearity is based, we made some choices concerning this selection. We chose to consider only cost functions that take into account the magnitude of their complex argument. While this choice could seem natural, it was discussed and motivated with respect to its implicit assumptions through citations of relevant literature and an explanation of its usefulness in applied fields related to signal processing. The treatment of the shape of the nonlinearity seemed to provide insight that goes beyond the results known for nonlinearities in the real-valued case. The discussion of the choice of the nonlinearity and the related developments in independent component analysis is an example of this finding.

We believe this review will be useful to other researchers because it presents an overview of existing approaches to complex-valued linear and real-valued nonlinear versions of Hebbian learning algorithms and provides derivations of equations that may be useful for the complex-valued nonlinear case. It may be of interest to readers interested in the extension of Hebbian learning to the complex and nonlinear case.

Acknowledgments

I gratefully thank the anonymous reviewers, the action editor, Mark Plumbley, and the editor, Terrence Sejnowski, for the careful and accurate suggestions that helped to improve the clarity and the overall organization of this article, and Leslie-Anne Chaden for her constant constructive support in handling the manuscript.
The final version of this work was completed while I was a short-term visitor at the Mathematical Neuroscience Laboratory of the Brain Science Institute at the Institute of Physical and Chemical Research (RIKEN), Japan, from July to September 2004. I express my gratitude to the BSI director, Shunichi Amari, and to the laboratory members for the kindest and warmest hospitality.

References

Abbas, H. M., & Fahmy, M. M. (1994). Neural model for Karhunen-Loève transform with application to adaptive image compression. IEE Proceedings I, Communications, Speech and Vision, 140(2), 135–143.
Amari, S. (1977). Neural theory of association and concept formation. Biological Cybernetics, 26, 175–185.
Amari, S., & Cichocki, A. (1998). Adaptive blind signal processing: Neural network approaches. Proceedings of the IEEE, 86(10), 2026–2048.
Atick, J. J., & Redlich, A. N. (1993). Convergent algorithm for sensory receptive field development. Neural Computation, 5(1), 45–60.
Baldi, P. F., & Hornik, K. (1995). Learning in linear neural networks: A survey. IEEE Transactions on Neural Networks, 6(4), 837–858.
Bannour, S., & Azimi-Sadjadi, M. R. (1995). Principal component extraction using recursive least squares learning. IEEE Transactions on Neural Networks, 6(2), 457–469.
Barlow, H. B. (1989). Unsupervised learning. Neural Computation, 1, 295–311.
Barlow, H. B. (1998). Guest editorial: Cerebral predictions. Perception, 27(8), 885–888.
Barlow, H. B. (2001). Redundancy reduction revisited. Network: Computation in Neural Systems, 12(3), 241–253.
Barlow, H. B., & Földiák, P. (1989). Adaptation and decorrelation in the cortex. In C. Miall, R. M. Durbin, & G. J. Mitchison (Eds.), The computing neuron (pp. 54–72). Reading, MA: Addison-Wesley.
Bechtel, W., & Abrahamsen, A. (1993). Connectionism and the mind. Oxford: Blackwell.
Bell, A. J., & Sejnowski, T. J. (1995). An information maximization approach to blind separation and blind deconvolution. Neural Computation, 7(6), 1129–1159.
Benvenuto, N., & Piazza, F. (1992). On the complex back-propagation algorithm. IEEE Transactions on Signal Processing, 40(4), 967–969.
Bertsekas, D. P. (1996). Constrained optimization and Lagrange multiplier methods. Bedford, MA: Athena Scientific.
Bingham, E., & Hyvärinen, A. (2000). A fast fixed-point algorithm for independent component analysis of complex valued signals. International Journal of Neural Systems, 10(1), 1–8.
Celledoni, E., & Fiori, S. (2004). Neural learning by geometric integration of reduced "rigid-body" equations. Journal of Computational and Applied Mathematics, 172(2), 247–269.
Chen, T.-P., & Amari, S. (2001). Unified stabilization approach to principal and minor components extraction algorithms. Neural Networks, 14(10), 1377–1387.
Chen, T.-P., Amari, S., & Lin, Q. (1998). A unified algorithm for principal and minor components extraction. Neural Networks, 11(3), 385–390.
Chen, Y., & Hou, C. (1992). High resolution adaptive bearing estimation using a complex-weighted neural network. Proc. of International Conference on Acoustics, Speech and Signal Processing, 2, 317–320.
Chow, T. W., & Fang, Y. (2001). Neural blind deconvolution of MIMO noisy channels. IEEE Transactions on Circuits and Systems—Part I, 48(1), 116–120.
Cichocki, A., & Amari, S. (2002). Adaptive blind signal and image processing. New York: Wiley.
Costa, S., & Fiori, S. (2001). Image compression using principal component neural networks. Image and Vision Computing Journal [Special issue], 19(9–10), 649–668.
De Castro, M. C. F., De Castro, F. C. C., Amaral, J. N., & Franco, P. R. G. (1998). A complex valued Hebbian learning algorithm. Proc. of International Joint Conference on Neural Networks (pp. 1235–1238). Piscataway, NJ: IEEE.
Desodt, G., & Muller, D. (1990). Complex ICA applied to the separation of radar signals. Proc. of Signal Processing V: Theories and Applications (EUSIPCO), 1, 665–668.
Nonlinear Complex-Valued Extensions of Hebbian Learning
835
Diamantaras, K. I., & Kung, S. Y. (1996). Principal component neural networks: Theory and applications. New York: Wiley. Douglas, S. C., Amari, S. & Kung, S. Y. (2000). On gradient adaptation with unit-norm constraints. IEEE Transactions on Signal Processing, 48(6), 1843–1847. Fiori S. (2000). Blind separation of circularly distributed source signals by the neural extended APEX algorithm. Neurocomputing, 34(1–4), 239–252. Fiori S. (2001). A theory for learning by weight flow on Stiefel-Grassman manifold. Neural Computation, 13(7), 1625–1647. Fiori S. (2002). A theory for learning based on rigid bodies dynamics. IEEE Transactions on Neural Networks, 13(3), 521–531. Fiori S. (2003a). Extended Hebbian learning for blind separation of complex-valued sources. IEEE Transactions on Circuits and Systems—Part II, 50(4), 195–202. Fiori S. (2003b). A neural minor component analysis approach to robust constrained beamforming. IEE Proceedings—Vision, Image and Signal Processing, 150(4), 205– 218. Fiori S. (2003c). Neural ICA by “maximum-mismatch” learning principle. Neural Networks, 16(8), 1201–1221. Fiori S. (2003d). Fully-multiplicative orthogonal-group ICA neural algorithm. Electronics Letters, 39(24), 1737–1738. Fiori S. (2004). A fast fixed-point neural blind deconvolution algorithm. IEEE Transactions on Neural Networks, 15(2), 455–459. Fiori S. (in press). Fixed-point neural independent component analysis algorithms on the orthogonal group. Journal of Future Generation Computer Systems. Fiori, S., Faba, A., Albini, L., Cardelli, E., & Burrascano, P. (2003). Numerical modeling for the localization and the assessment of electromagnetic field sources. IEEE Transactions on Magnetics, 39(3), 1638–1641. Fiori, S., & Piazza, F. (2000). A general class of ψ–APEX PCA neural algorithms. IEEE Transactions on Circuits and Systems—Part I, 47(9), 1394–1398. Foldi´ ¨ ak P. (1990). Forming sparse representations by local anti-Hebbian learning. Biological Cybernetics, 64, 165–170. 
Fyfe, C., & MacDonald, D. (2002). ε-insensitive Hebbian learning. Neurocomputing, 47(1–4), 35–57. Gao, K., Ahmed, M. O., & Swamy, M. N. S. (1994). A constrained anti-Hebbian learning algorithm for total least-squares estimation with applications to adaptive FIR and IIR filtering. IEEE Transactions on Circuits and Systems–Part II, 41(11), 718– 729. Georgiou, G. M., & Koutsougeras, C. (1992). Complex-domain backpropagation. IEEE Transactions on Circuits and Systems—Part II, 39(5), 330–334. Hairer, E., Lubich, C., & Wanner (2002). Geometric numerical integration: Structure preserving algorithms for ordinary differential equations, Berlin: Springer-Verlag. Hanna, A. I., & Mandic, D. P. (2003). A complex-valued nonlinear neural adaptive filter with a gradient adaptive amplitude of the activation function. Neural Networks, 16(2), 155–159. Harpur, G. F. (1997). Low entropy coding with unsupervised neural networks, Unpublished doctoral disseratation, University of Cambridge. Haykin, S. (1989). An introduction to analog and digital communications. New York: Wiley.
836
S. Fiori
Haykin, S. (1998). Neural networks: A comprehensive foundation, (2nd eds). Upper Saddle River, NJ: Prentice Hall. Hebb, D. (1949). The organization of behaviour. New York: Wiley. Helmke, U., & Moore, J. B. (1993). Optimization and dynamical systems. Berlin: Springer-Verlag. Higuchi, I., & Eguchi, S. (2004). Robust principal component analysis with adaptive selection for tuning parameters. Journal of Machine Learning Research, 5, 453–471. Hyv¨arinen, A. (1999). Fast and robust fixed-point algorithms for independent component analysis. IEEE Transactions on Neural Networks, 10(3), 626–634. Hyv¨arinen, A., Karhunen, J., & Oja, E. (2001). Independent component analysis. New York: Wiley. Iserles, A., Munthe-Kaas, H. Z., Nørsett, S. P., & Zanna, A. (2000). Lie-group methods. Acta Numerica, 9, 215–365. Karhunen, J., & Joutsensalo, J. (1994). Representation and separation of signals using nonlinear PCA type learning. Neural Networks, 7(1), 113–127. Karhunen, J., & Joutsensalo, J. (1995). Generalizations of PCA, optimization problems, and neural networks. Neural Networks, 8(4), 549–562. Kates, J. M. (1993). Superdirective arrays for hearing aids. Journal of Acoustics Society of America, 94(4), 1930–1933. Klemm, R. (1987). Adaptive airborne MTI: An auxiliary channel approach. IEE Proceedings F, 134, 269–276. Kung, S.-I., Diamantaras, K. I., & Taur J. S. (1994). Adaptive principal component extraction (APEX) and applications. IEEE Transactions on Signal Processing, 42(5), 1202–1217. Laheld, B., & Cardoso, J. F. (1994). Adaptive source separation with uniform performance. Proc. of Signal Processing VII: Theories and Applications (EUSIPCO), 1, 183–186. Levine, D. S., Brown, V. R., & Shirey, T. V. (1999). Oscillations in neural systems. Mahwah, NJ: Erlbaum. Liano, K. (1996). Robust error measure for supervised neural network learning with outliers. IEEE Transactions on Neural Networks, 7(1), 246–250. Linsker, R. (1992). 
Local synaptic rules suffice to maximize mutual information in a linear network. Neural Computation, 4, 691–702. Luo, F.-L., & Unbehauen R. (1997). Applied neural networks for signal processing. Cambridge: Cambridge University Press. Mathew, G., & Reddy, V. U. (1994). Development and analysis of a neural network approach to Pisarenko’s harmonic retrieval method. IEEE Transactions on Signal Processing, 42(3), 663–667. Mathis, H., & Douglas, S. C. (2002). On the existence of universal non-linearities for blind source separation. IEEE Transactions on Signal Processing, 50(5), 1007–1016. Miao, Y., & Hua, Y. (1998). Fast subspace tracking and neural network learning by a novel information criterion. IEEE Transactions on Signal Processing, 46(7), 1967– 1979. Michaels, R. B., & Upadhyaya, B. R. (1999). A complex valued neural network with local learning laws. In C. H. Dagli et al. (Eds.), Intelligent engineering systems through artificial neural networks (Vol. 9, pp. 101–109). New York: ASME Press.
Nonlinear Complex-Valued Extensions of Hebbian Learning
837
Miller, K. D., & MacKay, D. J. C. (1994). The role of constraints in Hebbian learning. Neural Computation, 6(1), 100–126. Miyauchi, M., Seki, M., Watanabe, A., & Miyauchi, A. (1993). Interpretation of optical flow through complex neural network. Proc. of the International Conference on Artificial Neural Networks, (pp. 645–650). New York: Springer. Muezzino ¨ glu, ˇ M. K., Guzeli¸ ¨ s, C., & Zurada, J. M. (2003). A new design method for the complex-valued multistate Hopfield associative memory. IEEE Transactions on Neural Networks, 14(4), 891–899. Nitta, T. (1997). An extension of the back-propagation algorithm to complex numbers. Neural Networks, 10(8), 1391–1415. Nitta, T. (2000). An analysis of the fundamental structure of complex-valued neurons. Neural Processing Letters, 12(3), 239–246. Nitta, T. (2004). Orthogonality of decision boundaries in complex-valued neural networks. Neural Computation, 16, 73–97. Oja, E. (1982). A simplified neuron model as a principal component analyzer. Journal of Mathematics and Biology, 15, 267–273. Oja, E. (1989). Neural networks, principal components, and subspaces. International Journal of Neural System, 1, 61–68. Oja, E. (1992). Principal components, minor components, and linear neural networks. Neural Networks, 5(6), 927–935. Oja, E. (1994). Beyond PCA: Statistical expansions by nonlinear neural networks. Proc. of the International Conference on Artificial Neural Networks, 2, 1049–1054. O’Leary, D. P. (1990). Robust regression computation using iteratively reweighted least squares. SIAM Journal of Matrix Analysis Application, 11(3), 466–480. Palmieri, F., & Zhu, J. (1995). Self-association and Hebbian learning in linear neural networks. IEEE Transactions on Neural Networks, 6(5), 1165–1184. Pandey, R. (2001). Blind equalization and signal separation using neural networks. Unpublished doctoral disseratation, Indian Institute of Technology, Roorkee, India. Plumbley, M. D. (1992). 
Efficient information transfer and anti-Hebbian neural networks. Neural Networks, 6(6), 823–833. Plumbley, M. D. (2003). Algorithms for non-negative independent component analysis. IEEE Transactions on Neural Networks, 14(3), 534–543. Rubner, J., & Tavan, P. (1989). A self-organizing network for principal component analysis. Europhysics Letters, 10(7), 693–698. Sanger, T. D. (1989). Optimal unsupervised learning in a single-layer linear feedforward neural network. Neural Networks, 2(6), 459–473. Schmidt, R. (1986). Multiple emitter location and signal parameter estimation. IEEE Transactions on Antennas and Propagation, 34(3), 276–280. Shirazi, M. N., Peper, F., & Sawai, H. (1999). Principal component analysis by entropylikelihood optimization. Transactions of the Information Processing Society of Japan, 40(10), 3638–3644. Song, W., Yilong, L., & Feng, M. (1998). An adaptive robust PCA neural network. In Proc. of International Joint Conference on Neural Networks (pp. 2288–2293). Piscataway, NJ: IEEE. Spratling, M. W., & Johnson, M. H. (2002). Preintegration lateral inhibition enhances unsupervised learning. Neural Computation, 14(9), 2157–2179.
838
S. Fiori
Sudjianto, A., & Hassoun, M. H. (1995). Statistical basis of non-linear Hebbian learning and application to clustering. Neural Networks, 8(5), 707–715. Thirion Moreau, N., & Moreau, E. (2002). Generalized criterion for blind multivariate signal equalization. IEEE Signal Processing Letters, 9(2), 72–74. Vegas, J. M., & Zufiria, P. J. (2004). Generalized neural networks for spectral analysis: Dynamics and Lyapunov functions. Neural Networks, 17(2), 233–245. Weingessel, A., & Hornik, K. (2000). Local PCA algorithms. IEEE Transactions on Neural Networks, 11(6), 1242–1250. Widrow, B., & Winter, R. (1988). Neural nets for adaptive filtering and adaptive pattern recognition. IEEE Computer Magazine, 21, 25–39. Xu, L. (1992). Least mean square error reconstruction principle for self-organizing neural nets. Neural Networks, 6(5), 627–648. Xu, L., Oja, E., & Suen, C. Y. (1992). Modified Hebbian learning for curve and surface fitting. Neural Networks, 5(3), 441–457. Xu, L., & Yuille, A. L. (1995). Robust PCA by self-organizing rules based on statistical physics approach. IEEE Transactions Neural Networks, 6(1), 131–143. Yan, W.-Y., Helmke, U., & Moore, J. B. (1994). Global analysis of Oja’s flow for neural networks. IEEE Transactions on Neural Networks, 5(5), 674–683. Yang, B. (1995). Projection approximation subspace tracking. IEEE Transactions on Signal Processing, 43(1), 1247–1252. Zufiria, P.J. (2002). On the discrete-time dynamics of the basic Hebbian neural network node. IEEE Transactions on Neural Networks, 13(6), 1342–1352.
LETTER
Communicated by Kechen Zhang
Difficulty of Singularity in Population Coding

Shun-ichi Amari
[email protected]

Hiroyuki Nakahara
[email protected]

Laboratory for Mathematical Neuroscience, RIKEN Brain Science Institute, Wako, Saitama 351-0198, Japan
Fisher information has been used to analyze the accuracy of neural population coding. This works well when the Fisher information does not degenerate, but when two stimuli are presented to a population of neurons, a singular structure emerges by their mutual interactions. In this case, the Fisher information matrix degenerates, and the regularity condition ensuring the Cramér-Rao paradigm of statistics is violated. An animal shows pathological behavior in such a situation. We present a novel method of statistical analysis to understand information in population coding in which algebraic singularity plays a major role. The method elucidates the nature of the pathological case by calculating the Fisher information. We then suggest that synchronous firing can resolve singularity and show a method of analyzing the binding problem in terms of the Fisher information. Our method integrates a variety of disciplines in population coding, such as nonregular statistics, Bayesian statistics, singularity in algebraic geometry, and synchronous firing, under the theme of Fisher information.
1 Introduction

A fundamental concern in systems neuroscience is how information is represented in the brain. Population coding is a widely accepted scheme of representation in which the accuracy of encoding is evaluated by Fisher information (Seung & Sompolinsky, 1993; Zemel, Dayan, & Pouget, 1998; Yoon & Sompolinsky, 1999; Zhang & Sejnowski, 1999; Pouget, Dayan, & Zemel, 2000; Wu, Amari, & Nakahara, 2002a, 2002b; Wu, Nakahara, & Amari, 2001; Nakahara & Amari, 2002; Nakahara, Wu, & Amari, 2001; Wu, Chen, Niranjan, & Amari, 2003). The Fisher information is not only related to the mean square error of an estimator by the Cramér-Rao theorem, but is also an intrinsic quantity that measures how much information is included in the population coding.
Neural Computation 17, 839–858 (2005)
© 2005 Massachusetts Institute of Technology
When two similar stimuli are presented at the same time, their responses interact mutually and cause strange phenomena in both the mathematical structure and the behavior of an animal. We explain this phenomenon by a simple example (Ingle, 1968; Amari & Arbib, 1977). When a frog observes a fly in its visual field, it is represented by an excitation pattern of neurons in the visual field (tectum). The pattern has a peak at the place corresponding to the location of the fly (see Figure 1A). Neural firing fluctuates randomly, so that the amount of information included in such population coding is analyzed by Fisher information. However, when two flies appear at the same time, the response pattern of neural excitation has two peaks at the corresponding locations of the flies (see Figure 1B). When the two flies are very close, the two peaks merge into one, and there remains only a noisy pattern with a single peak (see Figure 1C). What will happen under such a condition? Neuroethological study has observed pathological behavior in that the frog cannot distinguish the two flies and sometimes jumps in between the flies, failing to catch either (Ingle, 1968). Mathematically, the Fisher information matrix degenerates in this case, so that the established Cramér-Rao paradigm of statistics does not hold. Hence, a new framework of statistics is required to analyze such a nonregular or singular situation. Looking at the situation in more detail, we find that algebraic singularity exists in the underlying statistical manifold, which is the main cause of difficulty.

There are many practical methods to resolve the problem. The frog has its own neural algorithm. Many theoreticians use the Bayesian strategy, and biologically plausible algorithms have been proposed (Pouget, Zhang, Deneve, & Latham, 1998; Zemel et al., 1998; Zemel & Pillow, 2002; Sahani & Dayan, 2003).
However, the Bayesian method is not free of the difficulty caused by the singularity, because the performances of Bayesian algorithms are largely affected by the prior distributions in the singular case, which is completely different from the regular case. Computer simulations have been used so far to analyze the estimation errors of proposed algorithms. We need a new theoretical framework to carry out the mathematical analysis, in particular of the limit of accuracy attainable by the optimal method. The singularity also affects the Bayesian method strongly in a different way from the regular case.

This study focuses on such a singular problem of population coding, providing a new framework of analysis. We do not propose a practical procedure to decide the number of stimuli or to decode information giving the estimated locations of stimuli. Instead, our framework is necessary to analyze the behaviors of algorithms mathematically. For this purpose, we must integrate various disciplines related to the scheme of population coding: neural fields (Amari, 1977), degenerate Fisher information in a nonregular model (Amari & Ozeki, 2001; Fukumizu, 2003), algebraic geometrical singularity (Fukumizu & Amari, 2000; Watanabe, 2001a) that causes such degeneration, and the information-geometrical method (Amari & Nagaoka, 2000; Amari, 1998) to elucidate the whole structure. (See also Dacunha-Castelle & Gassiat, 1997; Liu & Shao, 2003; and Burnashev & Amari, 2002, for recent statistical analyses of the log-likelihood ratio and the maximum likelihood estimator in singular statistical models.)

Figure 1: Schematic presentation of firing patterns r(z) in a neural population, where r(z) denotes the firing rate of neurons at location z: given a single stimulus at x (A), given two stimuli at x1 and x2 that are far enough apart (B), and given two stimuli close to each other (C).

We begin with the analysis of the Fisher information of population coding in a neural field. The Fourier domain is useful for analysis when the field is homogeneous. Various properties in population coding, such as the effects of the shape of a tuning curve and the correlation of noises, will be clearly understood in this framework (Wu et al., 2002b). We
summarize the method in section 2, for the case in which a single stimulus is given in a one-dimensional neural field. When two stimuli appear close together at the same time in a one-dimensional field, their mutual interaction becomes very strong. The Fisher information is a matrix in this case, but it degenerates when the two are close to each other. This violates the classic framework of statistics, and one cannot apply the Cramér-Rao theorem because the estimation error grows without limit. It is indeed difficult to define the optimal strategy to judge whether there are two targets or one. It is more difficult to identify the intensity and the location of each target and to analyze rigorously the performances of various algorithms. The difficulty is observed in animal behavior (Ingle, 1968; Treue, Hol, & Rauber, 2000; Reichardt & Poggio, 1975). When the Fisher information degenerates, its inverse diverges to infinity. Therefore, we need a careful mathematical analysis. We demonstrate that the space of parameters to be estimated from the neural excitation pattern has algebraic singularity and analyze its structure in detail. Such singularity causes the degeneration of the Riemannian structure (or the Fisher information) of the space, as a black hole does in the real space-time universe. We analyze how the accuracy of estimation degenerates as the two targets draw near to each other. To this end, the Fisher information is decomposed into regular directions and singular directions, and the accuracy of estimation in the respective directions is analyzed. Such analysis gives new insight into modern statistics. Bayes inference is another method that gives a practical solution even in a singular case. However, we show that there exists a serious problem, because a smooth, positive prior over the parameters induces a singular prior on the singular model. A smooth, positive prior on the space of response functions is not absolutely continuous with respect to a smooth prior on the parameter space.
They are mutually singular. We study such a problem, showing that the same difficulty exists in the Bayesian framework. From the viewpoint of model selection, the number of stimuli is decided from the data by some criterion such as the Akaike information criterion, the Bayesian information criterion, or the minimum description length criterion. However, for all of these model selection procedures, their statistical validity loses its ground in such a singular case, as we explain below. We then ask if there is a way to get rid of this difficulty, finding that this is closely related to the binding problem. We demonstrate how synchronous firing resolves this difficulty by using the Fisher information. The difficulty is remarkably relaxed by synfiring, which resolves the singularity. The synfiring scheme has been studied extensively, but this is the first time it has been analyzed in terms of the Fisher information. We suggest a new method of analyzing a general binding problem in terms of the Fisher information in the framework of information geometry (Amari & Nagaoka, 2000).
2 Fisher Information for a Single Stimulus

We recapitulate the population coding in a neural field, summarizing the analysis of the Fisher information for a single stimulus. Details are found in the work of Wu et al. (2001). Consider a population of neurons located on a line, which we call a one-dimensional neural field, and denote the position of a neuron by a coordinate z. Let r(z) be the firing rate of the neurons around position z. Neurons are excited in response to a stimulus applied from the outside. We consider a set of possible stimuli of which the features are specified by one-dimensional coordinates x. The feature may be the orientation of a bar, the position of an object, or something else. There is a natural one-to-one correspondence between x and z, and a stimulus with feature x is encoded by an activity pattern r(z) over the field such that the neurons at positions z ≈ x are excited strongly, with r(z) having a peak at z ≈ x (see Figure 1). More precisely, let f(z) be a unimodal function having a peak at z = 0; when stimulus x is applied, the expected firing rate of the neurons around z is given by f(z − x). The function f(z) is called the tuning curve, and we use the gaussian function ϕ for f,

\[ \varphi(z) = \frac{1}{\sqrt{2\pi}\,a}\, e^{-z^2/2a^2}, \tag{2.1} \]
for ease of analysis, where a represents the width of the tuning function. Neuronal firing is stochastic in its nature. Given stimulus x, the observed firing pattern r(z; x) is given by

\[ r(z; x) = f(z - x) + \sigma \varepsilon(z), \tag{2.2} \]
where ε(z) denotes random fluctuations with intensity σ. We assume that ε(z) is gaussian with mean 0. Firing of neurons at positions z and z′ is correlated in general, so that we put

\[ \langle \varepsilon(z) \rangle = 0, \tag{2.3} \]
\[ \langle \varepsilon(z)\,\varepsilon(z') \rangle = h(z - z'), \tag{2.4} \]
where ⟨·⟩ denotes expectation and h denotes the covariance function. Here, h(z − z′) decreases to 0 as |z − z′| becomes larger. We assume the following form of the covariance function h(z),

\[ h(z) = (1 - \beta)\,\delta(z) + \beta\, e^{-z^2/2b^2}, \tag{2.5} \]

where δ(z) is the delta function, 1 − β represents the intensity of independent fluctuation proper to each neuron, and b represents the effective range
of correlated fluctuation. This form captures a variety of correlation structures. Our previous articles (Wu et al., 2001, 2002a, 2002b) investigated the properties of the Fisher information for the single stimulus. Although the above h(z) does not include a multiplicative noise structure, it is possible to address such a case with a little modification (Yoon & Sompolinsky, 1999; Eurich & Wilke, 2000; Nakahara & Amari, 2002; Wu, Amari, & Nakahara, 2004), so for simplicity we use only the above form in this study. The probability distribution of the firing pattern r = {r(z)} given x is then written explicitly as

\[ Q(\mathbf{r}\,|\,x) = \frac{1}{Z} \exp\left\{ -\frac{1}{2\sigma^2} \int_{-\infty}^{\infty}\!\!\int_{-\infty}^{\infty} \left[ r(z) - f(z - x) \right] h^*(z - z') \left[ r(z') - f(z' - x) \right] dz\, dz' \right\}, \tag{2.6} \]
where Z is the normalization factor. The function h*(z) is the inverse kernel of the correlation function h, defined by

\[ \int_{-\infty}^{\infty} h^*(z - z')\, h(z' - z'')\, dz' = \delta(z - z''), \]

where δ is the delta function. The Fisher information for the single stimulus x is given by

\[ I_F(x) = E\left[ -\frac{d^2 \ln Q(\mathbf{r}\,|\,x)}{dx^2} \right] = -\int \frac{d^2 \ln Q(\mathbf{r}\,|\,x)}{dx^2}\, Q(\mathbf{r}\,|\,x)\, d\mathbf{r}, \tag{2.7} \]
where E denotes expectation. Simple calculation yields (Wu et al., 2001)

\[ I_F = \frac{1}{\sigma^2} \int\!\!\int f'(z - x)\, h^*(z - z')\, f'(z' - x)\, dz\, dz', \tag{2.8} \]
which does not depend on x because of the translation invariance of the neural field. By using the Fourier transform, I_F is given by

\[ I_F = \frac{1}{2\pi\sigma^2} \int_{-\infty}^{\infty} \frac{\omega^2}{H(\omega)} \exp\left( -a^2 \omega^2 \right) d\omega, \tag{2.9} \]
where H(ω) is the Fourier transform of h(z),

\[ H(\omega) = (1 - \beta) + \beta \sqrt{2\pi}\, b \exp\left( -\frac{b^2 \omega^2}{2} \right). \tag{2.10} \]
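As a quick numerical sanity check (a sketch, not part of the original text; the grid and parameter values are arbitrary choices), equation 2.9 can be evaluated by simple quadrature. For β = 0 the noise is white, H(ω) = 1, and the integral has the closed form I_F = 1/(4√π σ²a³), which the numerical value should reproduce:

```python
import numpy as np

def H(omega, beta, b):
    # Fourier transform of the covariance function h(z), equation 2.10
    return (1 - beta) + beta * np.sqrt(2 * np.pi) * b * np.exp(-(b * omega)**2 / 2)

def fisher_info(a, sigma, beta, b, w_max=30.0, n=600001):
    # Equation 2.9 by quadrature; the integrand decays like exp(-a^2 w^2),
    # so a plain Riemann sum on a wide, fine grid is accurate
    omega = np.linspace(-w_max, w_max, n)
    integrand = omega**2 / H(omega, beta, b) * np.exp(-(a * omega)**2)
    return integrand.sum() * (omega[1] - omega[0]) / (2 * np.pi * sigma**2)

# White noise (beta = 0, H = 1): closed form I_F = 1 / (4 sqrt(pi) sigma^2 a^3)
assert abs(fisher_info(1.0, 1.0, 0.0, 1.0) - 1 / (4 * np.sqrt(np.pi))) < 1e-6
# Correlated noise changes the value, but the Fisher information stays positive
assert fisher_info(1.0, 1.0, 0.5, 2.0) > 0
```

The closed form follows from ∫ω² exp(−a²ω²) dω = √π/(2a³), so the check exercises both equations 2.9 and 2.10 in the regime where the answer is known exactly.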
This form elucidates various characteristics of the Fisher information for a single stimulus (Wu et al., 2001).

3 Fisher Information for Two Stimuli

Fisher information is useful for evaluating the accuracy of population coding. The Cramér-Rao theorem shows that I_F⁻¹ provides a lower bound of the mean square error for the estimated feature or location x̂. The situation changes drastically, however, when two stimuli are presented at the same time. The response of neurons at z, given two stimuli at locations x1 and x2, is modeled by

\[ r(z) = f(z; x_1, x_2) + \sigma \varepsilon(z), \tag{3.1} \]
where f(z; x1, x2) is the tuning curve when two stimuli with features x1 and x2 are presented. Let the intensities of the two stimuli be 1 − v and v, where we assume that the total intensity is regulated to be equal to 1. Then the tuning curve is defined by

\[ f(z; x_1, x_2, v) = (1 - v)\,\varphi(z - x_1) + v\,\varphi(z - x_2), \tag{3.2} \]
where 0 ≤ v ≤ 1. Experimental results (Treue et al., 2000; Martinez-Trujillo & Treue, 2000) suggest this form of tuning curve. The probability distribution Q(r; x1, x2, v) of neural firing is given by equation 2.6, where f(z) is now set as in equation 3.2. In this setting, we search for the Fisher information in population coding concerning the two stimuli, (1 − v, x1) and (v, x2). Since the unknown parameters (x1, x2, v) are three in number, the Fisher information now becomes a 3 × 3 matrix. When the parameters are represented by a vector θ = (x1, x2, v),

\[ I_F(\theta) = -E\left[ \frac{\partial^2 \ln Q(\mathbf{r}\,|\,\theta)}{\partial \theta\, \partial \theta^{\mathsf T}} \right]. \tag{3.3} \]
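The nonidentifiability discussed next can be seen directly in equation 3.2: when the two locations coincide, the mixture no longer depends on the intensity parameter v. A small numerical sketch (not from the paper; the grid and the parameter values 0.2 and 0.8 are arbitrary illustrations):

```python
import numpy as np

def phi(z, a=1.0):
    # Gaussian tuning curve, equation 2.1
    return np.exp(-z**2 / (2 * a**2)) / (np.sqrt(2 * np.pi) * a)

def f_two(z, x1, x2, v):
    # Two-stimulus tuning curve, equation 3.2
    return (1 - v) * phi(z - x1) + v * phi(z - x2)

z = np.linspace(-5.0, 5.0, 1001)

def v_sensitivity(u):
    # Largest change in the response pattern when the intensity parameter v
    # moves from 0.2 to 0.8, for two stimuli separated by u
    return np.max(np.abs(f_two(z, 0.0, u, 0.2) - f_two(z, 0.0, u, 0.8)))

# At u = 0 the pattern is exactly independent of v (v is nonidentifiable) ...
assert v_sensitivity(0.0) < 1e-12
# ... and the dependence on v fades continuously as the stimuli approach
assert v_sensitivity(0.01) < v_sensitivity(0.1) < v_sensitivity(1.0)
```

This is the geometric origin of the degeneracy of the Fisher information matrix analyzed below: near u = 0 the data carry almost no information about v.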
We can calculate the Fisher information matrix I_F in a similar way, and its inverse would give the accuracy of the estimated features of the stimuli. When x1 = x2 = x, the two targets merge into one, and f(z; x, x, v) = ϕ(z − x) does not depend on v, because the two stimuli reduce to a single stimulus. However, when the two targets are close, x1 ≈ x2, a pathological situation occurs: it becomes extremely difficult to estimate v, and at x1 = x2 the parameter v is nonidentifiable. Let us first discuss this difficulty in the case of x1 = x2 for ease of understanding. The Fisher information matrix indeed degenerates there, and its inverse diverges. Therefore, we cannot apply the Cramér-Rao paradigm of statistics in this case. Such a pathological statistical model is said to be nonregular and has
been outside the scope of the standard theory of statistics, except for some pioneering work (Weyl, 1939; Hotelling, 1939) and recent developments (Dacunha-Castelle & Gassiat, 1997; Liu & Shao, 2003; Fukumizu, 2003). The analysis in the pathological case is the main topic of this letter. We also show how the singularity affects the Bayesian method and model selection.

The pathological situation is elucidated by algebraic geometry, because algebraic singularities exist in the parameter space M = {(x1, x2, v)} of two stimuli (see Figure 2A). First, when x1 = x2 = x, the tuning curve is the same whatever v is, so that the points on a vertical line,

\[ l_x : x_1 = x_2 = x, \quad v \text{ arbitrary}, \tag{3.4} \]

have the same effect on the neural field (see Figure 2A, red bold vertical line). Second, when v = 1 (v = 0), there is no x1 (x2) stimulus, so that x1 (x2) cannot be decided. Hence, the points on the line l_{x,1}: v = 1, x1 arbitrary, and on the line l_{x,0}: v = 0, x2 arbitrary, also have the same effect (see Figure 2A, red bold horizontal lines). Thus, the points in the parameter space that have the same effect are summarized as the same point in the set M̃ consisting of the resultant probability distributions. Mathematically, we divide M by the equivalence relation of the same effect. The parameter space M then collapses along those critical lines, and the resultant reduced M̃ has algebraic singularity, as shown in Figure 2B. The Fisher information matrix degenerates on these critical points.

We now introduce a new method of analysis, focusing on the algebraic singularity. We begin by reparameterizing the features of the stimuli as

\[ u = x_2 - x_1, \qquad w = (1 - v)\,x_1 + v\,x_2. \]

Here, u indicates the difference between the locations of the two stimuli, and w indicates their center of gravity. Using these, we can rewrite the tuning curve as

\[ f(z; u, w, v) = (1 - v)\,\varphi(z - w + vu) + v\,\varphi(z - w - (1 - v)u). \tag{3.5} \]
With these parameters (w, u, v), we evaluate the accuracy of coding in the neighborhood of u ≈ 0. Without loss of generality, we rescale the z-axis to let a = 1, where x1, x2, and u should be rescaled, as well as the width b of the correlation. Using the Chebyshev-Hermite expansion of equation 3.5 with respect to u, we get

\[ f(z; u, w, v) = \sum_{n=0}^{\infty} \frac{c_n(v)}{n!}\, u^n H_n(z - w)\, \varphi(z - w), \]
Figure 2: (A) Parameter space M = {(x1, x2, v)}. The critical set consists of three surfaces with red lines. The three red thicker lines correspond to sets of parameters that are equivalent to a single stimulus and give the same probability distribution. (B) The reduced space M̃, where each equivalent set of parameters is reduced to one point. (C) Regular (u, v)-plane, where the part u < 0, v = 0.5 + v0 is equivalent to u > 0, v = 0.5 − v0 by interchanging x1 and x2. (D) The (u, v)-plane embedded in the regular space (θ2, θ3) by a cusp-type singular map. Note that the line u = 0 in C (the light blue line) is mapped to a point in D.
where c_n(v) ≡ (1 − v)(−v)ⁿ + v(1 − v)ⁿ and the H_n(z) are the Chebyshev-Hermite polynomials. We approximate f(z; u, w, v) by taking the summation up to n = 3, having

\[ f(z; u, w, v) = \left\{ 1 + \frac{1}{2} c_2(v)\, u^2 H_2(z - w) + \frac{1}{6} c_3(v)\, u^3 H_3(z - w) \right\} \varphi(z - w), \]
\[ c_0 = 1, \quad c_1 = 0, \quad c_2 = v(1 - v), \quad c_3 = -v(1 - v)(2v - 1), \]
\[ H_0(z) = 1, \quad H_1(z) = z, \quad H_2(z) = z^2 - 1, \quad H_3(z) = z^3 - 3z. \]
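The third-order truncation can be checked numerically against the exact two-peak curve of equation 3.5 (a sketch, not from the paper; the grid and the values w = 0, v = 0.3 are arbitrary). Since the first omitted term is of order u⁴, halving u should shrink the residual by roughly a factor of 16:

```python
import numpy as np

def phi(z):
    # Unit-width gaussian tuning curve (a = 1 after rescaling)
    return np.exp(-z**2 / 2) / np.sqrt(2 * np.pi)

def f_exact(z, u, w, v):
    # Two-peak tuning curve, equation 3.5
    return (1 - v) * phi(z - w + v * u) + v * phi(z - w - (1 - v) * u)

def f_approx(z, u, w, v):
    # Third-order Chebyshev-Hermite truncation
    c2 = v * (1 - v)
    c3 = -v * (1 - v) * (2 * v - 1)
    zz = z - w
    H2 = zz**2 - 1
    H3 = zz**3 - 3 * zz
    return (1 + 0.5 * c2 * u**2 * H2 + (1.0 / 6) * c3 * u**3 * H3) * phi(zz)

z = np.linspace(-6.0, 6.0, 2001)
err = lambda u: np.max(np.abs(f_exact(z, u, 0.0, 0.3) - f_approx(z, u, 0.0, 0.3)))

assert err(0.1) < 1e-4          # residual is tiny for small separations
assert err(0.05) < err(0.1) / 10  # and decays roughly like u^4
```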
Let us now introduce the set of new parameters θ = (θ1, θ2, θ3) in which f is written as

\[ f(z; \theta) = \left\{ 1 + \theta_2 H_2(z - \theta_1) + \theta_3 H_3(z - \theta_1) \right\} \varphi(z - \theta_1), \tag{3.6} \]

and consider a new statistical model S parameterized by θ in which the probability distributions are given by equation 2.6 and f is given by equation 3.6. Our original statistical model M is parameterized by ξ = (w, u, v) and is included in S by the mapping

\[ \theta = \left( w,\ \frac{c_2(v)\, u^2}{2},\ \frac{c_3(v)\, u^3}{6} \right). \tag{3.7} \]
The mapping 3.7 from ξ to θ is singular, because its Jacobian determinant,

\[ |J| = \left| \frac{\partial \theta}{\partial \xi} \right|, \tag{3.8} \]
vanishes at u = 0. We adopt the strategy of calculating the Fisher information I_F^S(θ) of S, which is regular in the coordinates θ. The singular I_F(ξ) of M is given through the singular mapping 3.7. This method elucidates the singular structure of M. We first evaluate I_F^S(θ) around θ2 = θ3 = 0, which corresponds to u = 0. The elements of I_F^S(θ) = (I_kl) are calculated as

\[ I_{kl} = \frac{1}{\sigma^2} \int\!\!\int \partial_k f(z; \theta)\, h^*(z - z')\, \partial_l f(z'; \theta)\, dz\, dz', \]

where ∂_k f = (∂/∂θ_k) f. In terms of the Fourier transform, they are

\[ I_{kl} = \frac{1}{2\pi\sigma^2} \int F_k(\omega)\, H^*(\omega)\, F_l(\omega)\, d\omega, \tag{3.9} \]

where F_k(ω) = ∂F(ω)/∂θ_k and F(ω) is the Fourier transform of f(z). The Fisher information matrix is explicitly calculated as

\[ I_F^S(\theta) = \begin{pmatrix} g_1 & 0 & -g_2 \\ 0 & g_2 & 0 \\ -g_2 & 0 & g_3 \end{pmatrix}, \]
where

\[ g_k = \frac{1}{2\pi\sigma^2} \int \frac{\omega^{2k} \exp\left( -\omega^2 \right)}{H(\omega)}\, d\omega > 0. \]
For ease of further study, we modify the parameters from θ to θ*,

\[ \theta_1^* = \theta_1 + \frac{g_2}{g_3}\, \theta_3, \tag{3.10} \]
\[ \theta_2^* = \theta_2, \qquad \theta_3^* = \theta_3. \tag{3.11} \]
Then the Fisher information matrix is diagonalized,
\[ I_F^S(\theta^*) = \begin{pmatrix} g_1 & 0 & 0 \\ 0 & g_2 & 0 \\ 0 & 0 & \bar{g}_3 \end{pmatrix}, \]

where \(\bar{g}_3 = g_3 - g_2^2/g_1 > 0\). It is now easy to obtain the Fisher information matrix I_F(ξ) in terms of the parameters ξ = (w, u, v), which represent the features of the stimuli. Let J be the Jacobian of the transformation from ξ to θ*,

\[ J = \frac{\partial \theta^*}{\partial \xi} = \begin{pmatrix} 1 & \frac{1}{2} g_2 g_3^{-1} c_3 u^2 & \frac{1}{6} g_2 g_3^{-1} c_3' u^3 \\[4pt] 0 & c_2 u & \frac{1}{2} c_2' u^2 \\[4pt] 0 & \frac{1}{2} c_3 u^2 & \frac{1}{6} c_3' u^3 \end{pmatrix}, \tag{3.12} \]
where c_i′ = dc_i(v)/dv. Then the Fisher information I_F(ξ) is obtained from

\[ I_F(\xi) = J^{\mathsf T}\, I_F^S(\theta^*)\, J. \tag{3.13} \]
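Equation 3.13 makes the degeneracy concrete: det I_F(ξ) = |J|² det I_F^S(θ*), and by equation 3.12, |J| is proportional to u⁴, so the determinant vanishes like u⁸. A small numerical sketch (not from the paper; the values g1, g2, g3 are arbitrary positive stand-ins for the integrals g_k, chosen only so that I_F^S is positive definite):

```python
import numpy as np

def fisher_xi(u, v, g1=2.0, g2=0.5, g3=1.0):
    # I_F(xi) = J^T I_F^S(theta*) J, equations 3.12 and 3.13.
    # g1, g2, g3 are arbitrary stand-ins for the integrals g_k.
    c2, c2p = v * (1 - v), 1 - 2 * v                        # c2 and dc2/dv
    c3, c3p = 2 * v**3 - 3 * v**2 + v, 6 * v**2 - 6 * v + 1  # c3 and dc3/dv
    J = np.array([
        [1.0, 0.5 * (g2 / g3) * c3 * u**2, (g2 / g3) * c3p * u**3 / 6],
        [0.0, c2 * u,                      0.5 * c2p * u**2],
        [0.0, 0.5 * c3 * u**2,             c3p * u**3 / 6],
    ])
    I_FS = np.diag([g1, g2, g3 - g2**2 / g1])  # diagonalized I_F^S(theta*)
    return J.T @ I_FS @ J

# |J| is proportional to u^4, so halving u divides det I_F by 2^8 = 256,
# and at u = 0 the determinant vanishes outright.
d1, d2, d0 = (np.linalg.det(fisher_xi(u, 0.3)) for u in (0.2, 0.1, 0.0))
assert d1 > d2 > 0 and abs(d0) < 1e-15
assert abs(d2 / d1 - 1 / 256) < 1e-6
```

The u⁸ scaling of the determinant is exactly why the inverse (the Cramér-Rao bound) blows up as the two stimuli approach each other.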
4 Algebraic Singularity and Accuracy of Estimation

Although the Fisher information I_F^S(θ*) is regular, I_F(ξ) is singular at u = 0, because the Jacobian J of the transformation is singular at u = 0, |J| = 0. In other words, the present model M is embedded in a regular space as a singular subset M̃, and the singularity cannot be resolved by coordinate transformation. Let us highlight the role of the singular mapping. Since the structure of the field is homogeneous along the w-axis, we focus on the mapping between (u, v) and

\[ (\theta_2^*, \theta_3^*) = \left( \frac{v(1 - v)\, u^2}{2},\ \frac{v(1 - v)(1 - 2v)\, u^3}{6} \right). \tag{4.1} \]
Figure 2C shows the u-v plane, where the v-axis (indicated by a thick blue vertical line in the figure) corresponding to u = 0 degenerates to a single point in the θ-plane, causing the singularity. Horizontal (black) and vertical (red) lines simply indicate the grids of v and u values. Figure 2D shows the regular θ2 -θ3 plane in which the grids of (u, v) in Figure 2C are embedded.
Note that Figure 2C corresponds to a slice cut vertically from Figure 2A. Now consider the line v = 1/2 + v_0 (solid horizontal blue line) in Figure 2C. It corresponds to stimuli of intensity 1/2 + v_0, whose distance u changes arbitrarily. Note that when u < 0, it corresponds to stimuli of intensity 1/2 − v_0, where stimulus 1 and stimulus 2 are interchanged. Hence, we consider the two blue horizontal lines as one component. The line is mapped to a cusp line (the black or blue pair) in Figure 2D. The line v_0 = 0 corresponds to the horizontal line θ_3^* = 0 in the right figure, and as v_0 increases, the angle of the corresponding cusp becomes larger. As v_0 tends to 1/2, the cusp shrinks and reduces to the origin. This gives rise to the singularity corresponding to v = 0 or 1, another type of singularity, which we do not address in this article. Note that the space of θ^* itself corresponds to a regular statistical model with a nondegenerate Fisher information matrix. The space of ξ is embedded in the θ^* space by the mapping with the singularity of the cusp structure (see Figure 2D).

Let ξ̂ = (ŵ, û, v̂) be the maximum likelihood estimator. The errors Δŵ = ŵ − w, Δû = û − u, Δv̂ = v̂ − v are given by

\Delta\hat\xi = \frac{\partial \xi}{\partial \theta^*}\, \Delta\hat\theta^* = J^{-1} \Delta\hat\theta^*.   (4.2)
The covariance matrix of the error is given by I_F^{-1}(ξ), but the inversion of I_F is complicated, so we evaluate the error directly by using equation 3.13. Since the three components of Δθ̂^* are independently distributed, we obtain, by evaluating J^{-1} carefully, the asymptotic evaluation of the variances of the errors Δξ̂:

V(\Delta\hat w) \approx \frac{g_3}{g_1 \bar g_3},   (4.3)
V(\Delta\hat u) \approx k_1 u^{-4} + k_2 u^{-2},   (4.4)
V(\Delta\hat v) \approx k_3 u^{-6},   (4.5)

where

k_1 = 36\,(1 - 2v)^2\, v^{-2} (1 - v)^{-2}\, \bar g_3^{-1},   (4.6)
k_2 = 4\,(6v^2 - 6v + 1)^2\, v^{-2} (1 - v)^{-2}\, g_2^{-1},   (4.7)
k_3 = 144\, \bar g_3^{-1}.   (4.8)
When v = 1/2, k_1, the coefficient of u^{-4}, vanishes. The above results highlight the pathological nature of estimation:
- For w, the Fisher information is of order 1 even when u → 0. Hence, one can estimate the center of gravity of the two stimuli accurately.
- When v ≠ 1/2, the variance of Δû diverges to infinity in the order of 1/u^4 as u → 0. The estimation of v is worse, because its variance diverges in the order of 1/u^6. This shows the difficulty of estimating them accurately.
- When v = 1/2, that is, when the intensities of the two stimuli are equal, the dominant term in equation 4.4 vanishes (k_1 = 0), so that the variance of Δû is of order 1/u^2. Hence, estimation is carried out more accurately in this situation.
Turning to the estimation of the original locations x_1 and x_2, we have

\hat x_1 = \hat w - \hat v \hat u,   (4.9)
\hat x_2 = \hat w + (1 - \hat v)\hat u,   (4.10)

so that terms of O(1/u^6) dominate their error variances. In conclusion, one can estimate the center of gravity accurately, but it is very difficult to estimate the relative intensities and the difference of locations, or the respective locations x_1 and x_2, as u tends to 0. We have obtained the orders of divergence, u^{-4} and u^{-6}. A more rigorous mathematical analysis for v = 1/2 is given in Burnashev and Amari (2002), using the technique of large deviations.
5 Model Selection and Singularity

We have discussed the behavior of the maximum likelihood estimator (MLE) when two stimuli are presented. When the two are located close together, the estimation of the two locations is difficult. One may decide the number of stimuli (one or two in the present case) after estimating the parameters, but one may also decide the number before estimation, based on the encoded pattern r, and then perform estimation. We apply two different statistical models, one including a single stimulus and the other including two stimuli; by comparing the results, one can decide the number. This is the problem of model selection.

There are two typical methods frequently used in model selection. One is the Akaike information criterion (AIC), which evaluates the Kullback-Leibler divergence between the true distribution and the distributions estimated under the two models. The other is the Bayesian information criterion (BIC), which evaluates the posterior probabilities of the two models. The minimum description length (MDL) criterion gives practically the same criterion as the BIC, although the underlying philosophy is different. One may apply these criteria in the singular case, obtaining a practical answer for deciding the number of stimuli. However, all the criteria (AIC, BIC, MDL) are based on the statistical analysis of the behavior of the MLE under each model. In these analyses, the estimator is assumed to be asymptotically
subject to the gaussian distribution with covariance matrix equal to the inverse of the Fisher information matrix. Since the Fisher information is degenerate in the present singular case, the MLE is not subject to the gaussian distribution even asymptotically. One should therefore be very careful in applying these criteria to hierarchical models that include singularities, because no mathematical justification exists.
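For reference, the two criteria themselves are one line each. The sketch below (ours; the log-likelihoods and parameter counts are invented purely for illustration) compares a hypothetical one-stimulus fit against a two-stimulus fit; it assumes the regular asymptotics that, as just argued, are not justified at the singularity:

```python
import math

def aic(loglik, n_params):
    # Akaike information criterion: AIC = -2 log L + 2k
    return -2 * loglik + 2 * n_params

def bic(loglik, n_params, n_obs):
    # Bayesian information criterion: BIC = -2 log L + k log n
    return -2 * loglik + n_params * math.log(n_obs)

# Invented fitted log-likelihoods: a one-stimulus model (3 parameters)
# and a two-stimulus model (5 parameters), fit to n_obs = 200 samples.
n_obs = 200
ll_one, k_one = -310.0, 3
ll_two, k_two = -305.0, 5

# The model with the smaller criterion value is selected; with these
# numbers, AIC prefers two stimuli while the heavier BIC penalty
# prefers one, showing how the two criteria can disagree.
print(aic(ll_one, k_one), aic(ll_two, k_two))
print(bic(ll_one, k_one, n_obs), bic(ll_two, k_two, n_obs))
```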
6 Bayesian Inference and Singularity

Bayesian inference presumes an adequate prior distribution over the parameters of the model. One may use a parametric family of priors, including hyperparameters, thus introducing a hierarchical structure of Bayesian inference. Given observed data r, one can calculate the posterior probability of the parameters under the model,

p(\xi|r) = \frac{\pi(\xi)\, p(r|\xi)}{p(r)},   (6.1)

where π(ξ) is the prior distribution and p(r) is the marginal distribution of r. When one wants to combine this with model selection, one presumes a number of models M_1, M_2, ..., and the posterior is

p(\xi_i, M_i|r) = \frac{\pi(M_i)\, \pi(\xi_i|M_i)\, p(r|\xi_i)}{p(r)},   (6.2)
where π(M_i) is the prior of model M_i with parameters ξ_i. One may use the marginal posterior probability p(M_i|r) for model selection (BIC). The MAP estimator is the one that maximizes the posterior distribution. It is a point estimator, and it is known that the MAP estimator includes the same amount of Fisher information as the MLE. The Bayesian predictive distribution is obtained by using the posterior distribution of the parameters to give a distribution of new data. The predictive distribution does not belong to the model, but it is optimal when the prior is true. In general, it gives a good estimator, provided the prior is not extreme. The amount of Fisher information included in it is larger than that included in the MAP estimator or the MLE, but the difference is small, or of only higher order, as can be revealed by calculating the curvature direction of the ancillary statistics (Komaki, 1996). When a smooth, positive prior is used, all the Bayesian estimators are asymptotically the same in a regular statistical model. One choice is the Jeffreys noninformative prior, which uses the square root of the determinant of the Fisher information matrix; its optimality is guaranteed by information theory (Clarke & Barron, 1990). Hence, the performance of the
Bayesian inference is asymptotically guaranteed to be good, and if one can choose an adequate prior based on prior knowledge, its performance is better. However, when one chooses an extreme prior, say, one diverging to infinity at some points or vanishing on some region, the result of the inference is biased and strongly affected by the chosen prior.

When the statistical model is singular, as our model is, a serious problem arises. The Jeffreys prior converges to 0 in the critical region of parameters; hence, it is an extreme prior. One usually assumes a smooth prior, in particular the uniform prior, on the model M. Instead of the parameter space M, we may consider the space M̃ of behaviors, which is the space of all tuning functions f(z, ξ). A smooth prior on M̃ looks very natural to assume, but it becomes singular on M, because the critical set of M corresponds to an infinite number of copies of the same probability distributions in M̃. Therefore, the uniform (or smooth) prior on M counts a probability distribution belonging to M̃ infinitely many times. If we assume the uniform (or smooth) prior on the reduced model M̃ of behaviors, then such multiplicity disappears, as it does for the Jeffreys prior. However, such a prior becomes 0 on the critical region of the original parameter space. Therefore, the prior becomes singular in a singular statistical model.

There has been little study of Bayesian inference in singular statistical models except for the algebraic geometry theory of Watanabe (2001a, 2001b, 2001c). We can observe that the uniform prior favors a smaller model (one stimulus in our case), so that it automatically suppresses larger models, avoiding overfitting. When the true distribution is in the smaller model, it gives a better performance. Watanabe and his colleagues have studied Bayesian inference in singular models by using algebraic geometry, which we do not recapitulate here.
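The degeneracy of the Jeffreys prior follows directly from equation 3.13: since I_F(ξ) = J^T I_{FS}(θ^*) J, the prior density √det I_F(ξ) = |det J| √det I_{FS}(θ^*) vanishes wherever |J| = 0, that is, on the critical set u = 0. A numeric sketch (ours; g_1 = g_2 = 0.5, g_3 = 1 are arbitrary placeholder values):

```python
import math

def jeffreys_density(u, v, g1=0.5, g2=0.5, g3=1.0):
    """Unnormalized Jeffreys prior sqrt(det I_F(xi)) = |det J| sqrt(det I_FS),
    using the Jacobian of equation 3.12 with c2 = v(1-v), c3 = v(1-v)(1-2v)."""
    c2 = v * (1 - v)
    c2p = 1 - 2 * v                  # dc2/dv
    c3 = c2 * c2p                    # c3(v) = v(1-v)(1-2v)
    c3p = 1 - 6 * v + 6 * v * v      # dc3/dv
    # det J reduces to the lower-right 2x2 block (first column is (1,0,0)).
    detJ = (c2 * u) * (c3p * u**3 / 6) - (0.5 * c2p * u**2) * (0.5 * c3 * u**2)
    gbar3 = g3 - g2**2 / g1
    return abs(detJ) * math.sqrt(g1 * g2 * gbar3)

# The density is of order u^4, so the Jeffreys prior vanishes
# exactly on the critical region u = 0.
print(jeffreys_density(1.0, 0.3), jeffreys_density(1e-3, 0.3))
```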
7 Synfiring Solution

How can the brain resolve the difficulty in the singular situation? We propose a possible solution within the scheme of synchronous firing. Given two stimuli representing different objects, the synfire hypothesis (Abeles, 1991; Diesmann, Gewaltig, & Aertsen, 1999; von der Malsburg, 1999; Singer, 1999) posits that the responses due to one object are synchronized, while those due to the other fire synchronously in another phase. For simplicity, we assume there are only two phases, phase 1 and phase 2, and that object 1 excites neurons in phase 1 with degree α and in phase 2 with the complementary degree ᾱ = 1 − α. For object 2, α and ᾱ are interchanged. Hence, the responses of the neural field in phases 1 and 2 are, respectively,

r_1(z) = \alpha \bar v\, \varphi(z - x_1) + \bar\alpha v\, \varphi(z - x_2) + \sigma \varepsilon_1(z),   (7.1)
r_2(z) = \bar\alpha \bar v\, \varphi(z - x_1) + \alpha v\, \varphi(z - x_2) + \sigma \varepsilon_2(z),   (7.2)
where v̄ = 1 − v and the ε_i(z) are independent noises. The degree of synchrony is represented by α; there is no synchrony when α = 1/2. For

\tilde r = (r_1(z), r_2(z)),   (7.3)

the probability of r̃ is the product of Q(r_1(z)) and Q(r_2(z)) of equation 2.6,

Q(\tilde r) = Q(r_1)\, Q(r_2),   (7.4)

because r_1 and r_2 are independent, since ε_1 and ε_2 are independent. Hence, the Fisher information is the sum of those of phases 1 and 2. We calculate the Fisher information I_{F1}(ξ) included in r_1 directly at u = 0. It is given by

I_{F1}(\xi) = \begin{pmatrix} (\alpha\bar v + \bar\alpha v)^2 g_1 & -(\alpha - \bar\alpha)(\alpha\bar v + \bar\alpha v)\, v\bar v\, g_1 & 0 \\ -(\alpha - \bar\alpha)(\alpha\bar v + \bar\alpha v)\, v\bar v\, g_1 & (\alpha - \bar\alpha)^2 (v\bar v)^2 g_1 & 0 \\ 0 & 0 & (\alpha - \bar\alpha)^2 g_0 \end{pmatrix}.   (7.5)

Its determinant vanishes, so that r_1 alone carries singular information. The I_{F2}(ξ) of phase 2 is obtained by interchanging α and ᾱ in I_{F1}. The total Fisher information is

I_F(\xi) = I_{F1}(\xi) + I_{F2}(\xi) = \begin{pmatrix} c\, g_1 & (\alpha - \bar\alpha)^2 v\bar v (v - \bar v)\, g_1 & 0 \\ (\alpha - \bar\alpha)^2 v\bar v (v - \bar v)\, g_1 & 2(\alpha - \bar\alpha)^2 (v\bar v)^2 g_1 & 0 \\ 0 & 0 & 2(\alpha - \bar\alpha)^2 g_0 \end{pmatrix},   (7.6)

where c = (α^2 + ᾱ^2)(v^2 + v̄^2) + 4αᾱvv̄. This is nonsingular even at u = 0, except for the case α = ᾱ = 1/2. The effect of synchrony is dramatic even when the degree of synchrony is weak: the population code is no longer singular, keeping sufficient information to distinguish the two objects. We next discuss a possible mechanism for inducing synchrony.
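A numeric sketch of the synchrony effect (ours; it assumes the entries of equation 7.5 as read from the garbled print, with placeholder values g_0 = g_1 = 1): each phase alone is singular, but the sum is regular whenever α ≠ 1/2.

```python
import numpy as np

def fisher_phase(alpha, v, g0=1.0, g1=1.0):
    """Fisher information of one phase at u = 0 (equation 7.5);
    the other phase is obtained by swapping alpha and 1 - alpha."""
    ab, vb = 1 - alpha, 1 - v
    a12 = -(alpha - ab) * (alpha * vb + ab * v) * v * vb * g1
    return np.array([
        [(alpha * vb + ab * v)**2 * g1, a12,                              0.0],
        [a12,                           (alpha - ab)**2 * (v * vb)**2 * g1, 0.0],
        [0.0,                           0.0,          (alpha - ab)**2 * g0],
    ])

def fisher_total(alpha, v):
    """Total information I_F = I_F1 + I_F2 (equation 7.6)."""
    return fisher_phase(alpha, v) + fisher_phase(1 - alpha, v)

# One phase alone carries singular information (determinant 0) ...
print(np.linalg.det(fisher_phase(0.8, 0.3)))
# ... but with synchrony (alpha != 1/2) the total is regular at u = 0,
# while without synchrony (alpha = 1/2) the singularity remains.
print(np.linalg.det(fisher_total(0.8, 0.3)))
print(np.linalg.det(fisher_total(0.5, 0.3)))
```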
8 Remarks on Binding Different Features A more general aspect of the binding problem is discussed in terms of correlated or synchronous firing. Let an object O have two different
features, say, location and orientation, represented by analog features x and x′, respectively. In a typical example of the binding problem, x represents the shape parameter and x′ the color of an object. Assume that two objects O_1 and O_2 are presented, with respective features (x_1, x_1′) and (x_2, x_2′).

If there exists a neural field representing the features (x, x′) jointly, then when one object (x, x′) is applied, the tuning function f(z, z′) in the two-dimensional field with coordinates (z, z′) has a peak at the (x, x′) of the presented object, and the neural response is its noisy version,

r(z, z') = f(z - x, z' - x') + \sigma \varepsilon(z, z').

When two objects are presented simultaneously, the response is

r(z, z') = (1 - v)\,\varphi(z - x_1, z' - x_1') + v\,\varphi(z - x_2, z' - x_2') + \sigma \varepsilon(z, z').   (8.1)

Hence, r has two peaks, at (x_1, x_1′) and (x_2, x_2′), and there is no difficulty in binding x_i with x_i′, provided the two peaks are well separated. When they are close, the above type of singularity occurs.

It is not plausible to assume the existence of a field combining any two features. In general, there will be two different fields, Z and Z′, one representing x and the other x′. Given two objects, two peaks are aroused in field Z, at x_1 and x_2, and two peaks in field Z′, at x_1′ and x_2′, in the respective response patterns:

r(z) = f(z; x_1, x_2) + \sigma \varepsilon(z),   (8.2)
r'(z') = f(z'; x_1', x_2') + \sigma \varepsilon'(z').   (8.3)

The problem is to find which peak, x_1 or x_2, is bound to which peak, x_1′ or x_2′. Synchronization is a solution, as we have analyzed. Synchrony is generated by the correlated firing of neurons. A simple mechanism causing pairwise and higher-order correlations has already been given (Amari, Nakahara, Wu, & Sakai, 2003). We extend that model to be applicable to our case.

Here, we assume spiking neurons in a neural field Z. Let p(z) be the potential of the neuron at z; the neuron fires when p(z) exceeds a threshold. The potential is given by the tuning function disturbed by fluctuating noises,

p(z) = \frac{1}{2}\varphi(z - x_1) + \frac{1}{2}\varphi(z - x_2) + \sigma \varepsilon(z),   (8.4)

where the noise term is decomposed as

\varepsilon(z) = \varphi(z - x_1)\,\varepsilon_1 + \varphi(z - x_2)\,\varepsilon_2 + \varepsilon_0(z).   (8.5)
Here, ε_1, ε_2, and ε_0(z) are independent gaussian random variables. The noise terms ε_1 and ε_2 are multiplicative, representing the Poisson character of spiking neurons. When ε_1, ε_2, and ε_0 are temporally independent (or their temporal correlations decay to 0 quickly), we have a model of spiking neurons. The potential in field Z′ has a similar form,

p'(z') = (1 - v)\,\varphi(z' - x_1') + v\,\varphi(z' - x_2') + \sigma \varepsilon'(z'),   (8.6)

where the noise in field Z′ is also decomposed as

\varepsilon'(z') = \varphi(z' - x_1')\,\varepsilon_1 + \varphi(z' - x_2')\,\varepsilon_2 + \varepsilon_0'(z').   (8.7)
It is plausible that the noises ε_1 and ε_2 originate from the objects O_1 and O_2, respectively, and are common to fields Z and Z′. It was shown in Amari et al. (2003) that synchrony emerges in such a model. This can solve the binding problem. A more detailed analysis using the Fisher information is part of our future program.

9 Discussion

This article investigated the Fisher information when two stimuli are presented together. Since ordinary statistical analysis fails in this case because of the degeneracy of the Fisher information, we introduced a novel method using algebraic singularity, and this method allowed us to examine not only the Fisher information but also the algebraic structure of the degeneracy. We analyzed the asymptotic property of how the information decays as the locations of the two stimuli approach each other, although we do not propose any decoding algorithms. We then searched for a possible neural mechanism to resolve the singularity. This is given by the synchronization of neural firing in population coding. We analyzed how the Fisher information behaves under synchronization. We finally discussed a possible mechanism for generating synchronization in relation to the binding problem. Our method synthesizes techniques from various disciplines.

Acknowledgments

H.N. is supported by Grants-in-Aid 14780614 and 14017106 from the MEXT, Japan.

References

Abeles, M. (1991). Corticonics. Cambridge: Cambridge University Press.
Amari, S. (1977). Dynamics of pattern formation in lateral-inhibition type neural fields. Biological Cybernetics, 27, 77–87.
Amari, S. (1998). Natural gradient works efficiently in learning. Neural Computation, 10, 251–276.
Amari, S., & Arbib, M. A. (1977). Competition and cooperation in neural nets. In J. Metzler (Ed.), Systems neuroscience (pp. 119–165). San Diego: Academic Press.
Amari, S., & Nagaoka, H. (2000). Methods of information geometry. New York: AMS and Oxford University Press.
Amari, S., Nakahara, H., Wu, S., & Sakai, Y. (2003). Synchronous firing and higher-order interactions in neuron pool. Neural Computation, 15, 127–142.
Amari, S., & Ozeki, T. (2001). Differential and algebraic geometry of multilayer perceptrons. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, E84-A(1), 31–38.
Burnashev, M. V., & Amari, S. (2002). On some singularities in parameter estimation problems. Problems of Information Transmission, 38, 328–346.
Clarke, B. S., & Barron, A. R. (1990). Information theoretic asymptotics of Bayes methods. IEEE Transactions on Information Theory, 36, 453–471.
Dacunha-Castelle, D., & Gassiat, E. (1997). Testing in locally conic models, and application to mixture models. Probability and Statistics, 1, 285–317.
Diesmann, M., Gewaltig, M.-O., & Aertsen, A. (1999). Stable propagation of synchronous spiking in cortical neural networks. Nature, 402, 529–533.
Eurich, C. W., & Wilke, S. D. (2000). Multidimensional encoding strategy of spiking neurons. Neural Computation, 12(7), 1519–1529.
Fukumizu, K. (2003). Likelihood ratio of unidentifiable models and multilayer neural networks. Annals of Statistics, 31, 833–851.
Fukumizu, K., & Amari, S. (2000). Local minima and plateaus in hierarchical structures of multilayer perceptrons. Neural Networks, 13(3), 317–327.
Hotelling, H. (1939). Tubes and spheres in n-spaces, and a class of statistical problems. American Journal of Mathematics, 61, 440–460.
Ingle, D. (1968). Visual releasers of prey-catching behavior in frogs and toads. Brain, Behavior and Evolution, 1, 500–518.
Komaki, F. (1996). On asymptotic properties of predictive distributions. Biometrika, 83, 299–313.
Liu, X., & Shao, Y. (2003). Asymptotics for likelihood ratio tests under loss of identifiability. Annals of Statistics, 31(3), 807–832.
Martínez-Trujillo, J. C., & Treue, S. (2000). Attention modulates tuning curves from two stimuli inside the receptive field of MT/MST cells in the macaque monkey. Society for Neuroscience Abstracts, 26, 251.16.
Nakahara, H., & Amari, S. (2002). Attention modulation of neural tuning through peak and base rate in correlated firing. Neural Networks, 15, 41–55.
Nakahara, H., Wu, S., & Amari, S. (2001). Attention modulation of neural tuning through peak and base rate. Neural Computation, 13, 2031–2048.
Pouget, A., Dayan, P., & Zemel, R. (2000). Information processing with population codes. Nature Reviews Neuroscience, 1, 125–132.
Pouget, A., Zhang, K., Deneve, S., & Latham, P. E. (1998). Statistically efficient estimation using population coding. Neural Computation, 10, 373–401.
Reichardt, W., & Poggio, T. (1975). A theory of the pattern induced flight orientation of the fly, Musca domestica, II. Biological Cybernetics, 18, 69–80.
Sahani, M., & Dayan, P. (2003). Doubly distributional population codes: Simultaneous representation of uncertainty and multiplicity. Neural Computation, 15, 2255–2279.
Seung, H. S., & Sompolinsky, H. (1993). Simple models for reading neuronal population codes. Proceedings of the National Academy of Sciences USA, 90, 10749–10753.
Singer, W. (1999). Neuronal synchrony: A versatile code for the definition of relations? Neuron, 24, 49–65.
Treue, S., Hol, K., & Rauber, H. J. (2000). Seeing multiple directions of motion: Physiology and psychophysics. Nature Neuroscience, 3, 270–276.
von der Malsburg, C. (1999). The what and why of binding: The modeler's perspective. Neuron, 24, 95–104.
Watanabe, S. (2001a). Algebraic analysis for non-identifiable learning machines. Neural Computation, 13, 899–933.
Watanabe, S. (2001b). Algebraic geometrical methods for hierarchical learning machines. Neural Networks, 14(8), 1049–1060.
Watanabe, S. (2001c). Algebraic information geometry for learning machines with singularities. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 329–336). Cambridge, MA: MIT Press.
Weyl, H. (1939). On the volume of tubes. American Journal of Mathematics, 61, 461–472.
Wu, S., Amari, S., & Nakahara, H. (2002a). Asymptotic behaviors of population codes. Neurocomputing, 44–46, 697–702.
Wu, S., Amari, S., & Nakahara, H. (2002b). Population coding and decoding in a neural field: A computational study. Neural Computation, 14, 999–1026.
Wu, S., Amari, S., & Nakahara, H. (2004). Information processing in a neuron ensemble with the multiplicative correlation structure. Neural Networks, 17(2), 205–214.
Wu, S., Chen, D., Niranjan, M., & Amari, S. (2003). Sequential Bayesian decoding with a population of neurons. Neural Computation, 15(5), 993–1012.
Wu, S., Nakahara, H., & Amari, S. (2001). Population coding with correlation and an unfaithful model. Neural Computation, 13, 775–797.
Yoon, H., & Sompolinsky, H. (1999). The effect of correlations on the Fisher information of population codes. In M. S. Kearns, S. A. Solla, & D. A. Cohn (Eds.), Advances in neural information processing systems, 11 (pp. 167–173). Cambridge, MA: MIT Press.
Zemel, R. S., Dayan, P., & Pouget, A. (1998). Probabilistic interpretation of population codes. Neural Computation, 10, 403–430.
Zemel, R. S., & Pillow, J. (2002). A probabilistic network model of population responses. In R. P. N. Rao, B. A. Olshausen, & M. S. Lewicki (Eds.), Probabilistic models of the brain: Perception and neural function (pp. 223–242). Cambridge, MA: MIT Press.
Zhang, K., & Sejnowski, T. J. (1999). Neural tuning: To sharpen or broaden? Neural Computation, 11, 75–84.
Received November 24, 2003; accepted August 31, 2004.
LETTER
Communicated by Laurence Abbott
Neurons Tune to the Earliest Spikes Through STDP Rudy Guyonneau
[email protected] Centre de Recherche “Cerveau et Cognition,” Toulouse 31000, France, and Spikenet Technology, Revel, France
Rufin VanRullen
[email protected] Centre de Recherche “Cerveau et Cognition,” Toulouse 31000, France
Simon J. Thorpe
[email protected] Centre de Recherche “Cerveau et Cognition, Toulouse 31000, France, and Spikenet Technology, Revel 31250, France
Spike timing-dependent plasticity (STDP) is a learning rule that modifies the strength of a neuron's synapses as a function of the precise temporal relations between input and output spikes. In many brain areas, temporal aspects of spike trains have been found to be highly reproducible. How will STDP affect a neuron's behavior when it is repeatedly presented with the same input spike pattern? We show in this theoretical study that repeated inputs systematically lead to a shaping of the neuron's selectivity, emphasizing its very first input spikes, while steadily decreasing the postsynaptic response latency. This was obtained under various conditions of background noise, and even under conditions where spiking latencies and firing rates, or synchrony, provided conflicting information. The key role of first spikes demonstrated here provides further support for models using a single wave of spikes to implement rapid neural processing.
1 Introduction

Activity-dependent learning at the systems level relies on dynamic modifications of synaptic strength at the cellular level. Experimental data have shown that these modifications depend on the temporal pairing between a presynaptic and a postsynaptic spike: an excitatory synapse receiving a spike shortly before a postsynaptic spike is emitted is potentiated, while the reverse ordering weakens it (Markram, Lubke, Frotscher, & Sakmann, 1997). The amount of modification depends on the delay between these two events: maximal when pre- and postsynaptic spikes are close together, the effects gradually decrease

Neural Computation 17, 859–879 (2005)
© 2005 Massachusetts Institute of Technology
and disappear with intervals in excess of a few tens of milliseconds (Bi & Poo, 1998; Zhang, Tao, Holt, Harris, & Poo, 1998; Feldman, 2000). Learning at the neuronal level thus heavily depends on the temporal structure of neuronal responses. Interestingly, in many brain areas, the temporal precision of spikes during stimulus-locked responses can be in the millisecond range. The evidence is clear in both the auditory cortex (Heil, 1997) and the somatosensory system (Petersen, Panzeri, & Diamond, 2001). Reproducible temporal structure can also be found in the visual system, such as in MT (Bair & Koch, 1996), and from retinal ganglion cells to the inferotemporal cortex (Sestokas, Lehmkuhle, & Kratz, 1991; Meister & Berry, 1999; Liu, Tzonev, Rebrik, & Miller, 2001; Richmond & Optican, 1990; Victor & Purpura, 1996; Nakamura, 1998). This raises the question of how a neuron with STDP will respond when repeatedly stimulated with the same pattern of input spikes. Let us consider a simple case where a neuron is exposed to an input spike pattern
Figure 1: Facing page. Single wave experiment. For a repeated spike wave under realistic conditions, a neuron learns to react faster to its target. Synaptic weights converge onto the earliest firing afferents, even with 25 ms jitter or 50 Hz background activity. (A) Typical incoming activity. Bottom: raster plot of a jittered spike wave amid spontaneous activity. At each presentation, 5 ms jitter and 5 Hz background activity are regenerated; the resulting pattern is presented to the postsynaptic neuron (with its initial potential set to 0); when it spikes, the STDP learning rule is applied and its potential is reset to 0 before going on to the next presentation. Prior to presentation, the presynaptic neurons do not fire any spikes. Top: the corresponding poststimulus time histogram (PSTH); the gaussian form in the left-most part corresponds to the reproduced spike wave. (B) Dynamics of repeated STDP. Top: sum of all synaptic weights at each presentation (the dashed line represents the output neuron's threshold). The sum of the synaptic weights stored in the afferents stabilizes at the threshold value. Bottom: the horizontal axis corresponds to the number of presentations (i.e., the learning step). The black line refers to the left axis and shows the reduction of postsynaptic latency during the course of learning. The background image refers to the right axis, where each synapse weight is mapped to a gray-level index (see the corresponding bar on the right). Synapses are ordered by the spiking latency of the corresponding neuron within the original reproducible input pattern (here, a wave), that is, before spontaneous activity is superimposed or jitter added. This order is decided a priori and stays fixed for the whole simulation. During learning, the earliest synapses become fully potentiated and later ones are weakened. (C) Effect of jitter. Jitter is generated from a gaussian distribution. Increasing its standard deviation does not affect convergence until about 10 ms; from there, it slows the system roughly quadratically. (D) Effect of spontaneous firing rate. Increasing background activity slows convergence approximately linearly.
Neurons Tune to the Earliest Spikes Through STDP
861
consisting of one input spike on each afferent synapse, with arbitrary but fixed (e.g., gaussian-distributed) latencies. For one given presentation of the input pattern, the input spikes elicit a postsynaptic response, triggering the STDP rule. Synapses carrying input spikes just preceding the postsynaptic one are potentiated, while later ones are weakened. The next time this input pattern is presented, the firing threshold will be reached sooner, which implies a slight decrease of the postsynaptic spike latency. Consequently, the learning process, while depressing some synapses it had previously potentiated, will now reinforce different synapses carrying even earlier spikes than the time before. By iteration, it follows that when the presentation of the same input spike pattern is repeated, the postsynaptic spike latency will tend to stabilize at a minimal value while the first synapses become fully potentiated and later ones fully depressed (see Figure 4 in Song et al., 2000, as well as Gerstner & Kistler, 2002a, for related demonstrations; see also Figures 1A and 1B).

Under these simplistic conditions, neurons can learn to react faster to a given stimulus pattern by emphasizing the role of the earliest firing inputs. However, input spike patterns in the brain do not consist of one-spike-per-neuron waves of infinite precision. Not only do stimulus-locked responses occur with a jitter on the order of 1 or more milliseconds (Mainen
& Sejnowski, 1995), but neocortical cells also spontaneously fire at rates up to 20 Hz in awake animals (Evarts, 1964; Hubel, 1959; Steriade, Oakson, & Kitsikis, 1978). Finally, temporal structure in spike trains is often distributed over numerous consecutive spikes in long time windows. Under realistic conditions, would the effect of STDP still emphasize the very first afferents? Would this be the case even when other types of neural codes (firing rate, synchrony) are present within the spike trains? This would have critical implications for our understanding of neural information coding and processing. In this theoretical study, we seek to answer these questions through a range of simulations using biologically plausible neuron models.

2 Spike Waves

A target neuron is repeatedly presented with an input spike pattern. In a first example, it consists of a single asynchronous spike wave: one spike for each synapse, with gaussian-distributed latencies (50 ms mean, 20 ms width). At each presentation, spike times are jittered, and Poisson spontaneous activity is added to the spike trains. We varied the amount of jitter and spontaneous firing and investigated their effect on the target neuron's learning behavior.

2.1 Methods.

2.1.1 Integrate and Fire. The postsynaptic neuron is connected to a thousand afferent neurons via as many synapses and integrates spikes across time (in all simulations except one, no leakage was involved). Systematically starting at a resting level of 0 before each presentation, it sums the weights of the synapses that carry each incoming spike. The presynaptic neurons are assumed to be quiet prior to presentation; they do not fire any spikes. When reaching its threshold (100 for all simulations), the postsynaptic neuron fires, triggers an STDP-inspired learning rule for its first action potential only, and then sets its potential back to the resting level of 0.

2.1.2 STDP Model.
The learning function F has the prototypical form of STDP according to electrophysiological results in cultured hippocampal neurons (Bi & Poo, 1998). The amount of synaptic modification arising from a single pair of pre- and postsynaptic spikes separated by time t is expressed as follows:

F(t) = A+ · (1 − (t/τ+))    if τ+ ≤ t ≤ 0,
F(t) = −A− · (1 − (t/τ−))   if 0 < t ≤ τ−,
F(t) = 0                     otherwise,

where A+ and A− determine the maximum amounts of synaptic modification, which occur when t is close to zero (A+ = A− = 1 in all simulations). τ+ and τ− are the temporal windows for, respectively, potentiation and
Neurons Tune to the Earliest Spikes Through STDP
863
depression, expressed in milliseconds (here, τ+ = −20 ms and τ− = 22 ms). Synaptic growth from learning step n to n + 1 is computed as follows:

w_i(n + 1) = w_i(n) + F(t_pre − t_post),

where w_i(n) is the "free" weight of synapse i at the nth presentation step, t_post the postsynaptic spike time, and t_pre the presynaptic spike time. Under these conditions, synaptic weights may grow to infinitely large values. To address this obstacle, we implemented a sigmoidal saturation function g:

W_i = g(w_i), where g(x) = ((π/2) + atan(x))/π.

The "free" weight, labeled w_i, corresponds to the unconstrained weight of the synapse connecting input neuron i to the postsynaptic one. It ranges from −∞ to +∞ and is the target of the strengthening/weakening process. The "effective" weights, W_i, lie in the ]w_min, w_max[ interval (here w_min = 0, w_max = 1). These "effective" weights are the ones considered when calculating the excitatory postsynaptic potential.1

2.1.3 Jitter. When needed, each reproducible spike was displaced in time by a value chosen randomly at each presentation from a gaussian distribution of mean 0 and standard deviation set to the expected jitter.

2.1.4 Spontaneous Activity. Spontaneous activity is generated using a Poisson process as described in section 3.1. It is redrawn for each afferent neuron, at each presentation, before being superimposed on the jittered reproducible structure to constitute the incoming activity.

2.1.5 Convergence Criterion. Convergence is met when the local average of postsynaptic latencies, in a ±5-step window, remains within 1 ms for 100 consecutive steps (the reported convergence is the first of these 100 steps). We also checked that the synaptic weights were indeed selected on the basis of their earliest inputs.
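The rule of section 2.1.2 is compact enough to write down directly. The following sketch (ours; variable names are our own) implements the STDP window F and the saturation g with the stated constants (A+ = A− = 1, τ+ = −20 ms, τ− = 22 ms):

```python
import math

TAU_PLUS, TAU_MINUS = -20.0, 22.0   # potentiation / depression windows (ms)
A_PLUS, A_MINUS = 1.0, 1.0          # maximum modification amplitudes

def stdp(t):
    """STDP window F(t) for t = t_pre - t_post (ms)."""
    if TAU_PLUS <= t <= 0:
        return A_PLUS * (1 - t / TAU_PLUS)      # potentiation, max at t = 0
    if 0 < t <= TAU_MINUS:
        return -A_MINUS * (1 - t / TAU_MINUS)   # depression, max near t = 0
    return 0.0

def g(x):
    """Sigmoidal saturation mapping a free weight to the (0, 1) interval."""
    return (math.pi / 2 + math.atan(x)) / math.pi

def update(w_free, t_pre, t_post):
    """One learning step on the free weight; the effective weight is g(w)."""
    return w_free + stdp(t_pre - t_post)

# A presynaptic spike 5 ms before the postsynaptic one is rewarded,
# 5 ms after is punished; far outside the window nothing happens.
print(stdp(-5.0), stdp(5.0), stdp(40.0))
```

Note how the saturation keeps both expected reward and punishment near 0 at the bounds, the property the footnote contrasts with the multiplicative and additive update rules.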
1 Note that here, contrary to other update rules (Rubin, Lee, & Sompolinsky, 2001), when the effective weight approaches the upper and lower bounds, the expected amounts of both reward and punishment (in the “effective” weight domain) tend toward 0. In the multiplicative update case (Song et al., 2000), these amounts stay balanced against each other: when the weight reaches a bound, for example, wmax, reward goes to 0 whereas punishment is maximal, and vice versa. For the additive update rule (Cateau & Fukai, 2003), both potentiation and depression are independent of the weight; if an update results in a synaptic weight outside the bounds, the weight is clipped to the boundary value.
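A minimal Python sketch of this update rule follows. The paper quotes A± = 1 and the 20 ms / 22 ms time constants; the exponential shape of the window F is an assumption here (the standard form used by Song, Miller, & Abbott, 2000, with positive time constants), while the saturation g is exactly the one given above:

```python
import math

A_PLUS = A_MINUS = 1.0             # maximal modifications (as in all simulations)
TAU_PLUS, TAU_MINUS = 20.0, 22.0   # potentiation / depression windows (ms)

def stdp(dt):
    """F(t_pre - t_post): assumed exponential STDP window.
    dt <= 0 (pre before post) potentiates; dt > 0 depresses."""
    if dt <= 0:
        return A_PLUS * math.exp(dt / TAU_PLUS)
    return -A_MINUS * math.exp(-dt / TAU_MINUS)

def g(x):
    """Sigmoidal saturation: maps the free weight w in (-inf, +inf)
    to the effective weight W in ]0, 1[ (here wmin = 0, wmax = 1)."""
    return (math.pi / 2 + math.atan(x)) / math.pi
```

A free weight of 0 maps to an effective weight of 0.5, and repeated potentiation drives g(w) asymptotically toward wmax = 1 without ever clipping, which is the point of the saturation.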
2.2 Resistance to Jitter and Spontaneous Activity. A typical example with 5 ms jitter and 5 Hz spontaneous activity (see Figure 1A) shows that the target neuron learns to react faster to the input pattern (here, a spike wave) by selectively reinforcing the earliest-firing afferents (see Figure 1B). Initially, synaptic weights are set so as to evoke the first postsynaptic response once the entire reproducible pattern, the spike wave, has been integrated (in the present case, around 120 ms after stimulus onset). Synapses are then progressively reinforced and subsequently weakened, shortening the postsynaptic neuron’s latency from one step to the next. This is especially visible between steps 20 and 100, where this codependent dynamic appears as a “bright crest” moving from the latest to the earliest afferents. When the postsynaptic neuron latency stabilizes, the STDP rule is systematically applied at approximately the same time: its rewarding part constantly affects the same synapses, those receiving the earliest of the reproducible spikes, thus driving them to their maximal strengths. Symmetrically, later spikes invariably arrive in the weakening part of the STDP window: the corresponding synapses are continuously depressed. This trend can also be observed in the evolution of the total synaptic weight of the neuron at a given step (see Figure 1B, top). It corresponds to the potential the model neuron would reach if each of its synapses were hit by one spike only. Initially, it starts from a value well below the threshold needed for the postsynaptic neuron to fire a spike. Here, the presence of spontaneous activity allows the neuron to reach its threshold and trigger the STDP rule for the first time. As the learning process goes on, the sum of synaptic weights increases, due to the increasing number of spikes (and thus of synapses) falling in the rewarded part of the STDP window; it tends toward the neuron’s threshold value and stabilizes around this level.
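The presentation pipeline of sections 2.1.3–2.1.5 (gaussian jitter, superimposed Poisson spontaneous activity, and the convergence test) can be sketched as follows; the interspike-interval rule isi = −ln(r)/u is the one given in section 3.1, and the function names and list-per-afferent layout are our assumptions:

```python
import math
import random

def poisson_spike_train(rate_hz, duration_ms, rng):
    """Section 3.1 rule: each interspike interval is isi = -ln(r)/u,
    with r uniform in ]0, 1[ and u the expected rate."""
    if rate_hz <= 0.0:
        return []
    rate_per_ms = rate_hz / 1000.0
    t, spikes = 0.0, []
    while True:
        t += -math.log(1.0 - rng.random()) / rate_per_ms  # 1 - r lies in ]0, 1]
        if t >= duration_ms:
            return spikes
        spikes.append(t)

def present(reproducible, jitter_ms, spont_hz, duration_ms, rng):
    """One presentation (sections 2.1.3-2.1.4): jitter every reproducible
    spike (gaussian, sd = jitter_ms), then superimpose freshly drawn
    spontaneous activity on each afferent."""
    incoming = []
    for train in reproducible:                     # one spike list per afferent
        jittered = [t + rng.gauss(0.0, jitter_ms) for t in train]
        noise = poisson_spike_train(spont_hz, duration_ms, rng)
        incoming.append(sorted(jittered + noise))
    return incoming

def converged(latencies, window=5, tol_ms=1.0, run=100):
    """Section 2.1.5 criterion: the local (+/- window) average of postsynaptic
    latencies stays within tol_ms over `run` consecutive steps; returns the
    first step of the stable run, or None."""
    avg = [sum(latencies[max(0, i - window):i + window + 1]) /
           len(latencies[max(0, i - window):i + window + 1])
           for i in range(len(latencies))]
    for start in range(len(avg) - run + 1):
        seg = avg[start:start + run]
        if max(seg) - min(seg) <= tol_ms:
            return start
    return None
```

With jitter and spontaneous rate both set to 0, `present` returns the reproducible structure unchanged, which is the control condition used in sections 2.2 and 3.2.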
Since the maximum synaptic weight is fixed at 1.0, the neuron can reach a state where the first N spikes suffice to make it fire (where N is the output neuron’s threshold). The neuron will fire early only if it receives more or less the same pattern as the one learned during the first moments of incoming activity. In fact, it comes to respond faster to one precise sequence of spikes than to any other (see Figure 5). Simulations showed that this convergent behavior is very robust. When jitter is increased (with spontaneous activity set to 0), the number of presentations needed for convergence stays the same until the amount of jitter reaches 10 ms; it then increases roughly quadratically (see Figure 1C). Thus, for plausible values, jitter has little or no effect on the outcome of repeated STDP. Moreover, convergence of the synaptic weights onto the earliest afferents could even be obtained with jitter in the range of 20 to 25 ms, that is, as wide as the input spike time distribution itself. When spontaneous activity is increased (in the 5–50 Hz range, without any jitter), convergence is also slowed but is nonetheless obtained even at
the highest rate (see Figure 1D). Note that in this case, the input spike wave represents less than 15% of the total spikes. As only the 10% of synapses carrying the earliest spikes are selected, this result implies that the STDP rule is able to focus exclusively on less than 2% of the spikes, those whose timing is reproducible and early, while discarding the rest.

3 Spike Trains

When a single wave of spikes is the reproducible structure, it is clear that convergence to a state in which weights are heavily concentrated on the earliest-firing inputs is very robust. However, it could be argued that the STDP rule focuses on the left-most part of the train only because it is the only part that contains information from one pattern presentation to the next (indeed, the right-most part contains only randomly generated spikes).

3.1 Uniform Spike Distribution. The main simulation uses an input spike pattern in the form of 500 ms–long spike trains where the temporal structure and the amount of information are statistically homogeneous across time and across afferent neurons. More specifically, spike trains were generated according to a Poisson process. Each of the 1000 excitatory afferents emits a given spike train where each interspike interval, isi, was drawn according to a Poisson rule depending on the expected rate of the train, u:

isi(r) = −ln(r)/u,
where r is a random value chosen from a uniform distribution on the interval ]0.0, 1.0[. The reproducible structure for the simulation illustrated in Figure 2 is defined once and for all using this method (u = 20 Hz for all afferent neurons). Note that we are not suggesting that neuronal responses are completely stochastic, since reproducibility implies the contrary (see, e.g., Meister & Berry, 1999). We need this only to ensure that no a priori assumptions are made as to how neural information is encoded, thanks to the homogeneous nature of Poisson processes. This basic pattern is reproduced on each presentation, undergoing 5 ms jitter and mixed with 5 Hz spontaneous activity (see Figure 2A). The initial synaptic weights are set so as to elicit the first postsynaptic response after approximately 400 ms. As in the previous simulation, latency decreases steadily from this initial value to stabilize after 1500 presentations. Conjointly, synapses carrying the first spikes become fully potentiated and later ones are fully depressed (see Figure 2B, bottom). The sum of synaptic weights displays the same dynamic as in the spike wave experiment (see section 2.2): going from a very low level, it rises until it reaches the minimal value needed to make the postsynaptic neuron fire on a single volley of
Figure 2: Spike trains experiment. (A) Typical incoming activity. The input spike pattern consists of arbitrary 500 ms–long spike trains, where the amount of information is equally distributed over time. At each presentation, this reproducible structure is modified using 5 ms jitter and 5 Hz spontaneous activity, then presented to the postsynaptic neuron, whose potential is set to 0. Prior to presentation, the presynaptic neurons do not fire any spike. Due to the random aspect of the input design (Poisson-inspired process; see section 3.1), latencies and firing rates in the reproducible input structure were quite variable, ranging, respectively, from 0 to ≈350 ms (mean of ≈52 ± 51 ms, standard deviation) and from 4 to 44 Hz (mean 19.6 ± 6.3 Hz, standard deviation). (B) Dynamics of repeated STDP. Here again, latency decreases and stabilizes. Synapses are selected on the basis of their first-spike timings: the earliest ones are fully potentiated and the latest are weakened. The sum of synaptic weights follows the same behavior as in the spike wave experiment: it stabilizes around the output neuron’s threshold value (see Figure 1B).
spikes. At this point, the system has converged: synaptic potential flattens around a value corresponding to the postsynaptic neuron’s threshold (see Figure 2B, top). Thus, the selection of the earliest spikes through STDP is obtained even when late parts of the spike trains carry the same amount of information.

3.2 Latency versus Firing Rates and Synchrony. Note that no assumptions were made in the previous simulation as to how neurons encode information within the spike trains. Firing rates (Gerstner, Kreiter, Markram, & Herz, 1997; Shadlen & Newsome, 1998) or synchrony (Abeles, 1991) are widely believed to support the neural code. In this regard, one may think that these principles should drive the effects of STDP. Indeed, it has been argued that competition for control of the postsynaptic response would thus be won by the most correlated inputs (Song, Miller, & Abbott, 2000), or by
Figure 3: Latency versus rate. (A) Typical incoming activity. The shorter the latency, the fewer spikes the input neuron emits. According to the spike train design, the tail of the PSTH contains more spikes than its head. A jitter of 5 ms is applied to each spike timing, and no spontaneous activity is superimposed on the reproducible input structure so as to rigorously control the rates of afferents. Activity is then presented to the postsynaptic neuron, whose potential is set to 0. Prior to presentation, the presynaptic neurons do not fire any spike. (B) Dynamics of repeated STDP. The same trend emerges: although receiving fewer spikes, synapses hit by the earliest trains are finally fully potentiated. Inversely, strongly firing inputs will be neglected because they fire late.
the most vigorously firing ones (Gerstner & Kistler, 2002b). However, this study points toward first-spike timing as a determining factor, emphasizing the role of temporal asynchrony in neural information coding (Gautrais & Thorpe, 1998). In the brain, short latencies are generally associated with the highest firing rates, which in turn often result in high temporal correlations. This makes it difficult to disentangle the respective influences of these aspects of the spike train on neuronal learning. In our simulations, however, we can ask how these different aspects fare against one another by artificially defining input spike patterns in which first-spike timing is pitted against average firing rate or amount of synchrony. In the next simulation, latencies and rates have been artificially opposed: the afferent neuron with the shortest latency fires at the slowest rate over the entire window, the latest one at the highest rate, and neurons in between display a gradual latency-to-rate trade-off. As in the main experiment, spike times are jittered at each presentation (but spontaneous activity has been removed so as to rigorously control the firing rate of each input neuron, ranging from 4 to 44 Hz; see Figure 3A). The result is clear, as the same trend emerges (see Figure 3B). This means that a synapse receiving a very high firing rate will not be retained by STDP if it does not also correspond to one of the shortest latencies.
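The latency-to-rate trade-off just described can be sketched as follows. The paper does not give the exact construction, so the linear latency ramp, its half-window span, and the function name are all our assumptions; only the 4–44 Hz rate range and the opposition between latency and rate come from the text:

```python
import math
import random

def latency_vs_rate_pattern(n=1000, duration_ms=500.0,
                            rate_lo=4.0, rate_hi=44.0, seed=0):
    """Hypothetical sketch of the section 3.2 design: afferent i's first-spike
    latency grows across the population while its Poisson rate grows from
    rate_lo to rate_hi, so the earliest afferent is also the slowest."""
    rng = random.Random(seed)
    pattern = []
    for i in range(n):
        frac = i / (n - 1)
        latency = frac * (duration_ms / 2)            # assumed linear ramp
        rate_per_ms = (rate_lo + frac * (rate_hi - rate_lo)) / 1000.0
        t, spikes = latency, [latency]                # first spike at `latency`
        while True:
            t += -math.log(1.0 - rng.random()) / rate_per_ms  # Poisson ISI
            if t >= duration_ms:
                break
            spikes.append(t)
        pattern.append(spikes)
    return pattern
```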
Figure 4: Latency versus synchrony. (A) Typical incoming activity. The spike pattern was designed so as to oppose latency to synchrony: the shorter the latency, the shorter the synfire chain. Spontaneous activity of 5 Hz has been superimposed on the reproducible input structure and no jitter is simulated so as to rigorously keep the synchrony within the spike trains. Activity is then presented to the postsynaptic neuron whose potential is set to 0. Prior to presentation, the presynaptic neurons do not fire any spike. (B) Dynamics of repeated STDP. Once again, postsynaptic latency decreases before stabilizing, and the earliest synapses become fully potentiated while later ones are weakened. Thus, being highly correlated is not sufficient for inputs to be selected by STDP; they also have to fire with one of the shortest latencies.
In the final simulation, latency and amount of synchrony are artificially opposed: the longer the latency, the more neurons are made to share the same spike train (hence defining longer synfire chains; Abeles, 1991). The size of these correlated groups gradually extends from 1 to 54 synchronously firing units. Here, 5 Hz spontaneous activity is added as usual, while jitter is removed so as to obtain truly synchronous waves (see Figure 4A). Once again, the results are clear: synapses carrying the very first input spikes are selected by STDP, whereas highly correlated inputs with late latencies are depressed (see Figure 4B). Note the small and eventually vanishing streaks of synaptic potential, a necessary consequence of the fact that synchronous inputs are always reinforced or depressed together by the STDP rule. In conclusion, when latency and synchrony are inversely correlated, the selection of the potentiated synapses still depends on the timing of the first reproducible spikes alone.

3.3 Selectivity Measures. Convergence of synaptic weights onto the earliest afferents through STDP is now established, underlining the importance of first-spike timing. But why do synaptic weights distribute in such a
remarkable way? The first obvious answer is the reduction of postsynaptic spike latency, as displayed in all the simulations. But having a neuron respond fast to a given event is not an interesting feature if the neuron is not also selective to it. STDP has been shown to explain the development of direction selectivity in recurrent cortical networks, where excitatory and inhibitory synapses were modified according to their spiking activity (Rao & Sejnowski, 2000); there, selectivity measures were based on whether the postsynaptic neuron fires. How should one define selectivity within the present, inhibition-less framework? Any conventional measure based on the postsynaptic firing rate is ruled out, since the postsynaptic response is limited to its first spike (see sections 2.1 and 4). Under these conditions, we propose to use firing latency as a viable measure of selectivity: if a neuron spikes faster to the input pattern it learned than to any other one, it behaves, if not selectively, at least differentially toward it. First, we generated 50,000 distractor spike trains using the same method (Poisson-inspired spike trains at 20 Hz; see section 3.1) as for the target pattern (see Figure 2A); a priori, these can be considered equivalent to the target in terms of average latency of the spike trains, spike count, correlations, and so forth. At each learning step of the main simulation (see Figure 2B), these distractor patterns were tested for postsynaptic latency on the learning neuron and compared to 1000 responses to the target pattern, under the same conditions of background noise (5 ms jitter and 5 Hz spontaneous activity). These distractor and target presentations did not trigger the learning rule: the postsynaptic latency was measured but did not give rise to any synaptic modifications. A threshold was set around the mean response time to targets.
Target spike trains yielding responses before the threshold were counted as hits and distractor ones as false alarms. Selectivity (d′) was computed as follows: d′ = z(hit rate) − z(false alarm rate), where z(p) stands for the inverse of the normal cumulative distribution function evaluated at p (Green & Swets, 1966). The expected maximum was computed using an expected hit rate of 50% and a false alarm rate of 1/(2n), where n is the number of distractor patterns (here, n = 50,000). The results (see Figure 5) show that while the neuron was initially less likely to respond to the target pattern than to arbitrary distractors (due to its initially random weight distribution), it became highly selective to its target as learning developed: after about 1500 presentations, when the postsynaptic neuron latency and weights have indeed stabilized (see Figure 2B), not one of the distractor patterns could make the neuron fire sooner than the target did.
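The d′ computation above is a standard signal-detection quantity and can be reproduced directly with the inverse normal CDF from the Python standard library:

```python
from statistics import NormalDist

def d_prime(hit_rate, false_alarm_rate):
    """d' = z(hit) - z(fa), z being the inverse of the normal
    cumulative distribution function (Green & Swets, 1966)."""
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(false_alarm_rate)

# Expected maximum under the paper's convention:
# 50% hits and 1/(2n) false alarms with n = 50,000 distractors.
n = 50_000
expected_max = d_prime(0.5, 1 / (2 * n))
```

Since z(0.5) = 0, the expected maximum reduces to −z(1/100,000), a little above 4, which matches the ceiling visible as the dashed line in Figure 5.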
Figure 5: Selectivity of the efferent neuron. The dashed line is the expected maximum value for d′ (under the present conditions), gray dots stand for d′ values at each learning step, and the black curve is a local average of these. The neuron becomes steadily more selective to the target pattern used for training. Initially, it tends to fire slightly more slowly to it than to the 50,000 spike trains used as distractors. After 1500 presentations, when its response latency stabilizes (see Figure 2B), the postsynaptic neuron reacts to its target before any of the distractors. Note that some observations landed above the expected maximum value for d′: in certain conditions (100% hit rate and/or 0% false alarms), the observed d′ can rise above the expected maximum computed for a 50% hit rate and 1 in 100,000 false alarms.
In conclusion, a neuron exhibiting STDP will come to respond faster to a precise repeated pattern than to any comparable one, thus becoming more selective to it, at least in the terms proposed here.

4 Discussion

The temporally asymmetric STDP rule reinforces the last synapses involved in making a neuron fire. At first sight, one might think that its repeated application would invariably reinforce afferents at the tail of the spike sequence firing a neuron. Instead, we have demonstrated that it focuses, in a remarkably robust manner, on the head of the spike sequence: dynamical STDP tunes a neuron to synapses transmitting reproducible spikes with the shortest latencies, thus enabling the cell to respond faster and selectively to the input
it has learned. This result is obtained for arbitrary structures of the input spike pattern. This STDP tuning is achieved by selecting a relatively small subset of afferents, here fully determined by the model’s output threshold. In that sense, it is worth noting that only 10 to 40 fully potentiated excitatory inputs, among a possible 10,000 or so, would be enough to evoke a response (Shadlen & Newsome, 1994). The main assumption of the demonstration is that spike times within the input pattern must show a certain degree of reproducibility (which can withstand fairly high amounts of jitter and spontaneous activity). This is compatible with experimental observations in numerous brain structures (Heil, 1997; Petersen et al., 2001; Bair & Koch, 1996; Sestokas et al., 1991; Liu et al., 2001; Richmond & Optican, 1990; Victor & Purpura, 1996; Nakamura, 1998). This repetition of similar input patterns could quite simply result from multiple exposures to the same stimulus at different times in life. Alternatively, we propose that a single stimulus exposure could result in a sequence of similar processing waves through the rhythmic activity of cortical oscillations (Hopfield, 1995). Note that in both situations, one would not expect the state of the system to be the same from one step to another. In particular, the efferent neuron potential would be unlikely to have the same resting level every time the input pattern is presented. We therefore performed simulations in which baseline fluctuations were included. As usual, the typical incoming activity was of the same kind as in the main simulation: the reproducible input structure consisted of 1000 spike trains generated using a Poisson-inspired process (see section 3.1). At every step, a 5 ms jitter was applied to each precise spike time; then 5 Hz spontaneous activity was superimposed before the whole was presented to the neuron.
But instead of being systematically reset to 0, the postsynaptic neuron potential was set according to a gaussian distribution of mean 0 and standard deviation 50. Prior to presentation, the presynaptic neurons do not fire any spike. An identical convergence was reached, though less rapidly (see Figure 6). For simplicity, all of the simulations so far have been conducted with a single postsynaptic spike, no matter what the activity is afterward: a single STDP learning rule was applied at each input pattern presentation. While this may seem a controversial simplification, experimental evidence suggests that the effect of the late spikes can be neglected (Froemke & Dan, 2002). As Tsodyks (2002) commented, “The main effect of STDP is well explained by the first pair of spikes, with the additional spike having only a marginal contribution.” The first presynaptic spike is also the most relevant in evoking an excitatory postsynaptic potential (EPSP), because of synaptic depression (Thomson & Deuchars, 1994), a trend that is reinforced by synaptic redistribution, whereby a synapse depresses even more when potentiated (Markram & Tsodyks, 1996; see below for more details). Nonetheless, for completeness, we investigated the effects of having multiple postsynaptic spikes that repeatedly triggered the STDP mechanism. Here we implemented a potential
Figure 6: Fluctuating potential. In this simulation, the initial efferent neuron potential was set—just before the cell is presented with the incoming activity— according to a gaussian distribution of mean 0 and standard deviation 50. Typical incoming activity is of the same kind as in the main experiment: a reproducible structure of spike trains amid 5 ms jitter and 5 Hz spontaneous activity (see Figure 2A). Prior to presentation, the presynaptic neurons do not fire any spike. Due to the fluctuation, postsynaptic neuron latencies were highly variable; the black line thus depicts a smoothed version (using a moving average of width 101 steps; ± standard deviation, dotted lines). Here, the fluctuation results in a delay of the reinforcement process applied to the reproducible structure: from step 0 to 5000, the crest-ascending motion is not obvious. But once some synapses are strong enough (i.e., displayed in white on the figure), the modification sequence starts until only the earliest firing inputs remain with large weights and the postsynaptic neuron latency has decreased to reach a stable, steady state. The sum of synaptic weights follows the same behavior as in the spike wave experiment: it stabilizes around the output neuron’s threshold value (see Figure 1B). While convergence needs more time to be reached (about 10 times more than for the original simulation; see Figure 2B), the repeated application of STDP leads the neuron with a highly fluctuating potential to tune itself on its earliest afferents and respond faster.
leak current (τ = 20 ms) in the efferent neuron. Simulation parameters and typical incoming activity were of the same kind as in the main experiment (see Figure 2A). Again, the same trend was observed with a slowing of convergence and the appearance of irregular second postsynaptic spikes that did not prevent the neuron from becoming tuned to its earliest regular afferents (see Figure 7).
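The leak current (τ = 20 ms) can be illustrated with a minimal event-driven leaky-integrator sketch, assuming exponential decay of the potential between input spikes; the paper does not specify its integration scheme, and the function name and interface are ours:

```python
import math

def lif_first_spike(spike_inputs, threshold, tau_ms=20.0):
    """Return the time of the first output spike of a leaky integrator whose
    potential decays exponentially with time constant tau_ms and jumps by the
    synaptic weight at each input spike; None if threshold is never reached."""
    v, t_prev = 0.0, 0.0
    for t, w in sorted(spike_inputs):          # (time_ms, effective_weight)
        v *= math.exp(-(t - t_prev) / tau_ms)  # decay since the last event
        v += w
        t_prev = t
        if v >= threshold:
            return t
    return None
```

Because of the decay, closely spaced early spikes are far more effective at reaching threshold than the same spikes spread out in time, which is why the leak slowed, but did not prevent, convergence on the earliest afferents.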
Figure 7: Leaky integrator and multiple postsynaptic responses. Here, a leak current (τ = 20 ms) is added to the neuron potential, and more than one postsynaptic spike can be elicited in the course of a presentation, triggering STDP repeatedly. Typical incoming activity is of the same kind as in the main experiment: a reproducible structure of spike trains amid 5 ms jitter and 5 Hz spontaneous activity (see Figure 2A). It is presented to the output neuron after setting its potential to 0. Prior to presentation, the presynaptic neurons do not fire any spike. The black line links the first postsynaptic response latency at each presentation. Times at which the efferent neuron discharges again are displayed as empty circles. Except for some additional postsynaptic responses that did not affect the dynamics of learning, convergence on the earliest afferents was reached, slightly more slowly than in the comparable case without a leaky integrator (see Figure 2B, bottom). Due to the leakage current, synapses had to be initialized at a slightly higher value than in previous simulations. The sum of synaptic weights thus started from a higher value than before, above the postsynaptic neuron’s threshold (see Figure 2B, top). It nonetheless behaved the same way, stabilizing at the threshold value.
One could also address the fact that theoretical studies use the idealized “smooth” STDP curve, while experimental data always display some noise: the curve is also likely to be noisy in biological settings (Bi & Poo, 1998). We thus tested this feature in a specific simulation under the same conditions as in section 3.1, except that each single synaptic modification was affected by a random offset taken from a gaussian distribution of mean 0 and with a standard deviation set to the noiseless amount of modification (see Figure 8A). Even when noise affects the synaptic modifications in a biologically realistic way, the trend emerges in a very similar manner (see Figure 8B).
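The noisy-modification manipulation tested here is a one-line transformation; in this sketch the function name is ours, but the rule (gaussian offset of mean 0 and standard deviation equal to the noiseless amount) is exactly the one described above:

```python
import random

def noisy_modification(delta, rng):
    """Apply the Figure 8 manipulation: add a gaussian offset of mean 0
    and standard deviation equal to the noiseless modification |delta|."""
    return delta + rng.gauss(0.0, abs(delta))
```

With this noise level, roughly 16% of intended potentiations come out as depressions (and vice versa), reproducing the sign flips visible in the experimentally measured STDP scatter plots (Bi & Poo, 1998).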
Figure 8: Noisy synaptic modifications. This simulation is identical to the main one (see Figure 2) except that each synaptic modification is affected by an offset taken from a gaussian distribution of mean 0 and standard deviation set to the noiseless amount of modification. Incoming activity is of the same kind as in the main experiment: a reproducible structure of spike trains amid 5 ms jitter and 5 Hz spontaneous activity (see Figure 2A). It is presented to the output neuron after setting its potential to 0. Prior to presentation, the presynaptic neurons do not fire any spike. (A) Typical sample of synaptic modifications. Taken from the modifications occurring at step 1 (only one out of two modifications is represented, for clarity), this scatter plot shows the amount of each synaptic modification as a function of the timing of the corresponding spike relative to the postsynaptic spike (dashed line). Notice how some expected rewards are in fact depressions, and vice versa, as in the experimentally obtained STDP graphs (Bi & Poo, 1998). (B) Dynamics of repeated STDP. The trend emerges much as in Figure 2, illustrating that STDP is able to select the first spikes of a reproducible input pattern even when synaptic modifications are randomized.
Evidently, the demonstration here depends on the learning rule used in these simulations, and in particular on its temporally asymmetric shape. This form has been observed in several empirical studies (Markram et al., 1997; Bi & Poo, 1998, 2001; Zhang et al., 1998; Feldman, 2000; see Sjöström & Nelson, 2002, for a review) and has triggered numerous theoretical studies of STDP (Kempter, Gerstner, & van Hemmen, 1999; Song, Miller, & Abbott, 2000; van Rossum, Bi, & Turrigiano, 2000; Gerstner & Kistler, 2002b; see Abbott & Nelson, 2000, and Kepecs, van Rossum, Song, & Tegner, 2002, for reviews). Other forms exist, for example, symmetric (Egger, Feldmeyer, & Sakmann, 1999) or reversed (Bell, Han, Sugawara, & Grant, 1997), that could give rise to radically different trends. In the latter case, for example, where a pre-before-post pairing would induce depression and post-before-pre
reinforcement, inhibiting the last spikes that make a cell fire might in fact increase the postsynaptic latency from one presentation to the next. A critical question concerns the handling of spike timing-dependent modifications when the same synapse receives more than one input spike falling in the STDP window. Here we simply assumed that the effects of spike pairs sum linearly, as in many theoretical studies (Gerstner, Kempter, van Hemmen, & Wagner, 1996; Song, Miller, & Abbott, 2000; Senn, Markram, & Tsodyks, 2001), but the question extends to the problem of synaptic modifications in natural spike trains. A recent experimental study showed that the efficacy of each spike in modifying synaptic strength was suppressed by the preceding spike in the same neuron: STDP favors the earliest of two spike pairs (Froemke & Dan, 2002), that is, in our case, reinforcement over depression. Moreover, in a spike sequence following a period of inactivity, each spike decreases the probability of vesicle release for the following presynaptic action potentials, reducing the amplitude of their respective EPSPs in the process (Thomson & Deuchars, 1994). This form of short-term plasticity, synaptic depression, acts at the presynaptic level through the depletion of the pool of vesicles at the synaptic release site, where recycling is a relatively slow process compared to sustained activity. Not only does it by itself suggest a relative importance of first spikes as opposed to later ones, but it also interacts with long-term potentiation in a supportive way called synaptic redistribution (Markram & Tsodyks, 1996). While increasing (or decreasing) the probability of transmitter release, STDP would at the same time decrease (resp. increase) the availability of readily releasable vesicles for later spikes: when they get stronger, synapses depress more, and vice versa (Markram & Tsodyks, 1996; Volgushev, Voronin, Chistiakova, & Singer, 1997).
As such, “synaptic redistribution can significantly enhance the amplitude of synaptic transmission for the first spikes in a sequence,” thus giving a predominant role to the first spikes in terms of synaptic plasticity (Abbott & Nelson, 2000). Applied to the present framework, these more realistic simulations of synaptic modifications would no doubt reinforce the “back-in-time” dynamics of STDP, as illustrated by latency reduction, a necessary consequence of its very form.

5 Conclusion

The most important implications of our work concern the way the brain stores and uses information. Much work in neural coding assumes that information is encoded using either firing rate or synchrony. This study raises doubts about this assumption by demonstrating the unambiguous prevalence of the earliest firing inputs. This sensitivity to spatiotemporal spike patterns, obviously inherent in STDP rules, emphasizes the importance of temporal structure in neural information, as proposed and argued by several authors (Perkel & Bullock, 1968; Bialek, Rieke, de Ruyter van Steveninck, & Warland, 1991; Engel, Konig, Kreiter, Schillen, & Singer, 1992; de Ruyter
van Steveninck, Lewen, Strong, Koberle, & Bialek, 1997; Bair, 1999; Panzeri, Petersen, Schultz, Lebedev, & Diamond, 2001; Van Rullen & Thorpe, 2001). As stated by Froemke and Dan (2002), “Timing of the first spike in each burst is dominant in synaptic modifications.” Latencies do indeed seem to play a distinct role in terms of neural processing, as shown throughout this study. We showed how spatially distributed synapses can be made to integrate a spike wave coming from an afferent population and evoke a fast and selective response in the postsynaptic neuron, even in the presence of background noise. As a consequence, the fact that STDP naturally leads a neuron to respond rapidly and selectively on the basis of the first few spikes from its afferents lends support to the idea that even complex visual recognition tasks can be performed on the basis of a single wave of spikes (Van Rullen & Thorpe, 2002; Thorpe & Imbert, 1989).

Acknowledgments

This research was supported by the CNRS, the ACI Neurosciences Computationelles et Intégratives, and SpikeNet Technology SARL. We also thank our reviewers for their critical comments.

References

Abbott, L. F., & Nelson, S. B. (2000). Synaptic plasticity: Taming the beast. Nat. Neurosci., 3 (Suppl.), 1178–1183.

Abeles, M. (1991). Corticonics: Neural circuits of the cerebral cortex. Cambridge: Cambridge University Press.

Bair, W. (1999). Spike timing in the mammalian visual system. Curr. Opin. Neurobiol., 9, 447–453.

Bair, W., & Koch, C. (1996). Temporal precision of spike trains in extrastriate cortex of the behaving macaque monkey. Neural Comput., 8, 1185–1202.

Bell, C. C., Han, V. Z., Sugawara, Y., & Grant, K. (1997). Synaptic plasticity in a cerebellum-like structure depends on temporal order. Nature, 387, 278–281.

Bi, G. Q., & Poo, M. M. (1998). Synaptic modifications in cultured hippocampal neurons: Dependence on spike timing, synaptic strength, and postsynaptic cell type. J. Neurosci., 18, 10464–10472.

Bi, G. Q., & Poo, M. M. (2001). Synaptic modification by correlated activity: Hebb’s postulate revisited. Annu. Rev. Neurosci., 24, 139–166.

Bialek, W., Rieke, F., de Ruyter van Steveninck, R. R., & Warland, D. (1991). Reading a neural code. Science, 252, 1854–1857.

Cateau, H., & Fukai, T. (2003). A stochastic method to predict the consequence of arbitrary forms of spike-timing-dependent plasticity. Neural Comput., 15, 597–620.

de Ruyter van Steveninck, R. R., Lewen, G. D., Strong, S. P., Koberle, R., & Bialek, W. (1997). Reproducibility and variability in neural spike trains. Science, 275, 1805–1808.
Neurons Tune to the Earliest Spikes Through STDP
Egger, V., Feldmeyer, D., & Sakmann, B. (1999). Coincidence detection and changes of synaptic efficacy in spiny stellate neurons in rat barrel cortex. Nat. Neurosci., 2, 1098–1105.
Engel, A. K., Konig, P., Kreiter, A. K., Schillen, T. B., & Singer, W. (1992). Temporal coding in the visual cortex: New vistas on integration in the nervous system. TINS, 15, 218–226.
Evarts, E. V. (1964). Temporal patterns of discharge of pyramidal tract neurons during sleep and waking in the monkey. J. Neurophysiol., 27, 152–171.
Feldman, D. E. (2000). Timing-based LTP and LTD at vertical inputs to layer II/III pyramidal cells in rat barrel cortex. Neuron, 27, 45–56.
Froemke, R. C., & Dan, Y. (2002). Spike-timing-dependent synaptic modification induced by natural spike trains. Nature, 416, 433–438.
Gautrais, J., & Thorpe, S. (1998). Rate coding versus temporal order coding: A theoretical approach. Biosystems, 48, 57–65.
Gerstner, W., Kempter, R., van Hemmen, J. L., & Wagner, H. (1996). A neuronal learning rule for sub-millisecond temporal coding. Nature, 383, 76–81.
Gerstner, W., & Kistler, W. M. (2002a). Spiking neuron models. Cambridge: Cambridge University Press.
Gerstner, W., & Kistler, W. M. (2002b). Mathematical formulations of Hebbian learning. Biol. Cybern., 87, 404–415.
Gerstner, W., Kreiter, A. K., Markram, H., & Herz, A. V. (1997). Neural codes: Firing rates and beyond. Proc. Natl. Acad. Sci. USA, 94, 12740–12741.
Green, D. M., & Swets, J. A. (1966). Signal detection theory and psychophysics. New York: Wiley.
Heil, P. (1997). Auditory cortical onset responses revisited. I. First-spike timing. J. Neurophysiol., 77, 2616–2641.
Hopfield, J. J. (1995). Pattern recognition computation using action potential timing for stimulus representation. Nature, 376, 33–36.
Hubel, D. H. (1959). Single unit activity in striate cortex of unrestrained cats. J. Physiol., 147, 226–238.
Kempter, R., Gerstner, W., & van Hemmen, J. L. (1999). Hebbian learning and spiking neurons. Phys. Rev. E, 59, 4498–4514.
Kepecs, A., Van Rossum, M. C., Song, S., & Tegner, J. (2002). Spike-timing-dependent plasticity: Common themes and divergent vistas. Biol. Cybern., 87, 446–458.
Liu, R. C., Tzonev, S., Rebrik, S., & Miller, K. D. (2001). Variability and information in a neural code of the cat lateral geniculate nucleus. J. Neurophysiol., 86, 2789–2806.
Mainen, Z. F., & Sejnowski, T. J. (1995). Reliability of spike timing in neocortical neurons. Science, 268, 1503–1506.
Markram, H., Lubke, J., Frotscher, M., & Sakmann, B. (1997). Regulation of synaptic efficacy by coincidence of postsynaptic APs and EPSPs. Science, 275, 213–215.
Markram, H., & Tsodyks, M. (1996). Redistribution of synaptic efficacy between neocortical neurons. Nature, 382, 807–810.
Meister, M., & Berry, M. J., II. (1999). The neural code of the retina. Neuron, 22(3), 435–450.
Nakamura, K. (1998). Neural processing in the subsecond time range in the temporal cortex. Neural Comput., 10, 567–595.
R. Guyonneau, R. Van Rullen, and S. Thorpe
Panzeri, S., Petersen, R. S., Schultz, S. R., Lebedev, M., & Diamond, M. E. (2001). The role of spike timing in the coding of stimulus location in rat somatosensory cortex. Neuron, 29, 769–777.
Perkel, D. H., & Bullock, P. H. (1968). Neural coding. Neurosci. Res. Prog. Sum., 3, 405–527.
Petersen, R. S., Panzeri, S., & Diamond, M. E. (2001). Population coding of stimulus location in rat somatosensory cortex. Neuron, 32, 503–514.
Rao, R. P. N., & Sejnowski, T. J. (2000). Predictive sequence learning in recurrent neocortical circuits. In S. A. Solla, T. K. Leen, & K.-R. Müller (Eds.), Advances in neural information processing systems, 12 (pp. 164–170). Cambridge, MA: MIT Press.
Richmond, B. J., & Optican, L. M. (1990). Temporal encoding of two-dimensional patterns by single units in primate primary visual cortex. II. Information transmission. J. Neurophysiol., 64, 370–380.
Rubin, J., Lee, D. D., & Sompolinsky, H. (2001). Equilibrium properties of temporally asymmetric Hebbian plasticity. Phys. Rev. Lett., 86, 364–367.
Senn, W., Markram, H., & Tsodyks, M. (2001). An algorithm for modifying neurotransmitter release probability based on pre- and postsynaptic spike timing. Neural Comput., 13, 35–67.
Sestokas, A. K., Lehmkuhle, S., & Kratz, K. E. (1991). Relationship between response latency and amplitude for ganglion and geniculate X- and Y-cells in the cat. Int. J. Neurosci., 60, 59–64.
Shadlen, M. N., & Newsome, W. T. (1994). Noise, neural codes and cortical organization. Curr. Opin. Neurobiol., 4, 569–579.
Shadlen, M. N., & Newsome, W. T. (1998). The variable discharge of cortical neurons: Implications for connectivity, computation, and information coding. J. Neurosci., 18, 3870–3896.
Sjostrom, P. J., & Nelson, S. B. (2002). Spike timing, calcium signals and synaptic plasticity. Curr. Opin. Neurobiol., 12, 305–314.
Song, S., Miller, K. D., & Abbott, L. F. (2000). Competitive Hebbian learning through spike-timing-dependent synaptic plasticity. Nat. Neurosci., 3, 919–926.
Steriade, M., Oakson, G., & Kitsikis, A. (1978). Firing rates and patterns of output and nonoutput cells in cortical areas 5 and 7 of cat during the sleep-waking cycle. Exp. Neurol., 60, 443–468.
Thomson, A. M., & Deuchars, J. (1994). Temporal and spatial properties of local circuits in neocortex. Trends Neurosci., 17, 119–126.
Thorpe, S. J., & Imbert, M. (1989). Biological constraints on connectionist models. In R. Pfeifer, Z. Schreter, & F. Fogelman-Soulié (Eds.), Connectionism in perspective (pp. 63–92). Amsterdam: Elsevier.
Tsodyks, M. (2002). Spike-timing-dependent synaptic plasticity—the long road towards understanding neuronal mechanisms of learning and memory. Trends Neurosci., 25, 599–600.
van Rossum, M. C., Bi, G. Q., & Turrigiano, G. G. (2000). Stable Hebbian learning from spike timing-dependent plasticity. J. Neurosci., 20, 8812–8821.
VanRullen, R., & Thorpe, S. J. (2001). Rate coding versus temporal order coding: What the retinal ganglion cells tell the visual cortex. Neural Comput., 13, 1255–1283.
VanRullen, R., & Thorpe, S. J. (2002). Surfing a spike wave down the ventral stream. Vision Res., 42, 2593–2615.
Victor, J. D., & Purpura, K. P. (1996). Nature and precision of temporal coding in visual cortex: A metric-space analysis. J. Neurophysiol., 76, 1310–1326.
Volgushev, M., Voronin, L. L., Chistiakova, M., & Singer, W. (1997). Relations between long-term synaptic modifications and paired-pulse interactions in the rat neocortex. Eur. J. Neurosci., 9, 1656–1665.
Zhang, L. I., Tao, H. W., Holt, C. E., Harris, W. A., & Poo, M. (1998). A critical window for cooperation and competition among developing retinotectal synapses. Nature, 395, 37–44.
Received January 30, 2004; accepted August 25, 2004.
LETTER
Communicated by Alex Reyes
Rate and Synchrony in Feedforward Networks of Coincidence Detectors: Analytical Solution

Shawn Mikula
[email protected]
Ernst Niebur
[email protected]
Zanvyl Krieger Mind/Brain Institute and Department of Neuroscience, Johns Hopkins University, Baltimore, MD 21218, U.S.A.
We provide an analytical recurrent solution for the firing rates and cross-correlations of feedforward networks with arbitrary connectivity, excitatory or inhibitory, in response to steady-state spiking input to all neurons in the first network layer. Connections can go between any two layers as long as no loops are produced. Mean firing rates and pairwise cross-correlations of all input neurons can be chosen individually. We apply this method to study the propagation of rate and synchrony information through sample networks to address the current debate regarding the efficacy of rate codes versus temporal codes. Our results from applying the network solution to several examples support the following conclusions: (1) differential propagation efficacy of rate and synchrony to higher layers of a feedforward network is dependent on both network and input parameters, and (2) previous modeling and simulation studies exclusively supporting either rate or temporal coding must be reconsidered within the limited range of network and input parameters used. Our exact, analytical solution for feedforward networks of coincidence detectors should prove useful for further elucidating the efficacy and differential roles of rate and temporal codes in terms of different network and input parameter ranges.

1 Introduction

Much of the theoretical basis for understanding information processing in complex biological systems is based on computational modeling—numerical solutions of the underlying model equations. While this approach has proven extremely useful and is the only practical one in many cases, analytical solutions would nearly always be preferable if they were available. Naturally, this is the case only for carefully selected systems of sufficient simplicity. In this report, we present one such system: an interconnected network of ideal coincidence detectors of arbitrary depth.
Neural Computation 17, 881–902 (2005)
Our solution is limited to feedforward networks, but otherwise the connectivity is arbitrary:
© 2005 Massachusetts Institute of Technology
S. Mikula and E. Niebur
neuronal thresholds can be arbitrary, synapses can be of arbitrary strengths, they can be excitatory and inhibitory, and they can be between any two neurons in any two layers provided that no loops are formed (this is the feedforward condition). The input to the network is characterized in terms of the average firing rate and the pairwise correlation, and we present iterative, closed-form solutions for all neurons (and pairs) in the network in the same form. The model is based on our previous exact, analytical solutions for the output firing rate of an individual coincidence detector receiving excitatory and inhibitory inputs, in both the presence and absence of synaptic depression (Mikula & Niebur, 2003a, 2003b, 2004). After defining methods and notations in section 2, we present our main result, the closed-form expressions for mean rate and cross-correlation, in section 3. Example networks are studied in sections 4 and 5. Implications for neural coding are discussed in section 6.

2 Methods

2.1 Notational Conventions. Matrix notation is used throughout our derivation. Matrices are denoted by boldfaced, uppercase letters; row and column vectors by boldfaced, lowercase letters; and scalars by regular lowercase letters. Matrices and vectors that are functions of one or more variables have their arguments denoted by subscripts, whereas scalars have their arguments within parentheses. Thus, for instance, $\mathbf{A}_j$ denotes a matrix that is a function of one argument, j; the vector $\mathbf{c}_{i,j}$ is a function of two arguments, i and j; and q(i, j; s, t) denotes a scalar that is a function of four arguments, i, j, s, and t.

2.2 Model Neurons: Coincidence Detectors. The model neurons used in this study are coincidence detectors. A coincidence detector is a computational unit that fires at time t if the number of unit excitatory postsynaptic potentials (EPSPs) received within the window (t − T, t) equals or exceeds the threshold θ.
In many cases, it makes sense to think of T as a period on the order of 10 ms. This is the timescale of fast ionic synaptic conductances, and it is at this timescale that synaptic events superpose and interact. We do not, however, make use of this specific setting in our analysis, other than requiring that it be sufficiently small so that a maximum of one spike can be generated in a period of this length, and our results do not depend on this setting.

2.3 Binomial Spike Trains with Specific Cross-Correlation. In a previous report (Mikula & Niebur, 2003a), we introduced a systematic method for the generation of an arbitrary number of spike trains with specified pairwise mean cross-correlations and firing rates. Action potentials are distributed according to binomial counting statistics in each spike train.
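The two ingredients introduced so far, the coincidence detector of section 2.2 and the binomial spike trains of section 2.3, can be sketched in a few lines. This is our own illustration; the function names and the use of NumPy are assumptions, not the paper's code:

```python
import numpy as np

# Minimal sketch of sections 2.2-2.3 (names are ours). A coincidence detector
# fires in a time bin iff its summed input within that bin reaches theta.
def coincidence_detector(inputs, weights, theta):
    return 1 if np.dot(weights, inputs) >= theta else 0

# A binomial spike train: independent bins, spike probability p per bin, so
# the firing rate is f = p / dt (equation 2.1).
def binomial_spike_train(n_bins, p, rng):
    return (rng.random(n_bins) < p).astype(int)

rng = np.random.default_rng(0)
train = binomial_spike_train(100_000, 0.2, rng)
assert abs(train.mean() - 0.2) < 0.01                        # empirical rate ~ p
assert coincidence_detector([1, 1, 0], [1, 1, 1], 2) == 1    # 2 spikes >= theta
assert coincidence_detector([1, 0, 0], [1, 1, 1], 2) == 0    # subthreshold
```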
Rate and Synchrony in Feedforward Networks
Mean firing rates and cross-correlations are the same for all spike trains (or all pairs of spike trains, respectively), but they can be chosen independently of each other. We describe the procedure here for the convenience of the reader. Let m be the number of input spike trains, each having n time bins. All bins are of equal length δt, chosen sufficiently small so that each contains a maximum of one spike; that is, each bin is guaranteed to contain either one or zero spikes. The coincidence detector makes the decision whether to fire within one time bin; therefore, T = δt. Assuming a firing rate of

$$f = \frac{p}{\delta t}, \tag{2.1}$$
the probability that a spike is found in any given time bin is p; no spike is found with probability (1 − p). Bins in any given spike train are independent, which implies that the following analysis can be limited to a single time bin. A physiologically important special case is obtained if the rate of incoming spikes is low and convergence is high; the binomial statistics that governs the spikes generated by a coincidence detector is then approximated by Poisson statistics. We further note that throughout this letter, we often refer to the spike probability, p, simply as an input or output firing rate, with the understanding that the actual firing rate is obtained by dividing p by the time bin size, δt, as in equation 2.1.

2.4 Network Architecture. An example feedforward network is shown in Figure 1. To characterize the connectivity between layers in the network, let us introduce the connectivity matrix, denoted $\mathbf{C}_j$, which quantifies the connectivity from the jth layer to the (j + 1)th layer and in which the (k, l)th entry contains the numerical value of the connection to the kth neuron in layer j + 1 from the lth neuron in layer j. The size of $\mathbf{C}_j$ is $m_{j+1} \times m_j$, where $m_j$ denotes the number of neurons in layer j. The values of the connectivity matrix are real numbers: positive for excitatory connections and negative for inhibitory connections. We define the connectivity vector, a row vector denoted $\mathbf{c}_{i,j}$, to be the ith row of $\mathbf{C}_j$. It follows from the definition of $\mathbf{C}_j$ that the length of $\mathbf{c}_{i,j}$ is $m_j$ and that the kth element of $\mathbf{c}_{i,j}$ contains a real number describing the connection from the kth neuron in layer j to the ith neuron in layer j + 1. Although the example network in Figure 1 is limited to connections between subsequent layers, that is, between layers j and (j + 1) for all j, we note that our formalism is sufficiently general to allow connections between any two layers j and j + J with J ≥ 1, provided layer j + J exists in the network.
This can be seen by first considering a network with connections only between subsequent layers (as in Figure 1), that is, with J = 1. The connections with J > 1 are then added by formally adding virtual neurons with one incoming synapse whose strength exceeds its threshold.
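The relay construction can be checked directly. The sketch below (our names, not the paper's code) implements one intermediate "virtual" neuron whose single synapse meets its threshold, so it copies its input one layer onward and a J = 2 connection becomes two J = 1 hops:

```python
import numpy as np

# A thresholded layer update (Heaviside of weighted input minus threshold).
# Names are ours; this is an illustration, not the paper's code.
def layer_response(state, C, thetas):
    return (state @ C.T - thetas >= 0).astype(int)

# Virtual relay neuron: one incoming synapse of weight 1 and threshold 1, so
# its output always equals its input.
C_relay = np.array([[1.0]])
for source in ([0], [1]):
    relayed = layer_response(np.array(source), C_relay, np.array([1.0]))
    assert relayed[0] == source[0]   # the relay reproduces its input exactly
```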
Figure 1: A simple (n+1)-layer feedforward network with cyclical boundary conditions, period 4. Coincidence detectors are represented by circles, and connections are shown as directed arrows between coincidence detectors. Input is shown as stylized spike trains at the very bottom. Comma-separated pairs of numbers indicate layer (second number) and neuron in this layer (first number).
Finally, we denote the output firing probability (which, by equation 2.1, is directly proportional to output firing rate) for the ith neuron located in the jth layer by the scalar firing rate, p(i, j).

3 Results

3.1 Computation of Firing Rates for the Network. As discussed in section 2.3, the $m_0$ initial inputs to our feedforward network are binomial and originate at the zeroth layer (see Figure 1). Because each of the $m_0$ inputs is the outcome of a Bernoulli trial, the total number of possible input states is $2^{m_0}$.¹ We will find it useful to enumerate these $2^{m_0}$ input states and to assign corresponding probabilities for their occurrence.
¹ A Bernoulli trial is defined as a single random event for which there are two and only two possible outcomes that are mutually exclusive and have a priori fixed probabilities that sum to unity. In our case, the outcome of a trial will be either zero or exactly one
Toward this end, we define the input state matrix, $\mathbf{\Psi}_0$. Each row of this matrix represents one input state in the form of a binary row vector whose kth entry consists of a zero or a one, denoting the absence or presence of an input spike in the kth input, respectively. From the above considerations, we know that $\mathbf{\Psi}_0$ must be of size $2^{m_0} \times m_0$. We now determine how the different input states are transformed at layer 1 of our network. That is, we compute how each input state affects suprathreshold events for all of the $m_1$ neurons in layer 1. The solution is based on the known connectivities between the $m_0$ inputs and the $m_1$ neurons in layer 1, $\mathbf{C}_0$ (see section 2.4), and on the thresholds of the neurons in layer 1, listed in the matrix $\mathbf{\theta}_1$. This matrix has $2^{m_0}$ identical rows, each consisting of the vector of all thresholds in layer 1; thus, the element in column j (and all rows) of this matrix is the threshold of neuron j in layer 1 (the usefulness of this construct will become obvious in equation 3.1). These network dynamics transform the input states, contained in the matrix $\mathbf{\Psi}_0$, into the resulting states in layer 1 of the network. We collect these states in the matrix $\mathbf{\Psi}_1$, which describes the states of layer 1 and is computed as follows:

$$\mathbf{\Psi}_1 = \Theta\!\left(\mathbf{\Psi}_0 \mathbf{C}_0^{T} - \mathbf{\theta}_1\right). \tag{3.1}$$
By the definition of matrix products, the element in row i and column j of $\mathbf{\Psi}_0 \mathbf{C}_0^{T}$ is the input to neuron j in layer 1 when state i is present in layer 0. Furthermore, we define the operator $\Theta(\cdot)$ in this equation as the Heaviside function acting on the individual entries of its matrix argument. It takes on the value of unity if an entry within its argument is nonnegative and zero otherwise.² Note that the dimensions of $\mathbf{\Psi}_1$ are $2^{m_0} \times m_1$ and that the kth row of $\mathbf{\Psi}_1$ is the layer 1 state corresponding to the kth row of $\mathbf{\Psi}_0$. That is, the transformation of the kth input state at layer 1 is given by the kth row of $\mathbf{\Psi}_1$. To be sure, while $\mathbf{\Psi}_0$ is an exhaustive list of all states in the input layer, $\mathbf{\Psi}_1$ is, in general, not a matrix containing all states in layer 1. Depending on $m_0$, $m_1$, and the connectivity and thresholds, this matrix in general contains only a subset of all possible states in layer 1, and each of these states may appear more than once. We can recursively apply the same method used in equation 3.1 to generate the state matrices for other layers of our network to arrive at the general expression for $\mathbf{\Psi}_j$, the layer j state matrix:

$$\mathbf{\Psi}_j = \Theta\!\left(\mathbf{\Psi}_{j-1} \mathbf{C}_{j-1}^{T} - \mathbf{\theta}_j\right), \tag{3.2}$$
spike of an input neuron in a given time bin; the former happens with probability p where 0 ≤ p ≤ 1 and the latter with probability (1 − p).
² We note that this component-wise procedure is not an operation defined in a vector space. We nevertheless use the terminology of vector spaces (rather than introducing terms like arrays or n-tuples) for convenience.
where the kth row of $\mathbf{\Psi}_j$ gives us the layer j state corresponding to the kth row of $\mathbf{\Psi}_0$. The iteration defines $\mathbf{\Psi}_j$ in terms of $\mathbf{\Psi}_{j-1}$ and $\mathbf{C}_{j-1}$ and terminates at $\mathbf{\Psi}_0$ and $\mathbf{C}_0$, as in equation 3.1. The ith column of $\mathbf{\Psi}_j$, which is of length $2^{m_0}$, gives us the neuronal responses of the ith neuron in layer j to the $2^{m_0}$ input states. Let this column vector be denoted $\mathbf{\psi}_{i,j}$, the state vector for the ith neuron in layer j; then the following holds:

$$\mathbf{\psi}_{i,j} = \Theta\!\left(\mathbf{\Psi}_{j-1}\, \mathbf{c}_{i,j-1}^{T} - \theta(i,j)\right), \tag{3.3}$$
where the scalar θ(i, j) is the threshold of neuron i in layer j. Note that the only difference between the right-hand sides of equations 3.3 and 3.2 is that the latter substitutes the connectivity matrix $\mathbf{C}_{j-1}$ and the threshold matrix $\mathbf{\theta}_j$ for the connectivity vector $\mathbf{c}_{i,j-1}$ and the threshold scalar θ(i, j), respectively. Having determined how input states are transformed across the network by equations 3.1 and 3.2, we now proceed to compute mean output spiking probabilities. We define the input state probability vector $\mathbf{p}_{in}$, a column vector whose kth element is the probability for the occurrence of the kth input state. The output spiking probability for any neuron in the network is obtained by summing, over all input states leading to suprathreshold events, the corresponding input state probabilities (given by $\mathbf{p}_{in}$). Thus, the output spiking probabilities for all neurons located in layer j can be written as a column vector $\mathbf{p}_j$, computed as

$$\mathbf{p}_j = \mathbf{\Psi}_j^{T}\, \mathbf{p}_{in}, \tag{3.4}$$
where the ith entry of $\mathbf{p}_j$ is the output spiking probability for the ith neuron located in layer j and where $\mathbf{\Psi}_j^{T}$ is the transpose of $\mathbf{\Psi}_j$. Alternatively, by using equation 3.3, we can obtain an explicit expression for the output spiking probability of a single neuron located anywhere in the network. The output spiking probability for the ith neuron located in layer j, the scalar p(i, j), is given by

$$p(i, j) = \mathbf{\psi}_{i,j}^{T}\, \mathbf{p}_{in}. \tag{3.5}$$
Note that p(i, j) is the same as the ith entry of $\mathbf{p}_j$ from equation 3.4. With equations 3.4 and 3.5, we have thus obtained an analytical recurrent solution for the output spiking probabilities (or firing rates, as per equation 2.1) of the coincidence detectors comprising a feedforward network that receives an arbitrary number of binomial inputs modulated with respect to both mean rate and cross-correlation.
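The closed-form solution of equations 3.1 through 3.5 translates almost line for line into code. The sketch below is our own (variable names and NumPy usage are assumptions): it enumerates the $2^{m_0}$ input states, propagates them with a thresholded matrix product, and recovers the obvious answer p² for a two-input detector that needs both spikes:

```python
import numpy as np

# Our compact sketch of equations 3.1-3.5 (names are ours, not the paper's).
def input_states(m0):
    # Rows count upward in binary, input 1 as the most significant bit.
    return np.array([[(k >> (m0 - 1 - b)) & 1 for b in range(m0)]
                     for k in range(2 ** m0)])

def propagate(psi, C, thetas):
    # Equation 3.2: Psi_j = Heaviside(Psi_{j-1} C^T - theta_j).
    return (psi @ C.T - thetas >= 0).astype(int)

def rates(psi, p_in):
    # Equation 3.4: p_j = Psi_j^T p_in.
    return psi.T @ p_in

# Two independent inputs with spike probability p and one detector that needs
# both spikes (theta = 2): its output probability must be p^2.
p = 0.3
psi0 = input_states(2)                                  # states 00, 01, 10, 11
p_in = np.array([(1-p)**2, (1-p)*p, p*(1-p), p**2])
psi1 = propagate(psi0, np.array([[1.0, 1.0]]), np.array([2.0]))
assert abs(rates(psi1, p_in)[0] - p**2) < 1e-12
```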
3.2 Computation of Cross-Correlations. The standard Pearson correlation coefficient, denoted q(x, y), is defined for random variables x and y by

$$q(x, y) = \frac{\overline{xy} - \bar{x}\,\bar{y}}{\sqrt{\mathrm{var}(x)\,\mathrm{var}(y)}}, \tag{3.6}$$
where $\bar{x}$ denotes the mean value of x, and var(x) denotes the variance of x. If x is a Bernoulli random variable (i.e., takes on values 0 or 1), then its variance is given by

$$\mathrm{var}(x) = \bar{x} - \bar{x}^2, \tag{3.7}$$
and we may rewrite equation 3.6 as the following (Mikula & Niebur, 2003b):

$$q(x, y) = \frac{\mathrm{prob}(xy = 1) - \mathrm{prob}(x = 1)\,\mathrm{prob}(y = 1)}{\sqrt{\left[\mathrm{prob}(x = 1) - \mathrm{prob}(x = 1)^2\right]\left[\mathrm{prob}(y = 1) - \mathrm{prob}(y = 1)^2\right]}}, \tag{3.8}$$
where prob(x = 1) denotes the probability that x takes the value of unity, and prob(xy = 1) denotes the probability that both x and y take the value of unity. In the notation established earlier, the spiking probability scalar p(i, j) of neuron i in layer j takes the place of prob(x = 1) in equation 3.8, taking into account that it is based on a Bernoulli process. In this equation, to compute the correlation with the sth neuron in layer t, the probability p(s, t) takes the place of prob(y = 1). And finally, since the joint probability for the ith neuron in layer j and the sth neuron in layer t to fire action potentials is given by $\Theta(\mathbf{\psi}_{i,j} + \mathbf{\psi}_{s,t} - 2)^{T}\, \mathbf{p}_{in}$, as can be seen by direct substitution, it follows that prob(xy = 1) is equivalent to $\Theta(\mathbf{\psi}_{i,j} + \mathbf{\psi}_{s,t} - 2)^{T}\, \mathbf{p}_{in}$. We can therefore rewrite equation 3.8 as the following expression for the cross-correlation, denoted q(i, j; s, t), between the ith neuron in layer j and the sth neuron in layer t:

$$q(i, j; s, t) = \frac{\Theta(\mathbf{\psi}_{i,j} + \mathbf{\psi}_{s,t} - 2)^{T}\, \mathbf{p}_{in} - p(i, j)\,p(s, t)}{\sqrt{\left[p(i, j) - p(i, j)^2\right]\left[p(s, t) - p(s, t)^2\right]}}, \tag{3.9}$$
where $\mathbf{\psi}_{i,j}$ is obtained using equation 3.3. We have thus obtained an exact recurrent solution for the cross-correlation between any two coincidence detectors, or between any coincidence detector and an input, located within a feedforward network receiving an arbitrary number of binomial inputs modulated with respect to both mean rate and cross-correlation.
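Equation 3.9 can be checked numerically. The sketch below (our names, not the paper's code) verifies that two detectors with identical state vectors are perfectly correlated:

```python
import numpy as np

# Our sketch of equation 3.9: the cross-correlation of two units from their
# state vectors psi_a, psi_b and the input state probabilities p_in.
def correlation(psi_a, psi_b, p_in):
    p_a = psi_a @ p_in
    p_b = psi_b @ p_in
    # Both units fire in exactly the states where psi_a + psi_b equals 2.
    joint = ((psi_a + psi_b - 2) >= 0).astype(int) @ p_in
    return (joint - p_a * p_b) / np.sqrt((p_a - p_a**2) * (p_b - p_b**2))

# Two detectors wired identically to the same two inputs (theta = 1, so each
# fires whenever any input spikes) must be perfectly correlated, q = 1.
p = 0.4
p_in = np.array([(1-p)**2, (1-p)*p, p*(1-p), p**2])   # states 00, 01, 10, 11
psi = np.array([0, 1, 1, 1])                          # fires unless state 00
assert abs(correlation(psi, psi.copy(), p_in) - 1.0) < 1e-9
```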
3.3 Large Networks. On a practical note, we find it useful to consider methods for reducing the size of $\mathbf{\Psi}_0$ for the purpose of making our solutions for rate and cross-correlation applicable to larger feedforward networks. As we saw in section 3.1, for $m_0$ inputs, the size of $\mathbf{\Psi}_0$ is $2^{m_0} \times m_0$, which increases quickly for increasing $m_0$. We assume that the connectivity of the network is given. Under this condition, we have discerned three practical methods for reducing the size of $\mathbf{\Psi}_0$. One of them is to include only suprathreshold events, which means excluding those input states in which a subthreshold number of spikes occurs; all of these are mapped onto a single state with a zero everywhere. Assuming uniform thresholds for all neurons, which we will denote by θ, this method reduces $\mathbf{\Psi}_0$ to size $\left(2^{m_0} - \sum_{j=0}^{\theta-1} \binom{m_0}{j}\right) \times m_0$, which can be a significant reduction as $\theta/m_0$ approaches unity. The mirror image of this simplification is obtained in the case of small $\theta/m_0$; in this case, a large number of input states is mapped onto one state, namely, "one" in all positions. The second method for reducing the size of $\mathbf{\Psi}_0$ is valid for low input rates (i.e., the Poisson regime), in which case the input states characterized by many spikes can be ignored due to the low probability of such a state. The third method is to reduce the number of inputs while maintaining a relatively large number of neurons in each layer. In many cases, this is justified since for large convergence (the physiologically relevant case), only a small number of neurons can be simultaneously active (within one time bin of length T) to avoid the need for unrealistically high thresholds or saturation of the next layer.

4 Simple Example

Let us consider the simple n-layer excitatory feedforward network shown in Figure 1 with four neurons in each layer, cyclical (periodic) boundary connectivity conditions, and θ(i, j) = θ = 2 for all i, j.
We further assume that each connection is excitatory and has a weight of +1. The connectivity patterns between layers are the same for all layers; that is, each neuron in layer j connects to the three closest neurons in layer j + 1. Recalling from section 2.4 that the (k, l)th entry of the connectivity matrix is defined as the weight of the connection to the kth neuron in layer j + 1 from the lth neuron in layer j, we obtain a connectivity matrix that is the same for each layer and given by the following:

$$\mathbf{C}_j = \begin{pmatrix} 1 & 1 & 0 & 1 \\ 1 & 1 & 1 & 0 \\ 0 & 1 & 1 & 1 \\ 1 & 0 & 1 & 1 \end{pmatrix}. \tag{4.1}$$
For computing the total number of input states, we note that there are four binomial inputs and thus $2^4 = 16$ input states. The resulting input state matrix $\mathbf{\Psi}_0$ is
$$\mathbf{\Psi}_0 = \begin{pmatrix}
0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 \\
0 & 0 & 1 & 0 \\
0 & 0 & 1 & 1 \\
0 & 1 & 0 & 0 \\
0 & 1 & 0 & 1 \\
0 & 1 & 1 & 0 \\
0 & 1 & 1 & 1 \\
1 & 0 & 0 & 0 \\
1 & 0 & 0 & 1 \\
1 & 0 & 1 & 0 \\
1 & 0 & 1 & 1 \\
1 & 1 & 0 & 0 \\
1 & 1 & 0 & 1 \\
1 & 1 & 1 & 0 \\
1 & 1 & 1 & 1
\end{pmatrix}. \tag{4.2}$$
Thus, the first row of $\mathbf{\Psi}_0$ is the input state for having zero input spikes for all four inputs, the second row is the input state for having zero input spikes for inputs 1 through 3, whereas input 4 has a spike, and so on for all of the 16 rows of $\mathbf{\Psi}_0$. Using the network connectivity and the input state matrices, we can now apply either equation 3.2 or 3.3 to derive $\mathbf{\Psi}_j$ and $\mathbf{\psi}_{i,j}$. The result is given in Table 1, which shows all attainable states for all neurons in layers 1 through 3 of our network (plus the input layer, 0). We have placed $\mathbf{\Psi}_0$, $\mathbf{\Psi}_1$, $\mathbf{\Psi}_2$, and $\mathbf{\Psi}_3$ alongside each other in Table 1 to emphasize that this is a logical truth table. This should also make it easier to see how different input states, which correspond to the rows of $\mathbf{\Psi}_0$, are transformed across different layers, which correspond to the rows of $\mathbf{\Psi}_1$, $\mathbf{\Psi}_2$, and $\mathbf{\Psi}_3$.

Table 1: Truth Table for the First Three Layers of the n-Layer Feedforward Network Shown in Figure 1.

         Ψ3        |        Ψ2        |        Ψ1        |        Ψ0
ψ13 ψ23 ψ33 ψ43 | ψ12 ψ22 ψ32 ψ42 | ψ11 ψ21 ψ31 ψ41 | ψ10 ψ20 ψ30 ψ40
 0   0   0   0  |  0   0   0   0  |  0   0   0   0  |  0   0   0   0
 0   0   0   0  |  0   0   0   0  |  0   0   0   0  |  0   0   0   1
 0   0   0   0  |  0   0   0   0  |  0   0   0   0  |  0   0   1   0
 0   0   1   1  |  0   0   1   1  |  0   0   1   1  |  0   0   1   1
 0   0   0   0  |  0   0   0   0  |  0   0   0   0  |  0   1   0   0
 1   0   1   0  |  0   1   0   1  |  1   0   1   0  |  0   1   0   1
 0   1   1   0  |  0   1   1   0  |  0   1   1   0  |  0   1   1   0
 1   1   1   1  |  1   1   1   1  |  1   1   1   1  |  0   1   1   1
 0   0   0   0  |  0   0   0   0  |  0   0   0   0  |  1   0   0   0
 1   0   0   1  |  1   0   0   1  |  1   0   0   1  |  1   0   0   1
 0   1   0   1  |  1   0   1   0  |  0   1   0   1  |  1   0   1   0
 1   1   1   1  |  1   1   1   1  |  1   1   1   1  |  1   0   1   1
 1   1   0   0  |  1   1   0   0  |  1   1   0   0  |  1   1   0   0
 1   1   1   1  |  1   1   1   1  |  1   1   1   1  |  1   1   0   1
 1   1   1   1  |  1   1   1   1  |  1   1   1   1  |  1   1   1   0
 1   1   1   1  |  1   1   1   1  |  1   1   1   1  |  1   1   1   1

It turns out that Table 1 characterizes the behavior not only of the first three but of all layers of the (infinite) network. The reason is that all higher even layers are identical to $\mathbf{\Psi}_2$, and similarly, all higher odd layers are identical to $\mathbf{\Psi}_1$. This periodic behavior is not a peculiarity of this specific network but is a property of all feedforward networks of coincidence detectors with identical layer properties (connectivity and thresholds) and finite numbers of neurons per layer; since they can have only a finite number of states, networks of infinite depth need to show periodic behavior. This can be seen easily by considering that there are $2^{m_0}$ possible states in the input layer, layer 0 (see section 3.1), and in all higher layers. Since all layers are assumed to have the same number of neurons, at least two layers between the input layer and layer $(2^{m_0} + 1)$ must have identical activity patterns. Let us assume the first one of these is layer l and the second one is layer l′. Then, since the connectivities between all layers are identical, the pattern in layer (l + 1) is the same as in (l′ + 1), the pattern in (l + 2) is as in (l′ + 2), and so on; activity is thus periodic with a periodicity (l′ − l) starting at layer l. This is what was to be shown. We note in passing that periodicity includes the case of period 1, for example, when the activity "dies out" in layer l because all activity is below threshold and therefore all higher layers have zero activity.

The next step is the computation of the column vector $\mathbf{p}_{in}$, whose ith entry contains the probability for the ith input state (i.e., the ith row of $\mathbf{\Psi}_0$). We will first assume uncorrelated inputs (correlated input is discussed later; see equation 4.6 for an example).
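The period-2 behavior read off Table 1 can be reproduced numerically. The following sketch (our names, not the paper's code) builds the network of equation 4.1 and checks that layer 3 repeats layer 1:

```python
import numpy as np

# Our numerical check of the periodicity argument: the cyclic four-neuron
# network of equation 4.1 with theta = 2 everywhere.
C = np.array([[1, 1, 0, 1],
              [1, 1, 1, 0],
              [0, 1, 1, 1],
              [1, 0, 1, 1]], dtype=float)

def propagate(psi, C, theta=2.0):
    # Thresholded layer map: Heaviside of weighted input minus theta.
    return (psi @ C.T - theta >= 0).astype(int)

# Rows of Psi_0: all 16 input states in binary-counting order (equation 4.2).
psi0 = np.array([[(k >> (3 - b)) & 1 for b in range(4)] for k in range(16)])
psi1 = propagate(psi0, C)
psi2 = propagate(psi1, C)
psi3 = propagate(psi2, C)
assert np.array_equal(psi3, psi1)      # odd layers repeat Psi_1 ...
assert not np.array_equal(psi2, psi1)  # ... so the period is genuinely 2
```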
In this case, the components of $\mathbf{p}_{in}$ are simple binomial probabilities. We have the following for $\mathbf{p}_{in}$:

$$\mathbf{p}_{in} = \begin{pmatrix}
(1 - P_{10})(1 - P_{20})(1 - P_{30})(1 - P_{40}) \\
(1 - P_{10})(1 - P_{20})(1 - P_{30})P_{40} \\
(1 - P_{10})(1 - P_{20})P_{30}(1 - P_{40}) \\
(1 - P_{10})(1 - P_{20})P_{30}P_{40} \\
(1 - P_{10})P_{20}(1 - P_{30})(1 - P_{40}) \\
(1 - P_{10})P_{20}(1 - P_{30})P_{40} \\
(1 - P_{10})P_{20}P_{30}(1 - P_{40}) \\
(1 - P_{10})P_{20}P_{30}P_{40} \\
P_{10}(1 - P_{20})(1 - P_{30})(1 - P_{40}) \\
P_{10}(1 - P_{20})(1 - P_{30})P_{40} \\
P_{10}(1 - P_{20})P_{30}(1 - P_{40}) \\
P_{10}(1 - P_{20})P_{30}P_{40} \\
P_{10}P_{20}(1 - P_{30})(1 - P_{40}) \\
P_{10}P_{20}(1 - P_{30})P_{40} \\
P_{10}P_{20}P_{30}(1 - P_{40}) \\
P_{10}P_{20}P_{30}P_{40}
\end{pmatrix}. \tag{4.3}$$
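For uniform, uncorrelated inputs at p = 0.5, equation 4.3 and the layer recursion confirm that rates propagate without attenuation, matching the flat curves discussed below. A sketch (our names, not the paper's code):

```python
import numpy as np
from itertools import product

# Our sketch of the uniform, uncorrelated case: build p_in as in equation 4.3
# with all P_i0 = p, then iterate the layer map for the cyclic network of
# equation 4.1 with theta = 2.
C = np.array([[1, 1, 0, 1],
              [1, 1, 1, 0],
              [0, 1, 1, 1],
              [1, 0, 1, 1]], dtype=float)
p = 0.5
psi0 = np.array(list(product([0, 1], repeat=4)))       # rows of Psi_0, eq. 4.2
p_in = np.array([np.prod([p if s else 1 - p for s in state]) for state in psi0])
assert abs(p_in.sum() - 1.0) < 1e-12                   # a probability vector

psi = psi0
for layer in range(1, 4):
    psi = (psi @ C.T - 2 >= 0).astype(int)
    # Each neuron sees 3 inputs and needs >= 2 spikes; at p = 0.5 that
    # probability is 3/8 + 1/8 = 1/2, and the rate stays 1/2 in every layer.
    assert np.allclose(psi.T @ p_in, 0.5)
```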
Figure 2: Effect of uniform, uncorrelated input firing rate. (A) Mean firing rates for the network. (B) Nearest-neighbor cross-correlations for the network. (C) Cross-correlation matrices for each layer.
To alleviate the notation, we replaced p(i, 0) by $P_{i0}$ for i ∈ {1, . . . , 4}, that is, the mean firing probabilities for the ith input from the zeroth layer. The values of $P_{i0}$ can be freely chosen depending on the question under study (subject to the conditions $0 \le P_{i0} \le 1$ since they are probabilities). For example, we may choose uniform inputs, in which case all $P_{i0}$ would be equal. Alternatively, we may study how spatially localized high-rate inputs are dissipated in higher levels of the network. In this case, we set some of the $P_{i0}$ to high values relative to the remaining $P_{i0}$. Both cases are illustrated in the examples. The results for the case of uncorrelated inputs of uniform rate are shown in Figure 2. Figure 2A shows the mean rates for the input, even, and odd layers. Note that rates get propagated to arbitrarily high levels of the feedforward network. Besides the initial attenuation of rates at layer 1, there is no further attenuation. Figure 2B shows nearest-neighbor cross-correlations for the network, and Figure 2C shows cross-correlation matrices for each layer. As expected, because we have uncorrelated inputs of uniform rate, nothing much interesting happens with the cross-correlation at higher layers, except for a uniform increase in all cross-correlations, which is expected due to shared connections. What is more interesting is that cross-correlation is not monotonically increasing as a function of network layer; rather, it remains constant from layer 1 to all higher layers. Figure 3 shows how rates and cross-correlations in higher layers vary as a function of input rates. In Figure 3A, we see that rates in higher layers are approximately linearly related to input rates, whereas in Figure 3B, we see that cross-correlations in higher layers depend on input rates in an inverted-U-shaped manner, with maximum cross-correlations occurring for input rates of 0.5.
892
S. Mikula and E. Niebur
Figure 3: (A) Higher-layer output rate versus input rate curve. (B) higher-layer cross-correlation versus input rate curve for the case of uncorrelated inputs of uniform rate.
We now briefly consider the case of spatially inhomogeneous inputs. As an example, let the mean rates of the second and third inputs into our feedforward network, P20 and P30 or, in the notation of equation 3.5, p(2, 0) and p(3, 0), be equal, that is, p23 = P20 = P30, where this equation defines p23. Let us further assume that these inputs are correlated with a correlation coefficient q23, with 0 < q23 ≤ 1 (the case q23 = 0 is the case of spatially inhomogeneous, uncorrelated input, which is of less interest here). To simplify the upcoming expressions, we make the following definitions:

q = √q23, (4.4)
q̄ = 1 − q. (4.5)
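The mixing construction implied by equations 4.4 and 4.5 can be made concrete in a few lines of code. The sketch below (our illustration, not the authors' code) draws two binary inputs that each copy a shared Bernoulli(p23) source with probability q = √q23 and are otherwise independent Bernoulli(p23) draws; computing the resulting joint distribution exactly confirms that each input has rate p23 and that the pair's correlation coefficient is q² = q23.

```python
# Sketch (not the authors' code): two correlated binary inputs with common
# rate p23 and correlation coefficient q23, via the common-source mixing
# implied by q = sqrt(q23), qbar = 1 - q: each input copies a shared
# Bernoulli(p23) source with probability q, else is an independent draw.
import math
import random

def correlated_pair(p23, q23, rng):
    q = math.sqrt(q23)
    s = rng.random() < p23                      # shared source
    x2 = s if rng.random() < q else (rng.random() < p23)
    x3 = s if rng.random() < q else (rng.random() < p23)
    return int(x2), int(x3)

def joint_probabilities(p23, q23):
    """Exact joint distribution P(x2, x3) under the mixing construction."""
    q = math.sqrt(q23)
    qbar = 1.0 - q
    p1_s = {1: q + qbar * p23, 0: qbar * p23}   # P(xi = 1 | source state s)
    pj = {}
    for x2 in (0, 1):
        for x3 in (0, 1):
            pj[(x2, x3)] = sum(
                (p23 if s else 1 - p23)
                * (p1_s[s] if x2 else 1 - p1_s[s])
                * (p1_s[s] if x3 else 1 - p1_s[s])
                for s in (0, 1))
    return pj

p23, q23 = 0.3, 0.4
pj = joint_probabilities(p23, q23)
mean = pj[(1, 0)] + pj[(1, 1)]                  # marginal rate of each input
cov = pj[(1, 1)] - mean * mean
corr = cov / (mean * (1 - mean))
print(mean, corr)                               # rate p23, correlation q23

rng = random.Random(1)
print(correlated_pair(p23, q23, rng))           # one sampled input pair
```

The four joint probabilities returned by `joint_probabilities` correspond to the pairwise factors that enter the components of pin in equation 4.6.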
Rate and Synchrony in Feedforward Networks
893
Figure 4: Effect of spatially localized high input rates. (A) Mean firing rates for the network. (B) Nearest-neighbor cross-correlations for the network. (C) Cross-correlation matrices for each layer.
The equivalent of equation 4.3 for pin can now be written compactly (see the appendix to Mikula and Niebur, 2003b, for detailed derivations). With the shorthand

A = (1 − p23)((1 − p23) + p23 q)² + p23((1 − p23)q̄)²,
B = (1 − p23)(p23 q̄)((1 − p23) + p23 q) + p23(p23 + (1 − p23)q)((1 − p23)q̄),
C = (1 − p23)(p23 q̄)² + p23(p23 + (1 − p23)q)²

for the joint probabilities that inputs 2 and 3 both stay silent (A), that exactly one of them fires in a given order (B, the same for either ordering), and that both fire (C), the input distribution over the sixteen states (x1, x2, x3, x4), ordered as the binary numbers x1x2x3x4 = 0000, 0001, . . . , 1111, is

pin = ( (1 − P10)A(1 − P40), (1 − P10)A P40,
        (1 − P10)B(1 − P40), (1 − P10)B P40,
        (1 − P10)B(1 − P40), (1 − P10)B P40,
        (1 − P10)C(1 − P40), (1 − P10)C P40,
        P10 A(1 − P40), P10 A P40,
        P10 B(1 − P40), P10 B P40,
        P10 B(1 − P40), P10 B P40,
        P10 C(1 − P40), P10 C P40 )ᵀ. (4.6)

The results for the case of uncorrelated, spatially localized high input rates are shown in Figure 4. Figure 4A shows the mean rates for the input, even, and odd layers. Notably, spatially localized high input rates get propagated to arbitrarily high levels of the feedforward network. Besides the initial attenuation of firing rates at layer 1, there is no further attenuation. Figure 4B shows nearest-neighbor cross-correlations for the network, and
Figure 5: Effect of halving the input rates of Figure 4. (A) Mean firing rates for the network. (B) Nearest-neighbor cross-correlations for the network. (C) Cross-correlation matrices for each layer.
Figure 4C shows cross-correlation matrices for each layer. Due to the convergence of inputs to higher layers, uncorrelated, spatially localized high input rates appear as cross-correlations in higher layers (see Figure 4B). That is, rate information not only gets propagated across the network; it also gets propagated as cross-correlation information. The case shown in Figure 5 is identical to that in Figure 4 except that now the input rates are halved. Note that qualitatively, the results are identical to those of Figure 4, which means that even for relatively low spatially localized input rates, with p(2, 0) = p(3, 0) = 0.2, rate information still gets propagated to arbitrarily high levels of the feedforward network. In addition, this rate information also appears as cross-correlation information in higher layers. Figure 6 shows the case of spatially localized cross-correlated inputs with uniform input rates. For this case, inputs 2 and 3 are cross-correlated to a value of q23 = 0.4, and as can be seen in Figure 6B, this spatially localized cross-correlation gets propagated to arbitrarily high levels of the feedforward network. In Figure 6A, we see that the cross-correlation between inputs 2 and 3 is also represented in higher layers as higher rates. Note the appearance of increased cross-correlation between units 1 and 4 in layers 1 and higher in this and subsequent figures. It is due to the periodic boundary conditions and the specific connection scheme used: input unit 2 projects to layer-1 unit 1, input unit 3 projects to layer-1 unit 4, and units 2 and 3 are partially correlated; hence units 1 and 4 are also partially correlated, but with a lower correlation coefficient than units 2 and 3, which both receive input from both correlated input units rather than from only one.
Figure 6: Effect of spatially localized high cross-correlation (q = .4) between inputs 2 and 3. (A) Mean firing rates for the network. (B) Nearest-neighbor cross-correlations for the network. (C) Cross-correlation matrices for each layer.
Figure 7: Effect of spatially localized high cross-correlation (q = .8) between inputs 2 and 3. (A) Mean firing rates for the network. (B) Nearest-neighbor cross-correlations for the network. (C) Cross-correlation matrices for each layer.
The case shown in Figure 7 is identical to that in Figure 6 except that now the cross-correlation between inputs 2 and 3 is doubled to 0.8. This does not lead to much of an increase in higher-layer cross-correlations beyond what was seen in Figure 6B. Interestingly, in Figure 7A, we see that higher-layer rates are almost double what they were in Figure 6A. What this means is that doubling the input cross-correlation is reflected in an almost doubling
Figure 8: Effect of spatially localized high cross-correlation (q = .6) and high firing rate between inputs 2 and 3. (A) Mean firing rates for the network. (B) Nearest-neighbor cross-correlations for the network. (C) Cross-correlation matrices for each layer.
of the higher-layer rates, as opposed to the relatively smaller increase in higher-layer cross-correlations. Finally, in Figure 8, we see the case of spatially localized high cross-correlation (q = 0.6) and high firing rate between inputs 2 and 3. Figure 8A shows that rates are propagated with little attenuation, in contrast to the cases in Figures 4 and 5, where the high input rates were not cross-correlated. We see in Figure 8B that cross-correlations also get propagated across our network, though we also see relatively high cross-correlations between inputs 1 and 4 at higher layers, an effect likely attributable to the cyclic boundary conditions of our feedforward network's connectivity.
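The idea behind equations 3.4 and 3.5, propagating the full joint state distribution rather than sampling spike trains, can be sketched compactly. In the toy below, the ring wiring (unit i of a layer listens to units i and i+1 mod 4 of the previous layer) and the coincidence threshold of 2 are hypothetical stand-ins for the paper's connectivity, chosen only to illustrate the method.

```python
# A minimal sketch (not the authors' code) of exact distribution
# propagation through one layer of coincidence detectors: carry the joint
# probability over all 2^4 binary layer states instead of sampling. The
# ring wiring and threshold below are illustrative assumptions.
from itertools import product

N = 4  # units per layer

def propagate(dist, threshold=2):
    """Push a distribution over 4-bit layer states through one layer of
    coincidence detectors (a detector fires iff enough inputs fire)."""
    out = {}
    for state, p in dist.items():
        nxt = tuple(int(state[i] + state[(i + 1) % N] >= threshold)
                    for i in range(N))
        out[nxt] = out.get(nxt, 0.0) + p
    return out

def rates(dist):
    """Mean firing probability of each unit under a state distribution."""
    return [sum(p * s[i] for s, p in dist.items()) for i in range(N)]

# Uniform, uncorrelated inputs at rate p: a product (independent) distribution.
p = 0.5
dist0 = {s: p ** sum(s) * (1 - p) ** (N - sum(s))
         for s in product((0, 1), repeat=N)}
layer1 = propagate(dist0)
print(rates(dist0), rates(layer1))  # input rate 0.5 -> layer-1 rate 0.25
```

Even though the inputs here are independent, neighboring detectors share an input unit, so layer 1's distribution is no longer a product distribution; a uniform rise in nearest-neighbor cross-correlation of the kind seen in Figure 2B can be read off such a joint distribution directly.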
5 More Complex Networks The simple example of section 4 is useful for making explicit the details involved with the application of equations 3.4, 3.5, and 3.9. In this section, we look at a larger feedforward network, which has the advantage of being more realistic than the previous one, but with the trade-off that many of the details cannot be made explicit due to space limitations. Our feedforward network is a slightly scaled-down version of the one introduced by Litvak, Sompolinsky, Segev, and Abeles (2003). We consider a 100-layer feedforward network containing exactly 1000 coincidence detectors per layer—500 excitatory and 500 inhibitory. The connectivity matrix is
different between each layer³ and is randomly constructed with precisely balanced excitation and inhibition, such that each neuron receives exactly 50 excitatory inputs and 50 inhibitory inputs. Further, pairs of neurons in each layer share 5% of their inputs from the preceding layer. Each excitatory connection carries a weight of +1, and each inhibitory connection has a weight of −1. We now define the inputs and the input states. For simplicity, we assume just two different input states. One state is defined by zero activity in all neurons of the input layer.⁴ In the other state, 10% of the inputs provide perfectly correlated input spike trains (i.e., q = 1), with a spiking probability within each spike train of pin. We designate these inputs as active inputs; since the total number of inputs is 1000, the number of active inputs is 100. All other input neurons (i.e., the other 90%) have zero spiking probability. Our requirement that the active inputs be perfectly correlated greatly simplifies what would otherwise be a difficult calculation. For example, if the active inputs were completely uncorrelated, then we would have 2¹⁰⁰ different input states to deal with, a clear impossibility. Assuming perfectly correlated active inputs is a simple and effective way to dramatically reduce the number of input states. The results of this model using 100 different randomly generated connectivity matrices are shown in Figure 9 for average input spiking probabilities of .02 and .04 and using a threshold of 3. We show the results for only the first 20 layers of the network because higher layers are essentially the same, as would be expected from Figure 9 (we studied up to layer 100). Spiking probabilities increase rapidly and level off at about layer 2, remaining nearly constant for all higher layers.
Because of the perfect correlation between all active input neurons (q = 1), the mean firing rates at the higher layers of our network depend linearly on the initial input rates pin. For example, in Figure 9A, pin = .02 results in a mean output spiking probability of .055 at higher layers, whereas in Figure 9B, pin = .04 results in a mean output spiking probability of .11. Note the large standard deviations, indicating the richness and complexity of behavior for different connectivity matrices. The mean cross-correlations are propagated in a manner similar to the mean rates shown in Figure 9: since all active neurons in our network are perfectly correlated with each other, the mean cross-correlation must be a function of the number of active neurons per layer, as is true for the mean rate per layer. Thus, if the fraction of active neurons in a given layer is r, the cross-correlation in that layer is r². The propagating rates and
³ The connectivity used by Litvak et al. (2003) is apparently repeating (i.e., identical between each pair of subsequent layers). We studied this case as well and found results very similar to those with connectivity matrices that vary between each pair of layers and that are shown here.
⁴ For all nonzero thresholds, the same results would be achieved with all neurons active in the input layer.
Figure 9: Mean layer output rates as a function of network layer. (A) Mean input rate of 0.02. (B) Mean input rate of 0.04.
cross-correlations show no sign of decrement at higher layers (layer 100 looks the same as layer 3 in terms of rates and cross-correlations). 6 Discussion This article extends our previous analytical results (Mikula & Niebur, 2003a, 2003b, 2004) for an individual coincidence detector to a feedforward network of coincidence detectors receiving steady-state cross-correlated binomial inputs at the zeroth layer. Our derivation is thus valid only for steady-state neuronal responses and does not tell us anything about transient responses, which also appear to be important. For example, Diesmann, Gewaltig, and Aertsen (1999) studied temporal structures of spike trains in feedforward networks of spiking neurons. They showed that in different parameter regimes, synchronous packets of spikes can either travel from layer to layer without loss of coherency or, alternatively, disperse within a few layers. The objectives of their study are complementary to ours, as theirs focuses on transient activity while ours is concerned with constant firing rates and correlations. Likewise, the approach taken by Diesmann et al. is complementary to ours insofar as they use a more complex single-neuron model, a leaky integrate-and-fire neuron rather than the simpler coincidence detector, but theirs requires a numerical solution while ours is analytical and recursive and, furthermore, does not require introducing any approximations. In spite of the obvious limitations of feedforward networks, they are of considerable interest for theoretical work. Likewise, coincidence detectors are an old and well-tested, if simple, model for the units of neural networks.
However, although computational approaches allow the use of much more realistic neuron models (an extreme case perhaps being a recent study by Reyes, 2003, in which replicas of an actual biological neuron were used as the underlying units), the use of the much simpler coincidence detectors as building blocks allowed us to develop a closed form for the mean rate and pairwise cross-correlation of each neuron and neuron pair, respectively. The details of our results apply only to the specific networks that we have studied here as examples to illustrate the general method we introduced. This method should, however, allow the analysis of other feedforward networks of coincidence detectors that are relevant for a specific problem. We emphasize that the methods described in section 3.3 permit one to find solutions for networks with much more realistic connectivity, including much higher numbers of synaptic connections. One way in which our approach may prove useful is in the elucidation of the neural code, which involves addressing the question of how biological neural networks represent and transform information in their patterns of activity (Perkel & Bullock, 1968). Most research in the primate cortex has focused on two coding schemes, rate and temporal coding. In rate
coding, information is coded purely in terms of average spiking activity, and variability of neural discharges is regarded as a form of noise. In temporal coding, neurons make use of the temporal structure of spike trains. Much experimental, computational, and theoretical work has been devoted to this question, with evidence existing in favor of both rate codes and temporal codes. One of the tools for studying these competing models is the numerical simulation of feedforward networks. Unfortunately, two recent simulation studies involving biologically inspired feedforward networks yielded conflicting results, in which the group conducting one study (Litvak et al., 2003) could not reproduce previously published numerical results of another group (Shadlen & Newsome, 1998). As we have shown in section 5, our approach can contribute to the resolution of this dispute, bearing in mind that the basic units employed in those studies differ from the coincidence detectors we employed. The availability of an analytical solution, as provided by our approach, precludes similar disputes. In a network of coincidence detectors, we find aspects of both rate coding and temporal coding. At least in the simple network studied in section 4, rate information is propagated up to arbitrarily high layers, even when spatially localized rate information is absent from the inputs and only spatially localized cross-correlation information is present. Also, in the case of uncorrelated inputs of uniform firing rates shown in Figure 3A, we find that the output firing rates of higher layers are almost linearly related to the input firing rates. Both findings support rate coding as at least a viable possibility, as proposed by Shadlen and Newsome (1998).
The latter study may overemphasize the importance of rate coding, however: Salinas and Sejnowski (2000) pointed out that the influence of cross-correlations is relatively low in the Shadlen and Newsome study. The reason is that the latter study employed exactly equal correlation coefficients between inhibitory and excitatory populations (ρEE = ρII = ρEI in equation 20 of Salinas and Sejnowski, 2000), and the variance in the output layer becomes small since the contributions to the variance resulting from correlations between these populations cancel out (ρEE + ρII − 2ρEI = 0; Salinas & Sejnowski, 2000). Of course, the model described in section 5 works in the same parameter regime. On the other hand, we find that rate coding is not the only possibility for propagating information across the layers of the network. Our results show that cross-correlation information is propagated across the feedforward network to arbitrarily high layers, even when spatially localized cross-correlation information is absent from the inputs and only spatially localized rate information is present. Our results thus support the viability of temporal codes, as proposed by Litvak et al. (2003). For the types of networks we studied in sections 4 and 5, we conclude that both rate and cross-correlation information is propagated across
feedforward networks and, furthermore, that they will interact and tend to occur together, such that when only rate or only cross-correlation information is present in the inputs, this information will appear in higher layers in the form of both rate and cross-correlation information. As such, rate and cross-correlation codes may well be intertwined and interdependent. An example of such a mixed code in complex nervous systems might be the representation of selective attention in the primate cortex (Niebur, 2002). Selective attention has been shown in electrophysiological studies to be correlated with both rate changes and changes in the fine temporal structure (on the order of milliseconds or tens of milliseconds) of neural activity (Moran & Desimone, 1985; Steinmetz et al., 2000; Fries, Reynolds, Rorie, & Desimone, 2001; Salinas & Sejnowski, 2001; Niebur, Hsiao, & Johnson, 2002). It will take more experimental as well as theoretical work to come to a conclusive answer as to which of the proposed neural coding schemes are used by the different nervous systems. Acknowledgments This work was supported by NIH/NINDS R01-NS43188-01A1 and NIH/NEI R01-EY16281. References Diesmann, M., Gewaltig, M. O., & Aertsen, A. (1999). Stable propagation of synchronous spiking in cortical neural networks. Nature, 402, 529–533. Fries, P., Reynolds, J. H., Rorie, A. E., & Desimone, R. (2001). Modulation of oscillatory neuronal synchronization by selective visual attention. Science, 291, 1560–1563. Litvak, V., Sompolinsky, H., Segev, I., & Abeles, M. (2003). On the transmission of rate code in long feedforward networks with excitatory-inhibitory balance. J. Neurosci., 23, 3006–3015. Mikula, S., & Niebur, E. (2003a). The effects of input rate and synchrony on a coincidence detector: Analytical solution. Neural Computation, 15, 539–547. Mikula, S., & Niebur, E. (2003b). Synaptic depression leads to nonmonotonic frequency dependence in the coincidence detector.
Neural Computation, 15(10), 2339–2358. Mikula, S., & Niebur, E. (2004). Correlated inhibitory and excitatory inputs to the coincidence detector: Analytical solution. IEEE Transactions on Neural Networks, 15(5), 957–962. Moran, J., & Desimone, R. (1985). Selective attention gates visual processing in the extrastriate cortex. Science, 229, 782–784. Niebur, E. (2002). Electrophysiological correlates of synchronous neural activity and attention: A short review. Biosystems, 67(1–3), 157–166. Niebur, E., Hsiao, S. S., & Johnson, K. O. (2002). Synchrony: A neuronal mechanism for attentional selection? Current Opinion in Neurobiology, 12(2), 190–194. Perkel, D., & Bullock, T. H. (1968). Neural coding. Neurosci. Res. Program Bull., 6, 221.
Reyes, A. D. (2003). Synchrony-dependent propagation of firing rate in iteratively constructed networks in vitro. Nat. Neurosci., 6, 593–599. Salinas, E., & Sejnowski, T. J. (2000). Impact of correlated synaptic input on output firing rate and variability in simple neuronal models. J. Neurosci., 20(16), 6193–6209. Salinas, E., & Sejnowski, T. J. (2001). Correlated neuronal activity and the flow of neural information. Nature Reviews Neuroscience, 2, 539–550. Shadlen, M. N., & Newsome, W. T. (1998). The variable discharge of cortical neurons: Implications for connectivity, computation, and information coding. J. Neurosci., 18, 3870–3896. Steinmetz, P. N., Roy, A., Fitzgerald, P., Hsiao, S. S., Johnson, K. O., & Niebur, E. (2000). Attention modulates synchronized neuronal firing in primate somatosensory cortex. Nature, 404, 187–190.
Received December 18, 2003; accepted August 10, 2004.
LETTER
Communicated by Alain Destexhe
Independent Variable Time-Step Integration of Individual Neurons for Network Simulations William W. Lytton
[email protected] Department of Physiology, Pharmacology, and Neurology, State University of New York, Downstate, Brooklyn, NY 11203-2098, U.S.A.
Michael L. Hines
[email protected] Department of Computer Science, Yale University, New Haven, CT 06520-8001, U.S.A.
Realistic neural networks involve the coexistence of stiff, coupled, continuous differential equations arising from the integration of individual neurons with the discrete events with delays used for modeling synaptic connections. We present here an integration method, the local variable time-step method (lvardt), that uses separate variable-step integrators for individual neurons in the network. Cells that are undergoing excitation tend to have small time steps, and cells that are at rest with little synaptic input tend to have large time steps. A synaptic input to a cell causes reinitialization of only that cell's integrator without affecting the integration of other cells. We illustrate the use of lvardt on three models: a worst-case synchronizing mutual-inhibition model, a best-case synfire chain model, and a more realistic thalamocortical network model. 1 Introduction We have previously demonstrated some advantages of the global variable time-step integrators CVODE and CVODES (Cohen & Hindmarsh, 1994) over traditional fixed-step methods such as Euler, Runge-Kutta, or Crank-Nicolson for simulating single cells (Hines & Carnevale, 2001). The major advantage derives from the fact that neuron activity features spikes, which require short time steps, followed by interspike intervals, which allow long time steps. The associated speed-up in the single-cell realm does not extend to the simulation of networks, however. A major reason is that the global time step is governed by the fastest-changing state variable. In an active network, some cell is usually firing, requiring a small time step for the entire network. A related reason is that synaptic events generally cause a discontinuity in a parameter or state variable. This requires a reinitialization, as the integrator must start again with a new initial-condition problem. In a Neural Computation 17, 903–921 (2005)
© 2005 Massachusetts Institute of Technology
904
W. Lytton and M. Hines
network simulation, this means reinitialization of the entire network due to a single state-variable change in one cell. With reinitialization, the integrator works without any past history; hence, the first step can be only first-order accurate and must be very short. We demonstrate here that the poor network performance of the global variable time-step method can be overcome by giving each neuron in the system an independent variable time-step integrator. Thus, a single cell's individual integrator uses a large dt when that neuron is quiescent or changing slowly, even though activity in other neurons of the network may cause those other integrators to proceed forward with many small steps. When a cell receives a synaptic event, only that cell's integrator has to be reinitialized. The critical problem in the implementation of the local time-step method (lvardt) is to ensure that when an event arrives at a cell at time te, all the state variables of the receiving cell are also at time te. This requires coordinating the individual integrators so that one cell does not get so far ahead that it cannot receive a synaptic signal from another cell. To challenge lvardt maximally, we used the mutual-inhibition model, a model that fully synchronizes. In this case, the expected superiority of multiple integrators should be negated by the fact that all integrators are doing the same thing at the same time. At the other extreme, we show that synfire chains enjoy a dramatic performance improvement under lvardt. We generalize the synfire chain in a simulation of multiple delay-line rings to help understand the computational complexity of lvardt. Additionally, we demonstrate the performance improvement obtained with lvardt on a more biologically detailed thalamocortical network. 2 Methods The techniques and simulations described here are implemented in the NEURON simulator (http://www.neuron.yale.edu).
The simulator provides several global time-step integration schemes. For global fixed-step methods (Hines, 1984; Hines & Carnevale, 1997), one can select either a first-order backward Euler integration scheme, which is numerically stable for all reasonable neural models, or the second-order Crank-Nicolson method. There are two global variable-step methods, both part of the Livermore SUNDIALS package, CVODES and IDA (http://www.llnl.gov/CASC/sundials/; Hindmarsh & Serban, 2002). The current implementation of lvardt uses CVODES, which solves ordinary differential equations (ODEs) of the form

dy/dt = f(y, t).
(2.1)
Local Variable Time-Step Method
905
CVODES, like many other variable-step integrators, has an important property for the present usage: it allows rapid interpolation within the interval of the just-executed time step. With a fully implicit fixed-step method, accuracy is proportional to dt. With the variable-step method, an absolute error tolerance (atol) is used to bound the error. In NEURON, absolute error tolerance is used in preference to proportionate error in order to avoid infinitesimal error tolerances near zero for state variables, notably voltage, that approach or pass zero. We replicated a fully connected homogeneous inhibitory interneuron network that shows rapid synchronization through mutual inhibition (Wang & Buzsaki, 1996). As a shorthand, we will call this the mutual-inhibition model. Variations on this model have been widely studied (Traub, Jefferys, & Whittington, 1999). The basic simulation used parameters identical to those in the original article (Wang & Buzsaki, 1996). Porting this simulation to lvardt required minor alterations detailed in section 3. We replicated the single-cell model from the network, demonstrating a current-frequency response curve identical to that reported in the article (Wang & Buzsaki, 1996). We then used both NEURON's fixed-step and global variable time-step methods to demonstrate comparable activity in the network simulation (precise activity depends on randomized initial conditions). Both the original simulation and the lvardt version are available in runnable open-source form at the ModelDB web site (http://senselab.med.yale.edu/senselab/ModelDB; Hines, Morse, Migliore, Carnevale, & Shepherd, 2004). We also present a synfire simulation, available as an example in the NEURON simulation package, and a thalamocortical simulation based on published sources (Bazhenov, Timofeev, Steriade, & Sejnowski, 1998). Simulations were run on 2.40 and 2.80 GHz Intel CPUs under Linux and Solaris operating systems.
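The role of atol can be illustrated with a toy variable-step integrator. The sketch below is an illustration, not NEURON's CVODES: it uses the simplest embedded pair, Euler inside Heun, on the passive relaxation dv/dt = -v. Each step is accepted only if the local error estimate is below atol, and the step size is then rescaled from that estimate; the model, tolerance, and safety factor are all illustrative choices.

```python
# Sketch (not NEURON's CVODES): a minimal embedded Heun/Euler pair with an
# absolute error tolerance, showing why a variable-step integrator takes
# short steps when a state changes quickly and long steps as it relaxes
# toward rest. Model and tolerances are illustrative assumptions.
import math

def heun_euler(f, v0, t_end, atol=1e-6, h0=0.01):
    t, v, h = 0.0, v0, h0
    accepted = []                                 # (t, h) per accepted step
    while t < t_end:
        h = min(h, t_end - t)
        v_euler = v + h * f(v)                    # 1st-order estimate
        v_heun = v + 0.5 * h * (f(v) + f(v_euler))  # 2nd-order estimate
        err = abs(v_heun - v_euler)               # local error estimate
        if err <= atol:                           # accept the step
            t, v = t + h, v_heun
            accepted.append((t, h))
        # rescale the step from the error estimate (0.9 = safety factor)
        h *= 0.9 * math.sqrt(atol / max(err, 1e-30))
    return v, accepted

v_end, steps = heun_euler(lambda v: -v, v0=1.0, t_end=5.0)
print(len(steps), steps[0][1], max(h for _, h in steps))
```

As the "cell" relaxes toward rest, the accepted step size grows by roughly an order of magnitude, which is exactly why quiescent neurons are cheap for a variable-step integrator and why a network that always contains one spiking cell defeats a single global integrator.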
3 Results The global variable time-step method has advantages in any simulation where periods of intense activity alternate with periods where state variables remain relatively constant. In general, this situation is more likely to occur during simulation of a single cell than of a network. The larger the network simulation, the greater the likelihood that a neuron somewhere in the network is showing spike activity. With a global variable time-step method, this activity slows the entire simulation, which must tend to that one neuron's integration needs. Neurons that are not active will be integrated with an unnecessarily short time step. These considerations suggested the development of the local variable time-step method (lvardt) to integrate a network piecemeal, providing short time-step integration for active neurons and long time-step integration
Figure 1: Voltage trajectories for two cells are shown for the global variable-step method (A) and lvardt (B). dt (the interval between marks) varies together at top and separately at bottom. The network is shown in the schematic: the stimulator cell (filled circle) fires at time 0 and drives the bottom cell weakly with a delay of 0.1 ms (+ on the graphs) and the right cell strongly with a delay of 1 ms (vertical lines on the graphs). The right cell drives the bottom cell moderately with a delay of 0.1 ms. The two cells have Hodgkin-Huxley dynamics. The right cell's spike threshold is −10 mV, and the second-order-correct threshold event is marked with a dash on the bottom panel. The threshold event does not necessarily lie on an integration time-step boundary.
for inactive neurons, while maintaining consistent integration accuracy throughout. Neurons that fire at different times get their state variables calculated at different times and, more important, different intervals (see Figure 1). Using the global method (top graph), the two cells have their trajectories calculated at the same times. With lvardt (bottom graph), integration points are independent. This is most obvious at the beginning of the simulation, when the cell that fires first (at right on the schematic;
Local Variable Time-Step Method
907
vertical lines on graph) has only 2 integration points, while the other cell (bottom; crosses) has 12 integration points. Where the trajectories cross, the first-spike cell's integrator is called frequently, while the second-spike cell's integrator is using a longer dt. At the peak of the first spike, both cells are being updated frequently, since the second-spike cell is coincidentally approaching threshold. At the peak of the second spike, however, the first-spike cell is in the falling phase of its spike and has far fewer integration points. State variables change quickly near threshold and at the peak of the action potential. At these times, the integrators use short time steps to follow the trajectory through state space accurately. At other times, much larger dt's can be used to achieve the same accuracy for the more slowly varying state variables. With lvardt, a neuron that is inactive does not waste CPU time. Overall performance evaluation for this simple simulation shows that the global method integrates its 8 state variables (the 4 Hodgkin-Huxley variables m, h, n, V for each cell) 177 times, for a total of 1416 state-variable integrations. The integrators in the lvardt example integrate 4 state variables 138 times (first-spike cell) and 115 times (second-spike cell) for a total of 4 · (138 + 115) = 1012 state-variable integrations. This suggests the possibility of a 40% speed-up. The lvardt method creates a separate CVODES integrator for each of the Nc cells in the network. Although there are many more integrators, each integrator is more compact, since it has to handle only the state variables belonging to its particular neuron. Whether using one or Nc integrators, the total number of state variables remains the same.
Although the relative performance gain with lvardt expected from the function-call statistics of Figure 1 is 40%, there is constant overhead for each step associated with each integrator (total overhead proportional to Nc) and overhead required to determine which cell is to be integrated next, proportional to log(Nc) for each step (total overhead proportional to Nc · log(Nc)). We discuss queue overhead in the section "Estimating simulation complexity," but except for very large numbers of cells, it reduces performance only slightly. Because the system is now being calculated forward in time by multiple independent integrators, an integration coordinator is used to maintain the overall coherence of the integration. If the various neurons in the network are not connected, as in the case of testing parameter variation over a set of neurons, such coordination is not needed. In a network, however, the integration coordinator is vital to permit synaptic signals to be communicated at appropriate times. 3.1 Handling Events. Handling events with lvardt requires that when an event arrives at a cell, all of the state variables of that cell are at their appropriate values for that time. This is accomplished with three standard variable-step-integrator operations: single-step integration, interpolation, and reinitialization. Using these operations, we ensure that incoming events
W. Lytton and M. Hines
are always within reach of the receiving cell's integrator and that individual integrators do not move too far ahead of the network as a whole. The individual integrators maintain state and derivative information on the interval of the most recent time step. That is, each neuron's integrator can access states over an interval between the beginning ta and the end tb of a time step: tb^i − ta^i = dt^i for the ith neuron. This gives each individual integrator the ability to provide fast, high-order interpolation to a state at any time within the interpolation range defined by the two bounding times. The integration coordinator ensures that there is always overlap in these interpolation ranges: ta^i ≤ tb^j ∀ i, j. We define tb/e min as the time of the earliest event te or least advanced integration bound tb. To guarantee interpolation range overlap, the integration coordinator either handles the earliest event or single-steps the least advanced integrator, whichever comes first. In this way, no integrator's tb, and no event, ever falls behind any ta. Figure 2 illustrates most of these operations in the context of a sample integration for six cells.

Note that an input event normally requires a three-step sequence: (1) interpolation to the event time, (2) handling the event (or all the outstanding events to the cell with that delivery time), and (3) reinitializing the integrator. This full three-step sequence is required only for events that alter the course of the integration. By contrast, an event that only records the value of a state requires only the interpolation step, since the recorded cell's integration can then continue from tb rather than from the interpolation point. Similarly, threshold detection does not affect the integrator.
However, it must be noted that threshold detection remains tentative until the threshold time is reached by tb/e min, because an event received by the cell in the interim may reinitialize the cell to a time prior to the tentatively calculated threshold time.

3.2 Porting the Mutual-Inhibition Model to lvardt. The mutual-inhibition model is an all-inhibitory network with full connectivity. Each neuron is a single compartment (point neuron) with spike-generating sodium and potassium voltage-dependent currents of the Hodgkin-Huxley type. The dynamics of the mutual-inhibition model permits initially asynchronous firing to coalesce into synchronous firing within a few spikes. In the original versions of the mutual-inhibition model (Wang & Buzsaki, 1996), the entire network is implemented as one large, continuous set of linked ODEs. This is done by making the opening rate (kC→O, in /ms) of the postsynaptic conductance a continuous and continuously differentiable Boltzmann function of presynaptic voltage:

kC→O = 12.0/(1 + exp(−Vpre/2)).
(3.2)
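Equation 3.2 can be checked numerically: the rate is essentially zero at resting potentials and saturates near its 12/ms maximum once the presynaptic voltage rises above roughly 0 mV (a minimal sketch; the voltage values are illustrative):

```python
import math

def k_open(v_pre):
    """Opening rate (per ms) of the postsynaptic conductance, equation 3.2."""
    return 12.0 / (1.0 + math.exp(-v_pre / 2.0))

rest = k_open(-65.0)    # essentially zero at resting potential
half = k_open(0.0)      # half-maximal rate: 6.0 /ms
peak = k_open(40.0)     # near the 12 /ms maximum at the spike peak
```

The steepness of this sigmoid is what later justifies replacing it with a threshold-triggered pulse.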
This continuous-activation synapse model has the advantage of making the entire simulation somewhat more tractable analytically. However,
Local Variable Time-Step Method
Figure 2: Typical sequence of local integration steps in a six-cell example. The ta to tb intervals of possible interpolation are shown as black rectangles. The tb/e min is shown as a vertical dashed line. (A) The integration coordinator requests integration for the lagging cell (#0, with minimum tb). The integrator advances by dt (length of hatched rectangle). In that step, we suppose cell 0 crosses threshold, and a threshold event is generated, labeled "trigger" in panel B. This is an event whose time is tentative, since unprocessed synaptic events could still influence this cell. (B) Cell 5 has minimum tb and integrates forward. (C) The trigger event is at tb/e min. The handling of this trigger creates three events to be delivered to cells 3, 4, 5 at varying delays (short vertical lines). We suppose that its delay places the cell 3 input event earlier than any tb. (D) The event in cell 3 is now tb/e min. Cell 3 back-interpolates, the event is handled, and cell 3 reinitializes, giving ta^3 = tb^3 = tb/e min. Cell 3 will be the next cell to integrate forward.
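The coordinator's tb/e-min rule illustrated in Figure 2 can be sketched in a few lines. This is a minimal sketch, not NEURON's implementation: the Cell class is a toy stand-in for a per-cell CVODES integrator, with interpolate/handle_event/reinit standing in for the three-step event sequence.

```python
import heapq

class Cell:
    """Toy stand-in for a per-cell variable-step integrator (hypothetical interface)."""
    def __init__(self, dt):
        self.ta, self.tb, self.dt, self.received = 0.0, 0.0, dt, []

    def single_step(self):            # advance one step of this cell's own size
        self.ta, self.tb = self.tb, self.tb + self.dt

    def interpolate(self, t):         # states are available anywhere in [ta, tb]
        assert self.ta <= t <= self.tb

    def handle_event(self, t):
        self.received.append(t)

    def reinit(self, t):              # restart integration from the event time
        self.ta = self.tb = t

def coordinate(cells, events, t_stop):
    """tb/e-min rule: deliver the earliest event or step the least-advanced cell."""
    events = sorted(events)           # (delivery_time, target cell index)
    heap = [(c.tb, i) for i, c in enumerate(cells)]
    heapq.heapify(heap)
    while heap[0][0] < t_stop:
        tb_min, i = heap[0]
        if events and events[0][0] <= tb_min:
            te, j = events.pop(0)     # three-step sequence for an input event
            cells[j].interpolate(te)
            cells[j].handle_event(te)
            cells[j].reinit(te)
            heap = [(c.tb, k) for k, c in enumerate(cells)]  # re-key cell j
            heapq.heapify(heap)
        else:
            heapq.heappop(heap)
            cells[i].single_step()
            heapq.heappush(heap, (cells[i].tb, i))

cells = [Cell(dt=0.2), Cell(dt=0.5)]
coordinate(cells, events=[(0.7, 0)], t_stop=2.0)
```

Rebuilding the queue after a reinitialization is the crude O(Nc) version of what a splay-tree reposition does; the invariant maintained is the one stated in the text, that no event and no tb ever falls behind any ta.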
1 ms Figure 3: Comparison between event-triggered (solid) and continuously activated (dashed) synaptic conductance elicited by a presynaptic action potential. Curves superimpose except for slight deviations at initiation and peak, demonstrated by 50-fold blow-ups at these locations. Threshold = −5.7 mV, Cdur = 0.41 ms.
the continuous-activation model can be criticized as being nonbiophysical, since it represents postsynaptic conductance as being activated by somatic presynaptic voltage at all voltage levels (Destexhe, Mainen, & Sejnowski, 1994a, 1994b). At rest, this activation is generally infinitesimal and will have no effect on the simulation. A more important disadvantage of the continuous-activation synapse model is that it does not allow explicit definition of axonal and synaptic delays.

In order to implement the mutual-inhibition model using lvardt, we needed to translate the continuous-activation synapses into event-driven synapses. Equation 3.2 for the continuous-activation synapse gives a steeply rising sigmoid. Thus, transmitter release is significant only during the period in which the presynaptic action potential is above a threshold of around 0 mV. Furthermore, since the action potential trajectory in this region is relatively insensitive to changing synaptic inputs, the transmitter release is well approximated by a threshold-triggered stereotypical pulse of transmitter of duration Cdur. This latter synapse model has been used extensively (Destexhe et al., 1994a, 1994b; Lytton, 1996). Adjusting the event threshold and pulse-duration parameters to a least-squares best fit of the continuous-activation synapse conductance (see Figure 3) gives synaptic conductance trajectories so similar that simulations with the two kinds of synapses produce graphically identical results. Spike-time deviations between the event-driven simulation and the original continuous simulation were minimal: 108 ± 64 µs (mean ± standard deviation), well below the duration of an action potential. Improving the fit by, for example, adjusting the maximum kC→O was unnecessary. For
the mutual-inhibition model, event-driven simulations run a bit faster than the equivalent continuous simulations. This is due to the use of a single generalized synapse for each cell that accepts all of the connecting input event streams and discontinuously changes just two state variables when an event arrives. This is orders of magnitude more efficient than the continuous model, where there are about Nc synapses per cell, each with an ODE that requires an evaluation of equation 3.2 at every time step.

Given that fixed time-step methods remain the simulation standard, we compared fixed time-step performance with lvardt for the mutual-inhibition model (see Figure 4). We found that a fixed time step of 0.0025 ms gave results closely comparable to those of lvardt with an absolute error tolerance (atol) of 1 · 10−3 or 1 · 10−5. In this simulation, the firing of all cells is powerfully drawn into the synchronizing attractor, making the result qualitatively similar for any integration method or tolerance that produces reasonable spike trajectories for the individual cells. Figure 4A demonstrates that the lvardt simulations are relatively slow in the early phase of the simulation, where irregular firing generates high-frequency input events in all cells, but become far more efficient once synchrony sets in. Figure 4B explains this by showing the time steps used during the simulations. The fixed-dt methods are represented here by horizontal lines. The lvardt method produces time steps that jump around during the initial presynchrony phase of the simulation and then settle down to an alternation between large dt (>1 ms with atol = 1 · 10−3) in the long intervals separating the population spikes and extremely short dt during the population spike itself.
This is readily understood by noting the need not only to calculate individual cell spiking during the period of the population spike but also to handle the instantaneous (zero-delay) exchange of spikes and synaptic responses to these spikes during this same brief period. Profiling of these simulations demonstrates that lvardt consumes about 12% of its total CPU time performing these event deliveries and the associated interpolations. This relatively high figure is due to the fact that spikes that are tentatively triggered in a particular cell may then need to be taken back as other inputs into that cell arrive and alter the spike time.

When the global variable time-step method is used, this species of thrashing behavior sometimes resulted in severe inefficiencies. Due to the all-to-all connectivity, the near-simultaneous spiking of 100 neurons places 9900 (n^2 − n, since there is no self-connectivity) near-simultaneous events on the event queue after the occurrence of spikes in all of the cells. Using lvardt, these events are generally handled in sequential order, with only occasional need to recalculate threshold time (see Figure 2C). However, the global variable time-step method attempts to reconcile the mutual influence of all of these competing events using the single integrator. This led to large increases in computer time (up to 60-fold) in some simulations. It is possible to provide efficient handling of such a massive event influx under the global method by artificially defining a resolution interval within which events would be
Figure 4: Comparison of fixed and variable time-step methods for the mutual-inhibition model. (A) CPU time increases linearly with simulation time for the fixed-step method (dashed lines: dt = 2.5 µs, upper curve; dt = 25 µs, lower). There is a reduction in CPU load at the onset of synchrony using the variable-step method (solid lines: absolute error tolerance 1 · 10−5, upper; 1 · 10−3, lower). (B) Time-step size as a function of time. Log(dt) is shown for fixed dt (horizontal dashed lines) and variable dt in one neuron (solid lines).
considered simultaneous. However, such an ad hoc event-handling approach would not be desirable for other types of simulations. Instead, we regard lvardt as the natural implementation of the event-driven form of the mutual-inhibition model. We further note that the all-to-all connectivity and near-perfect synchrony of the model represent an extreme simulation situation.

3.3 Generalization of Mutual-Inhibition Model Using lvardt. The use of the event-driven simulation for the mutual-inhibition model provides
the desirable side effect of allowing arbitrary delays to be introduced into the simulation. We began exploring the effects of delays primarily in order to ensure that they would be handled readily, without introducing unexpected errors or inefficiencies. We found that CPU times using lvardt decreased with the introduction of a 2 ms delay (from 1.43 min to 1.07 min). CPU times were also similar with the introduction of inhomogeneity in the cells' intrinsic frequencies (1.33 min), variability in the delays (1.33 min), or variability in both intrinsic frequency and delay times (1.57 min). Introduction of a brief delay shifted, but did not otherwise interfere with, synchronization in the homogeneous case. Introduction of a range of delays (1.9–2.1 ms; uniform distribution) also did not interfere with synchronization. Further increasing the delay range to 2 to 5 ms produced slightly broader population spikes. However, activity still synchronized within four to five cycles, as before. These manipulations increased lvardt integration efficiency by 10% to 20%. With an inhomogeneous population of neurons having different natural firing frequencies, the population spike broadened considerably, again without significantly affecting the number of cycles required to achieve synchrony. Here again, there was only a mild reduction of integration speed, comparable to that seen with randomization of delays.

3.4 Use of lvardt with a Synfire Chain. The synfire chain, introduced by Abeles (1991; see also Aviel, Mehring, Abeles, & Horn, 2003), is optimal for application of the lvardt method. In this classical simulation, sets of cells are fired sequentially because synaptic connectivity density is greatest from each set to its follower set. At any one time, activity is restricted to a small set of cells that carry the signal forward, while the other cells in the simulation are quiescent or fire at lower background rates.
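The sparsity argument can be made quantitative with a back-of-envelope cost estimate: if only a small group of active cells steps at the small dt while the quiescent remainder crawls along at a large dt, the integration work shrinks accordingly. The numbers below are illustrative placeholders, not the paper's measured benchmarks:

```python
def work(n_active, n_total, t_stop, t_small, t_big):
    """Approximate state-update count: active cells step at t_small,
    quiescent cells at t_big."""
    return n_active * t_stop / t_small + (n_total - n_active) * t_stop / t_big

fixed = work(100, 100, 500.0, 0.025, 0.025)   # every cell forced to the small step
lvardt = work(10, 100, 500.0, 0.025, 1.0)     # one active group of 10 cells
speedup = fixed / lvardt                      # roughly an order of magnitude
```

With a longer chain (smaller active fraction) or larger t_big, the estimated speed-up grows toward the one to two orders of magnitude quoted below.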
Our evaluation of simulations consisting of 100 single-compartment Hodgkin-Huxley cells showed a 20-fold speed-up using lvardt compared to the fixed-dt method at similar accuracy. In general, synfire simulations can be expected to show speed-ups of one to two orders of magnitude, depending on the size of the chain, background firing rates, and forward versus lateral connectivity densities. Profiling of these simulations demonstrated that event-handling overhead was insignificant.

3.5 Use of lvardt in Thalamocortical Simulation. The highly structured, stereotyped simulations described above were meant to highlight situations in which lvardt would be particularly useful (synfire chain) or would be likely to encounter problems (mutual-inhibition model). We also benchmarked lvardt with a more complex thalamocortical simulation more closely related to activity in the nervous system. This simulation features four cell types: cortical pyramidal neurons and interneurons, thalamocortical cells, and thalamic reticular neurons (Bazhenov et al., 1998). The two
thalamic cell types produced bursts with prolonged interburst intervals, a situation particularly advantageous for the lvardt algorithm. To preserve accurate spike times out to 150 ms of simulation time, we had to use high-accuracy simulations: an error tolerance (atol) of 1 · 10−6 for the lvardt method and a dt of 1 · 10−4 for the fixed time-step method. (The concept of accuracy is somewhat problematic when considering these complex network simulations, as will be discussed further below.) The results were striking: 10 hours 20 minutes of simulation time for fixed dt versus 6 minutes 13 seconds for lvardt, a 100-fold speed-up. Less dramatic results were obtained when comparing to a more typical fixed dt of 1 · 10−2, comparable in this simulation to the lvardt method with an error tolerance of 1 · 10−3. In this case, the simulations took 2 minutes for lvardt and 5 minutes 53 seconds for fixed dt, a 3-fold speed-up.

As in the original Bazhenov et al. (1998) model, several cell parameters were randomized to introduce variability into the model. In general, added variability would be expected to be advantageous for lvardt, increasing the likelihood that the integrators would require different time steps at a given point in the simulation. In practice, repetitive drive in this simulation dominated the dynamics, so that changing the degree of cellular variability made little difference. Similarly, addition of noisy inputs had little effect in this simulation. In general, addition of strong, high-frequency noisy inputs would be expected to remove lvardt's advantages, requiring interrupts and reinitialization at the input frequencies. Such an extreme case would also adversely affect performance of the global variable time-step method. Although the fixed-dt method would be unaffected, it would be inaccurate: all events would be rounded off to the nearest dt, an aliasing that produces false input synchrony.
Addressing this by reduction to an appropriately small fixed dt would again leave the variable time steps with an advantage. 3.6 Estimating Simulation Complexity. In order to perform detailed benchmarking and complexity evaluation, we needed a hybrid simulation that would allow us to scale simulation size without altering the simulation pattern. This was not readily done with the thalamocortical simulation, where scaling to greater numbers of neurons produced activity spread that depended critically on boundary conditions and parameter-scaling choices, despite preservation of qualitative activity pattern. We therefore went back to the synfire chain, reconfiguring it as a set of rings (see Figure 5A) to make it scale well and produce continuous activity. Each neuron in this rings simulation has 10 active compartments with Hodgkin-Huxley sodium, potassium, and leak conductances, that is, 40 states. Increasing the number of neurons in a ring increases the size of the simulation without increasing the amount of parallel activity: each ring is a delay line in which only one neuron is active at a given time. An increase in the number of rings increases the amount of activity occurring simultaneously.
Figure 5: (A) Schematic of a rings simulation using 8 rings of 10 neurons. The 10 tightly coupled compartments of each neuron have standard Hodgkin-Huxley channels. Activity is passed around each ring independently. (B) Log-log plot of simulation time versus simulation size. Simulations range in size from 10 to 20,480 neurons (400 to 819,200 state variables in total) in 1 to 128 rings. With the global and fixed-dt methods, results for different numbers of rings overlap. With lvardt (dashed lines), simulation time increases with the number of rings (number above each line). The asterisk marks the simulation represented in A, with the dashed vertical line indicating run time with the fixed and global methods. Times are on a Pentium CPU running at 3 GHz.
As expected, simulation time for the global and fixed methods depends only on the number of cells, not on how they are divided into rings. Therefore, the "global" and "fixed dt" curves for different numbers of rings all overlap in Figure 5B. The relative location of these curves reflects the usual choice of time step (0.025 ms) for the fixed method and atol = 1 · 10−3 for the global variable-step method. The lvardt curves indicate an enormous speed advantage when run with a small number of rings (lower set of dashed curves labeled 1, 2, 4), where
only 1, 2, or 4 cells, respectively, will be active at any given time. As we increase the number of rings, and hence the number of cells simultaneously active, lvardt takes more computation time to simulate a given number of cells. As the number of simultaneously active cells approaches the total number of cells, lvardt will be placed at a disadvantage compared to the other integration methods, as it wastes time maintaining the integrator queue and incurs the Nc-fold increase in constant integrator overhead. At the left side of each lvardt curve (small number of cells per ring), doubling the number of rings doubles the number of active cells and approximately doubles simulation time. As the number of cells per ring rises, simulation time increases only very slightly at first and then rises more steeply, causing the curves to converge somewhat as the number of cells per ring gets very large. With these very large rings, the time spent integrating the one firing cell is swamped by the time spent taking maximum time steps in the large number of quiescent cells. Evaluation of the lvardt integration in the ring simulation allowed us to develop an empirical weighting of simulation time that indicates the complexity of the lvardt method:
Trun = Nc · (tstop/tsmall) · [(θ + (1 − θ) · tsmall/tbig) · Ts · s + To + Tq · log(Nc)].    (3.3)
Total simulation time (Trun) for the lvardt method will be proportional to the number of cells (Nc) and total model time (tstop). The characteristic inverse dependence on time step (tsmall) must here be weighted by a θ factor, indicating the proportion of neuron time spent crawling through spikes at small dt. The proportion tsmall/tbig is also included, since inactive cells will be expected to have a characteristic time step, related to the subthreshold synaptic input interval, during the interspike interval. For the fixed-step method, tbig = tsmall. For the global variable-step method, θ depends on the proportion of active to inactive cells in the network as a whole. For all methods, simulation time will depend on the number of state variables s per neuron and the time Ts required to integrate a single state variable. lvardt has two more dependencies: general overhead time To for handling each integrator and queue-handling overhead Tq for determining which integrator goes next. This latter term scales as Nc · log(Nc) rather than Nc, reflecting the scaling of a queue-sorting algorithm. Profiling demonstrates that the integrator queue's effect on simulation time is minimal (the coefficient Tq is relatively small). For 128 rings of 160 cells each, management of the integrator queue takes under 3% of the simulation time, while integration takes 94%. When using 1-compartment instead of 10-compartment cells (4 instead of 40 state variables) with this
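Equation 3.3 can be exercised as a small cost model. The parameter values below are arbitrary placeholders chosen only to illustrate two limits discussed in the text: with tbig = tsmall the θ weighting collapses to 1 (recovering the fixed-step cost), while tbig >> tsmall cuts the cost contributed by quiescent cells.

```python
import math

def t_run(Nc, t_stop, t_small, t_big, theta, s, Ts, To, Tq):
    """Equation 3.3: predicted lvardt run time (arbitrary units)."""
    steps = t_stop / t_small
    per_step = (theta + (1 - theta) * t_small / t_big) * Ts * s + To + Tq * math.log(Nc)
    return Nc * steps * per_step

base = dict(Nc=100, t_stop=500.0, s=40, Ts=1.0, To=0.0, Tq=0.0)

# Fixed-step limit: with t_big = t_small, theta no longer matters.
fixed_lo = t_run(t_small=0.025, t_big=0.025, theta=0.1, **base)
fixed_hi = t_run(t_small=0.025, t_big=0.025, theta=0.9, **base)

# Quiescent cells taking big steps (t_big >> t_small) cut the predicted cost.
sparse = t_run(t_small=0.025, t_big=1.0, theta=0.1, **base)
```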
size network, state integration still dominates the calculation, with queue time increasing to only 6% of the simulation time. In Figure 2, we depicted times for events and integration boundaries as they would appear on a single queue. In the NEURON implementation, events are maintained on an event queue while integrators are maintained on a separate integrator queue. Only the latter is considered in the Tq term of equation 3.3. We did not consider the time for the event queue in this equation, since it is the same regardless of which of the three integration methods is used.

4 Discussion

The opposing demands of the continuous and event-triggered aspects of neural simulation suggest that the problem can be split up. Individual neurons are computationally demanding. By contrast, connectivity does not require much CPU time, although its representation may be demanding of memory (up to n^2 in the number of neurons). Thus, it is natural to separate the simulation problem into calculation of the continuous neural potentials and calculation of events and their effects. When the problem is split in this way, it is immediately apparent that the computational demands of individual neurons will differ at any given time. This suggests the use of the local time-step method.

4.1 Event-Driven Simulation. Prior work has demonstrated a variety of methods for efficient handling of event-driven neural network simulations (Makino, 2003; Mattia & Del Giudice, 2000; Watts, 1994). However, these networks have been restricted to artificial cells, which permit analytical solution or approximation of cell states based on values at an arbitrary prior time. In such a network, cell states are calculated at the time of event receipt based on values determined at the prior event. In addition to external events (te), the event queue for an artificial network may also contain self-events.
For example, an artificial cell may use an event to alert itself to the end of its refractory period, permitting resetting of an internal state flag. Such self-events are typically of fixed period and can be added effortlessly to the queue. In an artificial cell network, the simulator maintains a queue of scheduled events that are then evaluated in order. To handle an event, the simulator updates the states of follower cells and places on the queue any new events generated by these followers. Since many practical simulations involve only a small set of event delay times, the need for O(log Nc) queue sorting is avoided, and the event queue can make use of efficient algorithms such as the O(1) algorithm presented by Mattia and Del Giudice (2000). Similar to the above, a network in NEURON can be constructed entirely of event-driven artificial cells. In an artificial cell network without realistic cells, there is no need for integration. In this setting, lvardt creates no
integrators and does not need an integrator queue, making use of only the event queue. States for artificial cells are computed analytically on the arrival of events, and output events are then added to the event queue. By contrast with the event queue, the integrator queue always contains the Nc ordered tb entries, where Nc is the number of cells. The tb times are uncorrelated with each other and with any te times. This variability in event order means that the integrator queue requires a fully general algorithm with characteristic log(Nc) complexity per time step. In NEURON, both the event and integrator queues are based on Jones's (1986) implementation of the splay tree algorithm of Sleator and Tarjan (1983). As we have shown, integrator queue execution time is negligible even for simulations on the order of 10,000 cells.

4.2 Contrasting Example Simulations. In order to demonstrate the general usefulness of the method, we explored its performance in several examples. The mutual-inhibition model is relatively unsuited to the use of any variable time-step method, proceeding to full synchronization within the round-off error of the double-precision representation available. With regard to lvardt, such synchronization is a worst-case scenario from a performance perspective, since perfect synchrony of identical neurons means that all integrators are redundantly performing the same calculations at the same time. In addition, the mutual-inhibition model places increasing demands on event accounting, as the event-delivery algorithm attempts to reconcile the mutual effects of the n^2 − n events arriving nearly simultaneously. Performance on the mutual-inhibition model simulation can be greatly improved by providing a window wherein arriving events are considered simultaneous and can be handled in the order received.
We have explored this implementation but did not pursue it since it is an ad hoc solution to a peculiar simulation with pathologically synchronized behavior and since the lvardt method performed adequately despite this handicap. Moving beyond this artificially handicapped situation, it can readily be seen that the best use of lvardt will be in larger simulations involving heterogeneous neural populations where activity bounces from one area to another or spreads across the network. The synfire chain simulation represents the best-case extreme with only a small subset of neurons being active at a given time. lvardt can integrate these few, devoting no resources to the many that are done or await their moment. The ring simulation generalizes the synfire chain to allow the simulation size and the extent of simultaneous activity to be independently scaled. With a large enough number of rings, lvardt will be expected to lose its advantage, as integrator switching between simultaneously active cells will dominate. The more complex thalamocortical simulation was also assessed to demonstrate that the new method has real virtual-world application. While performance of lvardt on this simulation produced excellent results, comparison of simulations run using different integration methods with different
degrees of precision raised questions of the appropriate standard of accuracy for simulations. In general, network simulations will not converge to a single, correct result since long runs will invariably yield a near-threshold event, which will produce spikes at slightly different times or not at all, ultimately dependent on round-offs, an example of sensitivity to initial conditions. From this point onward, the divergence of activity in the single neuron may spread to alter firing patterns throughout the network. In the case of a synchronizing network such as the mutual-inhibition model, the strength of the synchronizing attractor will resolve such deviations. In the general case, however, these deviations result in entirely different spike trains after a certain point, regardless of the degree of accuracy requested. This invariable variability requires some metric other than strict spike occurrence times to identify firing pattern similarities and the adequacy of an integration method (Victor & Purpura, 1996).
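The metric-space approach cited above (Victor & Purpura, 1996) measures spike-train similarity as the minimum cost of transforming one train into the other: inserting or deleting a spike costs 1, and moving a spike by Δt costs q · |Δt|. A minimal dynamic-programming sketch (the spike trains and the q value are illustrative):

```python
def vp_distance(a, b, q):
    """Victor-Purpura distance between spike trains a and b (sorted spike times)."""
    n, m = len(a), len(b)
    # D[i][j] = cost of matching the first i spikes of a with the first j of b.
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = float(i)                 # delete i spikes
    for j in range(m + 1):
        D[0][j] = float(j)                 # insert j spikes
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i - 1][j] + 1,                               # delete a[i-1]
                          D[i][j - 1] + 1,                               # insert b[j-1]
                          D[i - 1][j - 1] + q * abs(a[i - 1] - b[j - 1]))  # shift
    return D[n][m]

# A small jitter in one spike costs q * |dt| rather than a full delete + insert.
d = vp_distance([1.0, 5.0, 9.0], [1.0, 5.1, 9.0], q=1.0)
```

The cost parameter q sets the time scale at which two spikes count as "the same" spike, which is exactly the kind of tolerance the comparison of integration methods requires.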
4.3 Parallel Computation. The use of parallel integrators for different neurons naturally raises the issue of porting this simulation method to a parallel computer. The NEOSIM project (http://www.neosim.org) has developed a parallel discrete-event sample implementation for the delayed delivery of spike events from a source cell on one CPU to a target cell on another. This implementation coexists with NEURON's lvardt method. High performance requires a significant minimum spike delay time td^ij between source cell i and target cell j. In this case, our strict integration assertion property ta^i ≤ tb^j ∀ i, j is relaxed to ta^i ≤ tb^j + td^ij. The NEURON portion of the NEOSIM + NEURON implementation merely accepts a request from NEOSIM to integrate a specific cell or group of cells to a specified stop time consistent with the relaxed assertion. When a NEURON cell fires at time t, it notifies NEOSIM. At this time, no cell has a ta > t + td^ij. Note that in our zero-delay mutual-inhibition model, this parallel method would be useless. However, most network models have a significant minimum delay between different cells and can realize significant performance gains with the NEOSIM parallel discrete-event delivery technique.
5 Conclusion

In summary, lvardt offers substantial speed-ups for simulations of networks of realistic ion-channel-based neurons. The advantages will be greatest in situations where some neurons are quiescent during periods when other portions of the network are active. This would be the case for simulations involving serial activation of areas, as, for example, in the hippocampus, or involving cell types with very different firing properties, as, for example, in a thalamocortical or basal ganglia simulation. lvardt's mixture of event-driven and differential-equation simulation also makes it ideal for implementation
of hybrid networks, where artificial cells with analytically soluble states are combined with realistic neurons requiring full ODE integration. As with any other computational method, the suitability of lvardt is dependent on the exact problem to be solved. The individual user will want to benchmark a particular simulation across methods before deciding which one to use. A general assessment of suitability can be performed by considering the factors laid out in equation 3.3. Our heuristic conclusion is that for medium-size networks in which average synaptic input intervals to a single cell are much greater than a fixed step dt, the lvardt method will have better performance than the fixed-step method. Acknowledgments This work was supported by NINDS grants NS11613 and NS32187. We thank the reviewers for many helpful suggestions. References Abeles, M. (1991). Corticonics: Neural circuits of the cerebral cortex. Cambridge: Cambridge University Press. Aviel, Y., Mehring, C., Abeles, M., & Horn, D. (2003). On embedding synfire chains in a balanced network. Neural Computation, 15, 1321–1340. Bazhenov, M., Timofeev, I., Steriade, M., & Sejnowski, T. (1998). Computational models of thalamocortical augmenting responses. Journal of Neuroscience, 18, 6444–6465. Cohen, S., & Hindmarsh, A. (1994). Cvode user guide (Tech. Rep.). Livermore, CA: Lawrence Livermore National Laboratory. Destexhe, A., Mainen, Z., & Sejnowski, T. (1994a). An efficient method for computing synaptic conductances based on a kinetic model of receptor binding. Neural Computation, 6, 14–18. Destexhe, A., Mainen, Z., & Sejnowski, T. (1994b). Synthesis of models for excitable membranes, synaptic transmission and neuromodulation using a common kinetic formalism. J. Comput. Neurosci, 1, 195–230. Hindmarsh, A., & Serban, R. (2002). User documentation for cvodes, an ode solver with sensitivity analysis capabilities (Tech. Rep.). Livermore, CA: Lawrence Livermore National Laboratory. Hines, M. (1984). 
Received December 18, 2003; accepted August 25, 2004.
LETTER
Communicated by Anthony Burkitt
Synaptic Shot Noise and Conductance Fluctuations Affect the Membrane Voltage with Equal Significance Magnus J. E. Richardson
[email protected] Wulfram Gerstner
[email protected] Laboratory of Computational Neuroscience, I&C and Brain-Mind Institute, École Polytechnique Fédérale de Lausanne, CH-1015 Lausanne EPFL, Switzerland
The subthreshold membrane voltage of a neuron in active cortical tissue is a fluctuating quantity with a distribution that reflects the firing statistics of the presynaptic population. It was recently found that conductance-based synaptic drive can lead to distributions with a significant skew. Here it is demonstrated that the underlying shot noise caused by Poissonian spike arrival also skews the membrane distribution, but in the opposite sense. Using a perturbative method, we analyze the effects of shot noise on the distribution of synaptic conductances and calculate the consequent voltage distribution. To first order in the perturbation theory, the voltage distribution is a gaussian modulated by a prefactor that captures the skew. The gaussian component is identical to distributions derived using current-based models with an effective membrane time constant. The well-known effective-time-constant approximation can therefore be identified as the leading-order solution to the full conductance-based model. The higher-order modulatory prefactor containing the skew comprises terms due to both shot noise and conductance fluctuations. The diffusion approximation misses these shot-noise effects, implying that analytical approaches such as the Fokker-Planck equation or simulation with filtered white noise cannot be used to improve on the gaussian approximation. It is further demonstrated that quantities used for fitting theory to experiment, such as the voltage mean and variance, are robust against these nongaussian effects. The effective-time-constant approximation is therefore relevant to experiment and provides a simple analytic base on which other pertinent biological details may be added.
1 Introduction

Given a perfect model of the membrane response to synaptic input, it would be possible to infer from the distribution of the subthreshold, membrane-voltage fluctuations many quantities of interest, such as the levels of activity and correlations in the excitatory and inhibitory presynaptic populations. Neural Computation 17, 923–947 (2005)
© 2005 Massachusetts Institute of Technology
924
M. Richardson and W. Gerstner
Early models of synaptic input (Stein, 1965) comprised a leaky integrator driven by a stochastic current, which generated postsynaptic potentials of fixed amplitude. Since then, great effort has been made to incorporate further biological details. Soon after the publication of Stein’s model, synaptic conductance effects began to be addressed (Stein, 1967; Johannesma, 1968; Tuckwell, 1979; Wilbur & Rinzel, 1983; Lansky & Lanska, 1987). These early models featured unfiltered, delta-pulse synapses and were primarily concerned with the statistics of the interspike interval distribution. Although the majority of studies used the diffusion approximation (i.e., the limit of high synaptic rates and low postsynaptic potential amplitudes), the effects of shot noise due to Poisson-distributed pulse arrival at low rates have also been considered (see, e.g., Tuckwell, 1989) in the context of stochastic resonance (Hohn & Burkitt, 2001) and the neural response to correlations in the presynaptic population (Kuhn, Aertsen, & Rotter, 2003). Other studies have examined the filtering of the incoming pulses at the synapses and have shown it can lead to unexpected dynamical response properties: synaptic filtering can, paradoxically, allow neurons to follow high-frequency signals better (Brunel, Chance, Fourcaud, & Abbott, 2001; Fourcaud & Brunel, 2002). More recently, a number of experimental studies have directly measured the effect of synaptic drive on the membrane voltage (Kamondi, Acsady, Wang, & Buzsaki, 1998; Destexhe & Paré, 1999; Sanchez-Vives & McCormick, 2000; Monier, Chavane, Baudot, Graham, & Frégnac, 2003; Holmgren, Harkany, Svennenfors, & Zilberter, 2003).
The availability of such measurements has led to a renewed interest in the quantitative modeling of synaptic drive, with a view to inferring presynaptic network states from voltage fluctuations (Stroeve & Gielen, 2001; Rudolph, Piwkowska, Badoual, Bal, & Destexhe, 2004), comparing current and conductance-based models of synaptic drive (Tiesinga, José, & Sejnowski, 2000; Rauch, La Camera, Lüscher, Senn, & Fusi, 2003; Rudolph & Destexhe, 2003; Jolivet, Lewis, & Gerstner, 2004; Richardson, 2004; La Camera, Senn, & Fusi, 2004; Meffin, Burkitt, & Grayden, 2004), and exploring mechanisms for the gain control of the neuronal response (Chance, Abbott, & Reyes, 2002; Burkitt, Meffin, & Grayden, 2003; Destexhe, Rudolph, & Paré, 2003; Fellous, Rudolph, Destexhe, & Sejnowski, 2003; Prescott & De Koninck, 2003; Grande, Kinney, Miracle, & Spain, 2004; Kuhn, Aertsen, & Rotter, 2004). In this letter, the combined effects on the membrane voltage of synaptic shot noise, filtering, and conductance increase will be examined. The central result is that the effects of synaptic shot noise on the membrane voltage statistics are as significant as those of synaptic conductance fluctuations, and therefore either both (or neither) of these features of the synaptic drive should be taken into account for a consistent approach. This means that diffusion-level descriptions, such as numerical simulations or the Fokker-Planck approach, in which the drive is modeled as gaussian noise, cannot correctly describe detailed aspects of the membrane-voltage distribution, such as its skew.
Shot Noise and Conductance Fluctuations
925
2 Membrane Response to Synaptic Drive

In this section, the full model of the membrane response to synaptic drive is introduced and two common approximations to this model are outlined. An analysis of the aspects of the drive missed by these approximation schemes will motivate the development of a perturbative approach.

2.1 The Full Model. Following Stein (1967), the membrane voltage V(t) responds passively to synaptic drive: voltage-gated channels, including spike-generating currents, are not included. The membrane is modeled by a capacitance C in parallel with a leak conductance g_L and two fluctuating excitatory g_e(t) and inhibitory g_i(t) conductances with equilibrium potentials at E_L, E_e, and E_i, respectively. This system therefore comprises three independent variables:

\[ C\,\frac{dV}{dt} = -g_L(V-E_L) - g_e(V-E_e) - g_i(V-E_i) + I_{app} \tag{2.1} \]

\[ \tau_e\,\frac{dg_e}{dt} = -g_e + c_e\tau_e \sum_{\{t_k^e\}} \delta\big(t-t_k^e\big) \tag{2.2} \]

\[ \tau_i\,\frac{dg_i}{dt} = -g_i + c_i\tau_i \sum_{\{t_k^i\}} \delta\big(t-t_k^i\big). \tag{2.3} \]
The excitatory conductance is driven by pulses that arrive at the Poisson-distributed times {t_k^e} at a total rate R_e summed over all input fibers. Each pulse provokes a quantal conductance increase c_e, which then decays exponentially with a time constant τ_e. The inhibitory conductance is defined analogously. Any experimentally applied current is accounted for by I_app. In this letter, only the steady-state statistical properties will be considered. Thus, all expectations of a quantity x(t), written as ⟨x(t)⟩, denote either an average over an ensemble of statistically independent systems, in which any transients due to initial conditions are no longer present, or the temporal average of x(t) in a single system.

2.2 The Diffusion Approximation. For the case in which the rates R_e, R_i are relatively high, the number of pulses that arrive within the timescales τ_e, τ_i will be approximately gaussian distributed. The replacement of the synaptic shot noise in equations 2.2 and 2.3 by a constant term and gaussian white noise constitutes the diffusion approximation. Thus, using excitation as an example,

\[ \tau_e\,\frac{dg_e}{dt} \simeq g_{e0} - g_e + \sqrt{2}\,\sigma_e\,\xi_e(t), \tag{2.4} \]
where the gaussian white noise ξ_e(t) has a mean and autocorrelation function defined by

\[ \langle \xi_e(t) \rangle = 0, \qquad \langle \xi_e(t)\,\xi_e(t') \rangle = \tau_e\,\delta(t-t'). \tag{2.5} \]
The Ornstein-Uhlenbeck process (see equation 2.4) has been shown to capture the statistics of conductance fluctuations at the soma of compartmentalized model neurons (Destexhe, Rudolph, Fellous, & Sejnowski, 2001). The average conductance g_{e0} and the standard deviation σ_e are related to the variables c_e, τ_e, and R_e through

\[ g_{e0} = c_e \tau_e R_e, \qquad \sigma_e = c_e \sqrt{\frac{\tau_e R_e}{2}}. \tag{2.6} \]
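As a sanity check, equation 2.2 can be integrated numerically and the sample moments compared with equation 2.6. The sketch below uses illustrative parameter values (c_e, R_e, the time step, and the run length are assumptions, not the paper's simulation settings, which are given in appendix A):

```python
import numpy as np

# Euler integration of the shot-noise conductance, equation 2.2:
# exponential decay with time constant tau_e plus quantal jumps of
# size c_e at Poisson-distributed arrival times.
rng = np.random.default_rng(1)

tau_e = 3.0          # synaptic time constant (ms)
c_e = 1.0            # quantal conductance increase (arbitrary units)
R_e = 1.0            # total Poisson input rate (events per ms)
dt = 0.01            # integration step (ms)
n_steps = 1_000_000  # total simulated time: 10,000 ms

counts = rng.poisson(R_e * dt, size=n_steps)  # arrivals in each step
g = np.empty(n_steps)
g_now = c_e * tau_e * R_e                     # start at the stationary mean
for k in range(n_steps):
    g_now += -g_now * dt / tau_e + c_e * counts[k]
    g[k] = g_now

ge0 = c_e * tau_e * R_e                   # equation 2.6: mean conductance
sigma_e = c_e * np.sqrt(tau_e * R_e / 2)  # equation 2.6: standard deviation
print(g.mean(), ge0)     # sample mean close to the theoretical value
print(g.std(), sigma_e)  # sample std close to the theoretical value
```

With R_e τ_e = 3 here, the empirical mean and standard deviation agree with g_{e0} = 3 and σ_e = √1.5 ≈ 1.22 to within sampling error.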
By construction, the first two moments of the diffusion approximation are identical to those of the shot noise process. Higher moments, however, are not correctly reproduced in the diffusion approximation. The conductance equation 2.4 is linear and can be integrated.¹ The fluctuating component g_{eF} of the conductance is

\[ g_{eF}(t) \equiv g_e(t) - g_{e0} \simeq \sqrt{2}\,\sigma_e \int_0^\infty \frac{ds}{\tau_e}\, e^{-s/\tau_e}\, \xi_e(t-s), \tag{2.7} \]

which yields (with equation 2.5) the gaussian distribution

\[ p_D(g_e) = \frac{1}{\sqrt{2\pi\sigma_e^2}} \exp\left( -\frac{(g_e-g_{e0})^2}{2\sigma_e^2} \right). \tag{2.8} \]
The subscript signifies that the calculation was made in the diffusion approximation. There are clearly some problems with distribution 2.8 if the conductance mean g_{e0} is of a similar magnitude to the standard deviation σ_e. In this regime, the diffusion approximation predicts negative conductances (Lansky & Lanska, 1987; Rudolph & Destexhe, 2003). In fact, the criterion for validity of the diffusion approximation is

\[ \sigma_e / g_{e0} \ll 1, \tag{2.9} \]

¹ The Stratonovich formulation of stochastic calculus is used throughout this letter. However, for additive white noise or multiplicative colored noise, there is no difference between the Stratonovich and Ito forms. See, for example, Risken (1996).
suggesting that this approximation misses higher-order terms scaling with powers of σ_e/g_{e0}. Thus, the shot noise conductance fluctuations should read

\[ g_{eF}(t) = \sqrt{2}\,\sigma_e \int_0^\infty \frac{ds}{\tau_e}\, e^{-s/\tau_e}\, \xi_e(t-s) + \text{corrections} \propto \frac{\sigma_e}{g_{e0}}, \tag{2.10} \]
where ξ_e(t) is the gaussian white noise defined in equation 2.5.

2.3 The Diffusion Approximation Is Inconsistent. The combination of the diffusion approximation of the synaptic drive (see equation 2.4 and its equivalent for inhibition) and the full voltage equation, 2.1, will now be examined. By separating the synaptic conductances into tonic components g_{e0}, g_{i0} and fluctuating components g_{eF}, g_{iF}, the voltage equation can be written as

\[ C\,\frac{dV}{dt} = -g_0(V-E_0) - g_{eF}(V-E_e) - g_{iF}(V-E_i), \tag{2.11} \]
where the total conductance g_0 and drive-dependent equilibrium potential E_0 are defined by

\[ g_0 = g_L + g_{e0} + g_{i0} \quad\text{and}\quad E_0 = \frac{1}{g_0}\big( g_L E_L + g_{e0} E_e + g_{i0} E_i + I_{app} \big). \tag{2.12} \]
The subscripts 0 anticipate that these quantities are correct at the zero order of a perturbation expansion that will be developed in a later section. The total conductance g_0 suggests the introduction of an effective membrane time constant,

\[ \tau_0 = C/g_0. \tag{2.13} \]
This feature of the synaptic drive was identified in the early analytic treatment of Johannesma (1968). The fluctuation terms driving the voltage in equation 2.11 will now be examined. Taking excitation as an example, the voltage-dependent component of the drive can be expanded around the equilibrium potential E_0,

\[ g_{eF}(V-E_e) = g_{eF}(E_0-E_e) + g_{eF}(V-E_0). \tag{2.14} \]
The two terms on the right-hand side have simple interpretations. The first is an additive noise term and therefore just a fluctuating current. The second is a multiplicative noise term and, in the context of equation 2.11, it can be
seen that this term represents fluctuations in g_0, or equivalently in τ_0, the effective membrane time constant. These two noise terms are, however, not equally significant. The quantity V − E_0 grows (linearly) with the fluctuations g_{eF}, g_{iF}. So whereas the additive noise terms are of the order g_{eF}, g_{iF}, the multiplicative noise terms are of the order g_{eF}², g_{iF}², and g_{eF} g_{iF}. This suggests that (1) the multiplicative noise terms could be neglected if the noise strength were in some way small, and (2) if these terms were retained, the effects of the synaptic drive on the membrane voltage would be modeled in greater detail. Point 1 is valid, as will be seen in section 2.4. Point 2, however, is false due to an unexpected weakness of the diffusion approach with multiplicative noise. This will now be outlined. On reexamining equations 2.7 to 2.9, it is seen that relative to the tonic conductance, the fluctuations in the diffusion approximation scale with σ_e/g_{e0}. But equation 2.10 states that the terms missed by this approximation scale with the square of this quantity. Hence,

\[ g_{eF}/g_{e0} = A\,\frac{\sigma_e}{g_{e0}} + B\left(\frac{\sigma_e}{g_{e0}}\right)^{2} + \cdots, \tag{2.15} \]
where A is the diffusion-level term and B is the first-order correction due to shot noise. Given that σ_e/g_{e0} is the small quantity parameterizing the diffusion approximation, it is clearly inconsistent to neglect the second-order term B in the additive noise g_{eF}(E_e − E_0) of equation 2.14 but keep the implicit A² term in the multiplicative noise g_{eF}(V − E_0) ∝ g_{eF}². This is, however, what occurs in the diffusion approximation. This result is surprising because it implies that although diffusion-based approaches (such as the Fokker-Planck equation or any simulation with filtered gaussian noise) purport to capture the effects of synaptic conductance fluctuations, they miss equally important terms due to the shot noise. However, it should be stressed that almost all previous studies of conductance-based synaptic noise that used the diffusion approximation implicitly concentrated their analyses on the dominant effects coming from the tonic conductance increase and the additive noise term; the conclusions of such studies remain valid.

2.4 The Effective-Time-Constant Approximation. This is also known as the gaussian approximation of the voltage distribution. The treatment of the membrane voltage can easily be made consistent with the diffusion approximation of the synaptic conductance equations. This is achieved by dropping the multiplicative noise term, that is, by neglecting conductance fluctuations, to yield

\[ C\,\frac{dV}{dt} \simeq -g_0(V-E_0) + g_{eF}(E_e-E_0) + g_{iF}(E_i-E_0). \tag{2.16} \]
This voltage equation is of the form of a current-based model, but the dominant effect of the synaptic conductance is accounted for through the use of an increased effective leak g_0. This approximation is in widespread use, having been applied to white noise synaptic drive (Wan & Tuckwell, 1979; Lansky & Lanska, 1987; Burkitt & Clark, 1999; Burkitt, 2001; Burkitt et al., 2003; La Camera et al., 2004), alpha-pulse synapses (Manwani & Koch, 1999), and, more recently (Richardson, 2004), to the case of exponentially filtered synapses studied here. The equation set comprising the voltage equation 2.16 and the diffusion approximations for the conductances is simple to analyze and can be integrated to give

\[ V(t) - E_0 \simeq \sqrt{2}\,\frac{\sigma_e}{g_0}\,\frac{(E_e-E_0)}{(\tau_e-\tau_0)} \int_0^\infty ds\, \big( e^{-s/\tau_e} - e^{-s/\tau_0} \big)\, \xi_e(t-s) + \sqrt{2}\,\frac{\sigma_i}{g_0}\,\frac{(E_i-E_0)}{(\tau_i-\tau_0)} \int_0^\infty ds\, \big( e^{-s/\tau_i} - e^{-s/\tau_0} \big)\, \xi_i(t-s). \tag{2.17} \]

This equation has an obvious interpretation: the quantities multiplying the noise are just the excitatory and inhibitory postsynaptic potentials for a membrane with an effective time constant τ_0. The fact that it is linear in the noise means that many quantities of interest can be easily calculated, including temporal measures such as the autocorrelation function. The distribution predicted for the voltage is the gaussian

\[ p_0(V) = \frac{1}{\sqrt{2\pi\sigma_V^2}} \exp\left( -\frac{(V-E_0)^2}{2\sigma_V^2} \right), \tag{2.18} \]
where, for the case in which there are no correlations between excitation and inhibition, the variance is (Richardson, 2004)

\[ \sigma_V^2 = \left(\frac{\sigma_e}{g_0}\right)^{2} (E_e-E_0)^2\, \frac{\tau_e}{(\tau_e+\tau_0)} + \left(\frac{\sigma_i}{g_0}\right)^{2} (E_i-E_0)^2\, \frac{\tau_i}{(\tau_i+\tau_0)}. \tag{2.19} \]
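Taken together, equations 2.12, 2.13, and 2.19 give the complete gaussian-level description. A minimal sketch follows, using the membrane constants of appendix A; the reversal potentials, mean conductances, and fluctuation strengths below are illustrative assumptions, not the paper's values:

```python
import numpy as np

# Gaussian-level voltage statistics from equations 2.12, 2.13, and 2.19.
C = 1.0      # membrane capacitance (uF/cm^2), appendix A
g_L = 0.05   # leak conductance (mS/cm^2), appendix A
E_L, E_e, E_i = -80.0, 0.0, -75.0  # reversal potentials (mV), assumed
tau_e, tau_i = 3.0, 10.0           # synaptic time constants (ms)
I_app = 0.0

g_e0, g_i0 = 0.02, 0.06        # mean synaptic conductances (mS/cm^2), assumed
sigma_e, sigma_i = 0.01, 0.02  # conductance fluctuation strengths, assumed

g_0 = g_L + g_e0 + g_i0                                    # equation 2.12
E_0 = (g_L * E_L + g_e0 * E_e + g_i0 * E_i + I_app) / g_0  # equation 2.12
tau_0 = C / g_0                                            # equation 2.13

sigma_V = np.sqrt(
    (sigma_e / g_0)**2 * (E_e - E_0)**2 * tau_e / (tau_e + tau_0)
    + (sigma_i / g_0)**2 * (E_i - E_0)**2 * tau_i / (tau_i + tau_0)
)  # equation 2.19
print(E_0, tau_0, sigma_V)
```

For these assumed values, the synaptic drive more than doubles the resting conductance, shortening the effective time constant from τ_L = C/g_L = 20 ms to τ_0 ≈ 7.7 ms.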
If the limit τe , τi → 0 is correctly taken (by keeping the quantities c e τe /C and c i τi /C fixed), it can be shown that this variance is compatible with previous results derived for the gaussian approximation of white noise conductance-based synaptic drive (Burkitt et al., 2003). However, for filtered noise, the variance in equation 2.19 differs significantly from that derived in Rudolph and Destexhe (2003). In that study, a one-dimensional Fokker-Planck equation was used that could not capture the effects of synaptic filtering. Through the introduction of effective synaptic time constants (Rudolph et al., 2004), the one-dimensional Fokker-Planck equation can be
made to yield results that correspond, at the gaussian level and in the steady state, to the distribution parameterized by equations 2.12 and 2.19.

2.5 The Aim of This Letter. The gaussian approximation provides a mathematically convenient approach to the analysis of conductance-based synaptic drive and is accurate for parameter values relevant to experiment (Richardson, 2004). Given the analysis presented above, it is clear that to improve on the gaussian approximation, both shot noise and conductance fluctuations must be included. The goal of the next two sections will be to develop a perturbative method that allows for the consistent calculation of the conductance and voltage distributions at a higher order than the gaussian approximation. These higher-order calculations will yield the skew of the voltage distribution, a quantity that is measurable experimentally. More important, the approach will provide information on the validity of fitting gaussian-level analytical forms for the mean and variance to voltage traces of cortical neurons. To aid readability, only the results of the calculations are given in the main body of the article. However, the methods developed here are applicable to other areas of theoretical neuroscience, such as the distribution of amplitudes at depressing synapses (Hahnloser, 2003) or the shape of synaptic weight distributions (van Rossum, Bi, & Turrigiano, 2000; Rubin, Lee, & Sompolinsky, 2001), and are therefore presented in full in the appendixes.

3 Synaptic Shot Noise and Conductance Distributions

In this section, the effects of shot noise on the synaptic conductance distributions will be analyzed. It should be noted that relaxation processes with shot noise, of which equations 2.2 and 2.3 are examples, have been well studied, and an exact solution (Gilbert & Pollak, 1960) for the distribution, in the form of a recursion relation, does exist.
However, the aim of the approach (in section 4) is to incorporate the shot-noise conductance fluctuations into a model of the membrane voltage. A perturbative approach is better suited to this purpose. For this reason, the full solution for the shot-noise distribution p_S(g_e) will not be presented here but, when needed, will be obtained by numerical simulation of equation 2.2.

3.1 The Diffusion Approximation Misses the Skew. In the limit where the standard deviation σ_e has a similar magnitude to the conductance mean g_{e0}, the diffusion approximation, unlike the full model, predicts negative conductances. A second source of difference between the statistics of shot noise and the diffusion approximation is also seen in the same limit: the distribution of the shot noise conductance becomes skewed, an effect that is obviously missed by the gaussian distribution given in equation 2.8. In order to get some intuition about the skew of the distribution, a
[Figure 1 appears here.]
Figure 1: Distribution of shot noise conductance fluctuations; the perturbation theory improves on the diffusion approximation. (A) Comparison of the full distribution p S generated by the simulation of equation 2.2 to the diffusion approximation p D (see equation 2.8) and the perturbation theory p P (see equation 3.4) for the case σe /ge0 = 0.60 (Re τe = 1.5). (B) The corresponding absolute difference between the diffusion approximation and full solution | p D − p S | and also the perturbatively generated distribution and the full solution | p P − p S |. The perturbative distribution reduces the error caused by both the negative conductances and the skew. (C, D) Analogous measures for the case σe /ge0 = 0.41 (Re τe = 3.0) for which the theoretical approaches can be expected to be more accurate. Details of the simulations are given in appendix A.
comparison can be made between the full and approximate distributions shown in Figure 1A. In this case (for which Re τe = 1.5 implying σe /ge0 = 0.60), the peak of the shot noise distribution is to the left of that of the gaussian. Because both distributions have the same mean conductance ge /c e = 1.5, the shot noise distribution is skewed; it leans to the left with
a longer tail to the right. Any improvement of the diffusion approximation should address both the negative conductances and the skew of the conductance distribution.

3.2 Accounting for the Shot Noise. The corrections identified in equation 2.10 will now be accounted for. A stochastic variable ζ_e(t), analogous to the gaussian white noise ξ_e(t),

\[ \tau_e\,\frac{dg_e}{dt} = g_{e0} - g_e + \sqrt{2}\,\sigma_e\,\zeta_e(t), \tag{3.1} \]
can be constructed that has statistics that capture the shot noise fluctuations correctly up to the next order missed by the diffusion approximation. It can be shown that such a quantity must obey the same first- and second-order correlators as gaussian white noise,

\[ \langle \zeta_e(t) \rangle = 0, \qquad \langle \zeta_e(t)\,\zeta_e(t') \rangle = \tau_e\,\delta(t-t'), \tag{3.2} \]
but also a new third-order correlator,

\[ \langle \zeta_e(t)\,\zeta_e(t')\,\zeta_e(t'') \rangle = \sqrt{2}\,\frac{\sigma_e}{g_{e0}}\,\tau_e^2\, \delta(t-t')\,\delta(t'-t''). \tag{3.3} \]
It is this third-order correlator, proportional to σ_e/g_{e0}, that provides the leading-order correction to the diffusion approximation. All higher-order correlators of products of ζ_e(t) factorize in terms of these first-, second-, and third-order correlators. Using the rules in equations 3.2 and 3.3, the conductance distribution can be shown (see appendix B) to be

\[ p_P(h_e) = \frac{1}{\sqrt{2\pi}} \left[ 1 + \frac{4}{3}\,\frac{\sigma_e}{g_{e0}} \left( \frac{h_e^3}{3!} - \frac{h_e}{2!} \right) \right] \exp\left( -\frac{h_e^2}{2} \right), \tag{3.4} \]
where h_e = (g_e − g_{e0})/σ_e is the normalized conductance and the subscript P denotes that the result was derived as a perturbative expansion in the small variable σ_e/g_{e0}. The distribution takes the form of a gaussian modulated by a prefactor. To zero order in σ_e/g_{e0}, the prefactor is equal to one, and the gaussian distribution 2.8 is recovered. The prefactor terms proportional to σ_e/g_{e0} now allow for the moments of the distribution to be calculated at higher order. The mean and variance are unchanged, as would be expected given the previous comments about the exactness of these two moments. The first new result of the perturbation theory is the skew S_{g_e} of the distribution:

\[ S_{g_e} = \frac{1}{\sigma_e^3}\, \big\langle (g_e - g_{e0})^3 \big\rangle = \big\langle h_e^3 \big\rangle = \frac{4}{3}\,\frac{\sigma_e}{g_{e0}}. \tag{3.5} \]
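These statements can be confirmed by direct quadrature of equation 3.4; the value of σ_e/g_{e0} below is an arbitrary choice inside the perturbative regime:

```python
import numpy as np

# Numerical moments of the perturbative density, equation 3.4.
# The correction terms are odd in h_e, so normalization, mean, and
# variance are unchanged, while the skew becomes (4/3)(sigma_e/g_e0).
sr = 0.4  # sigma_e / g_e0, illustrative value
h = np.linspace(-12.0, 12.0, 240_001)
dh = h[1] - h[0]
prefactor = 1.0 + (4.0 / 3.0) * sr * (h**3 / 6.0 - h / 2.0)
p = prefactor * np.exp(-h**2 / 2.0) / np.sqrt(2.0 * np.pi)

norm = np.sum(p) * dh          # -> 1
mean = np.sum(h * p) * dh      # -> 0
var = np.sum(h**2 * p) * dh    # -> 1
skew = np.sum(h**3 * p) * dh   # -> (4/3) * sr, equation 3.5
print(norm, mean, var, skew)
```

The simple Riemann sum is accurate here because the integrand decays as a gaussian well inside the truncation range.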
A useful aspect of the perturbation theory is that this skew is exact. The distribution itself and its higher moments are, however, correct only at the given order of the series expansion in σ_e/g_{e0}. Two examples comparing the numerically generated conductance distribution p_S, the diffusion approximation p_D, and the perturbation theory p_P are plotted in Figure 1.

4 The Subthreshold Voltage Distribution

The model of synaptic conductance studied in the previous section can now be incorporated into the membrane voltage equation. This will allow the voltage distribution to be calculated at the next order beyond the gaussian approximation. The method involves a perturbative solution to the voltage equation 2.1, the excitatory synaptic conductance equation 3.1, and its inhibitory analog. For the perturbative calculation of the voltage distribution, it is convenient to use the following small parameters,

\[ x_e = \sigma_e/g_0 \quad\text{and}\quad x_i = \sigma_i/g_0, \tag{4.1} \]
which are linearly related (in σ_e, σ_i) to the small parameters of the conductance expansion σ_e/g_{e0} and σ_i/g_{i0}. The calculation for the voltage distribution is given in appendix C and, in terms of v = V − E_0, can be written in the form

\[ p_P(v) = \frac{1}{\sqrt{2\pi\sigma_V^2}} \left[ 1 + \frac{v}{\sigma_V} \left( \frac{\mu_V}{\sigma_V} - \frac{S}{2!} \right) + \frac{S}{3!}\,\frac{v^3}{\sigma_V^3} \right] \exp\left( -\frac{v^2}{2\sigma_V^2} \right), \tag{4.2} \]
where the subscript P denotes the perturbatively generated result. The voltage appears only through the ratio v/σ_V, and the other terms µ_V/σ_V and S are parameters proportional to x_e, x_i: this distribution generates moments ⟨v^m⟩/σ_V^m that are correct up to order x_e, x_i. The quantity µ_V is the leading-order correction to the voltage mean E_0 and stems from the conductance fluctuations only: the shot noise does not influence the mean voltage. The standard deviation, given by equation 2.19, is identical to the gaussian value σ_V and is therefore unaffected by shot noise or multiplicative conductance at this order in the perturbation expansion. Thus,

\[ \langle V \rangle - E_0 = \mu_V \quad\text{and}\quad \big\langle (V - \langle V \rangle)^2 \big\rangle = \sigma_V^2. \tag{4.3} \]
The third-order moment of the distribution 4.2 gives the skew of the voltage distribution,

\[ \frac{1}{\sigma_V^3}\, \big\langle (V - \langle V \rangle)^3 \big\rangle = S = S_{SN} + S_{CF}. \tag{4.4} \]
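A quadrature check analogous to the one for the conductance density confirms equations 4.3 and 4.4; the values of µ_V, σ_V, and S below are arbitrary small numbers chosen for illustration:

```python
import numpy as np

# Moments of the perturbative voltage density, equation 4.2, checked
# against equations 4.3 and 4.4 for illustrative parameter values.
mu_V, sigma_V, S = 0.05, 1.0, 0.3

v = sigma_V * np.linspace(-12.0, 12.0, 240_001)
dv = v[1] - v[0]
u = v / sigma_V
prefactor = 1.0 + u * (mu_V / sigma_V - S / 2.0) + (S / 6.0) * u**3
p = prefactor * np.exp(-u**2 / 2.0) / np.sqrt(2.0 * np.pi * sigma_V**2)

mean = np.sum(v * p) * dv       # equals mu_V (equation 4.3)
second = np.sum(v**2 * p) * dv  # equals sigma_V^2 (equation 4.3)
skew = np.sum((v - mean)**3 * p) * dv / sigma_V**3  # ~ S (equation 4.4)
print(mean, second, skew)
```

The central third moment matches S only to leading order: the residual for these values is 2µ_V³ ≈ 2.5 × 10⁻⁴, consistent with the distribution being accurate to first order in x_e, x_i.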
From the expression given in appendix C, equation C.20, it can be seen that two distinct contributions to the skew naturally arise: one from the shot noise S_SN and a second from the conductance fluctuations S_CF. These two contributions to the skew are equally significant because they are both proportional to x_e, x_i. This illustrates one of the central points of this study: the diffusion approximation of a conductance-based model with multiplicative noise is inconsistent because it misses the shot noise contribution S_SN. The full set of equations for µ_V, σ_V, and S is given in appendix D.

4.1 An Example with Relevance to Experiment. To illustrate the effects of shot noise and conductance fluctuations, a scenario is considered in which the fluctuations due to the inhibitory component of the drive can be neglected. There are two different situations that allow this action to be taken. The first is when inhibition is absent. The second, and more interesting, case is relevant to experiments designed to isolate the effect of excitation on the membrane voltage (Silberberg, Wu, & Markram, 2004). In such experiments, the neuron is hyperpolarized through the injection of current so that the mean voltage E_0 is near the reversal of inhibition E_i. In such cases, the factor E_i − E_0 multiplying all inhibitory contributions to membrane fluctuations is relatively small, and such contributions can be dropped without significant loss of accuracy. Inhibition enters only through an increase of the tonic conductance g_0 and the corresponding decrease of the effective time constant τ_0. For either of these scenarios, the moments that parameterize the distribution in equation 4.2 take the values

\[ \mu_V = -x_e^2\,(E_e-E_0)\,\frac{\tau_e}{(\tau_e+\tau_0)} \tag{4.5} \]

\[ \sigma_V^2 = x_e^2\,(E_e-E_0)^2\,\frac{\tau_e}{(\tau_e+\tau_0)} \tag{4.6} \]

\[ S_{SN} = x_e\,\frac{8}{3}\,\frac{g_0}{g_{e0}}\,\frac{(\tau_e+\tau_0)^2}{(\tau_e+2\tau_0)(2\tau_e+\tau_0)} \sqrt{\frac{\tau_e}{(\tau_e+\tau_0)}} \tag{4.7} \]

\[ S_{CF} = -4\,x_e\,\frac{3\tau_e^2+6\tau_e\tau_0+2\tau_0^2}{(\tau_e+2\tau_0)(2\tau_e+\tau_0)} \sqrt{\frac{\tau_e}{(\tau_e+\tau_0)}} \tag{4.8} \]
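The two contributions can be evaluated numerically. The sketch below uses the low-conductance membrane parameters quoted in Figure 2 and checks the result against the ratio formula of equation 4.9, which for purely excitatory drive follows from the identity g_0/g_{e0} = τ_L/(τ_L − τ_0):

```python
import numpy as np

# Skew contributions S_SN (equation 4.7) and S_CF (equation 4.8) for
# purely excitatory drive, checked against the ratio of equation 4.9.
# Parameters follow the low-conductance example of Figure 2.
C = 1.0        # uF/cm^2
g_L = 0.05     # mS/cm^2
g_e0 = 0.0167  # mS/cm^2
tau_e = 3.0    # ms
x_e = 0.2      # sigma_e / g_0

g_0 = g_L + g_e0  # purely excitatory drive
tau_0 = C / g_0   # effective membrane time constant, ~15 ms
tau_L = C / g_L   # leak time constant, 20 ms

shape = np.sqrt(tau_e / (tau_e + tau_0))
denom = (tau_e + 2.0 * tau_0) * (2.0 * tau_e + tau_0)
S_SN = x_e * (8.0 / 3.0) * (g_0 / g_e0) * (tau_e + tau_0)**2 / denom * shape
S_CF = -4.0 * x_e * (3.0 * tau_e**2 + 6.0 * tau_e * tau_0
                     + 2.0 * tau_0**2) / denom * shape

ratio = (2.0 / 3.0) * (tau_L / (tau_L - tau_0)) \
        * (tau_e + tau_0)**2 / (3.0 * (tau_e + tau_0)**2 - tau_0**2)  # eq. 4.9
print(S_SN, S_CF, S_SN / abs(S_CF), ratio)
```

In this low-conductance state, the positive shot-noise and negative conductance-fluctuation contributions nearly cancel, consistent with the almost gaussian voltage distribution reported in Figure 2.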
Equations 4.7 and 4.8 give the positive and negative contributions to the skew (see equation 4.4) that come from the shot noise and conductance fluctuations, respectively. For the case of purely excitatory drive, g_0 = g_L + g_{e0}, the relative importance of these contributions can be gauged by examining the ratio

\[ \frac{S_{SN}}{|S_{CF}|} = \frac{2}{3}\,\frac{\tau_L}{(\tau_L-\tau_0)}\,\frac{(\tau_e+\tau_0)^2}{3(\tau_e+\tau_0)^2-\tau_0^2}, \tag{4.9} \]
[Figure 2 appears here. Upper panels, low-conductance state: g_0 = 0.0667 mS/cm², τ_0 = 15 ms; lower panels, high-conductance state: g_0 = 0.2 mS/cm², τ_0 = 5 ms. The panels show the conductance distribution, the voltage distribution, and the skew contributions S_SN, S_CF, and S_SN + S_CF as functions of x_e = σ_e/g_0.]
Figure 2: Distribution of the membrane voltage; perturbation theory captures the skew. A neuron is subject to a purely excitatory synaptic drive with a current Iapp applied such that E 0 = −60 mV. (A, B) The conductance and voltage distributions for a low conductance state (ge0 = 0.0167, g L = 0.05 mS/cm2 ) with noise strength xe = σe /g0 = 0.2. The perturbative conductance distribution (see equation 3.4) is not accurate because σe /ge0 = 0.8. The weak skew of the corresponding voltage distribution (B) is, however, correctly predicted by the perturbation theory (see equation 4.2) because the underlying conductance skew is exact. (C) The voltage skew (see equations 4.7 and 4.8) is plotted as a function of xe for the same parameters, but with increasing noise σe . The shot noise S SN and conductance-fluctuation SC F contributions to the skew nearly cancel, explaining the almost gaussian voltage distribution in B. (D, E) A high conductance state (ge0 = 0.15 mS/cm2 ) with xe = σe /g0 = 0.4. (E) The large skew of the voltage distribution is captured by the perturbation theory. (F) The voltage skew is negative for the high-conductance case because SC F dominates. Details of the simulations are given in appendix A.
where τ L = C/g L is the leak time constant. The ratio is a monotonically increasing function of the effective time constant τ0 . In the limit of low conductance states, for which τ0 → τ L , the ratio diverges, and the contribution due to conductance fluctuations becomes negligible. For high-conductance states, for which τ0 → 0, the ratio converges to a constant value of 2/9. These
results underline the fact that the effect of shot noise is nonnegligible: even in extremely high conductance states, it still comprises just under a third, |S_SN/S| = 2/7, of the net skew. These results are illustrated graphically in Figure 2.

5 Discussion

The effect that shot noise synaptic drive has on the membrane voltage distribution was examined. A perturbative approach was developed that was first used to capture the statistics of filtered shot-noise conductance fluctuations beyond both the gaussian effective-time-constant approximation and the diffusion approximation. These synaptic conductances were then incorporated into a model of the membrane voltage response. The approach allowed for the analysis of nongaussian features of the voltage distribution, such as its skew. In particular, it was shown that shot noise and synaptic conductance fluctuations affect the membrane at the same order: both effects need to be taken into account for a consistent approach.

The regime in which the effects of shot noise on the voltage and firing rate might be clearly seen experimentally is one of low presynaptic rate and large, sharp excitatory postsynaptic potentials (EPSPs). This is typical of the excitatory drive experienced by certain neocortical interneurons (Silberberg et al., 2004), for which isolated EPSPs can be many millivolts and there is little dendritic filtering. For a case in which the effects of shot noise are strong (outside the perturbative regime considered here), the voltage distribution can be considerably positively skewed, with increased probability of being near threshold. It is expected that in such a case, the statistics of the generated action potentials would differ significantly from those predicted using a gaussian model of the membrane fluctuations with the same mean and variance.
The gaussian, or effective-time-constant, approximation for the membrane distribution is, however, mathematically simple: the mean (see equation 2.12) and variance (see equation 2.19) are transparent functions of the model parameters. Such gaussian distributions are therefore ideal to fit to experimental data (Rudolph et al., 2004) in cases where the shot noise effects are weak. The functional form of the distribution that takes into account the shot noise and conductance fluctuations is, however, somewhat less transparent, as can be seen in equations D.5 and D.6 for the skew. So the question should be asked: To what extent would weak higher-order effects interfere with an attempt to fit the mean and variance to an experimental distribution? This question can be answered in the framework presented here. First, it is seen from equations 4.5 and D.3 that the correction to the mean voltage due to shot noise and conductance fluctuations is of order $x_e^2, x_i^2$:

$$\langle V\rangle = E_0 + \mu_V + \cdots = E_0 + O(x_e^2, x_i^2).$$
(5.1)
Shot Noise and Conductance Fluctuations
937
Hence, the mean is not affected at first order. The same is true for the measured variance,

$$\langle(V - \langle V\rangle)^2\rangle = \langle(V - E_0)^2\rangle - \langle V - E_0\rangle^2 = \sigma_V^2\left(1 + O(x_e^2, x_i^2)\right),$$
(5.2)
which also deviates only at order $x_e^2, x_i^2$, despite the fact that the skew grows linearly with $x_e, x_i$. These results demonstrate that information extracted from the voltage mean using equation 2.12 and variance using equation 2.19 is not strongly affected by the shot noise and conductance fluctuations missed in the gaussian approximation. Hence, fitting the gaussian-level moments to voltage traces is a robust method, given that equations 2.1 and 2.4 and their inhibitory counterparts provide a sufficiently realistic model of the effect of synaptic drive on the membrane voltage.

In summary, the gaussian effective-time-constant approximation provides an accurate description of the voltage fluctuations and is a convenient tool for fitting theory to experiment. For most situations, its description of the stochastic voltage dynamics due to conductance-based synaptic drive is adequate, and it can be easily extended to include many biological details (such as voltage-dependent currents, dynamic synapses, heterogeneity, nontrivial temporal correlations in the drive, and others) missed in the simplified model considered here. Nevertheless, for the purposes of detailed modeling of conductance-based synaptic drive, it should not be overlooked that shot noise and conductance fluctuations are equally important. Our results demonstrate that diffusion-based approaches, such as the Fokker-Planck equation or simulation using multiplicative filtered gaussian noise, are inadequate for the description of the nongaussian statistics of the voltage. If the aim is to model or simulate the statistics of voltage fluctuations beyond the gaussian effective-time-constant approximation, then synaptic shot noise must be included.

Appendix A: Details of the Simulations

The parameters used for the simulations were τ_e = 3 ms for the excitatory synaptic filtering, C = 1 µF/cm² for the membrane capacitance, and g_L = 0.05 mS/cm² for the leak conductance.
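The integration scheme used in this appendix (the Euler update of equation A.1, given below) can be sketched in a few lines. In the sketch, τ_e matches the value quoted above, but c_e and R_e are arbitrary illustrative choices, not parameters from this study. In the stationary state, the mean conductance should approach c_e R_e τ_e:

```python
import numpy as np

# Euler integration of the shot-noise conductance update, equation A.1.
# tau_e matches the appendix; c_e and R_e are illustrative assumptions.
rng = np.random.default_rng(1)
tau_e = 3.0       # excitatory filter time constant (ms)
c_e = 0.01        # conductance jump per pulse (mS/cm^2), assumed
R_e = 1.0         # total incoming pulse rate (pulses/ms), assumed
dt = 0.01         # time step (ms)
n_steps = 500_000

kicks = rng.poisson(R_e * dt, n_steps)   # P(R_e dt): Poisson pulse counts
g = np.empty(n_steps + 1)
g[0] = c_e * R_e * tau_e                 # start at the stationary mean
for k in range(n_steps):
    g[k + 1] = g[k] - (dt / tau_e) * g[k] + c_e * kicks[k]

g_mean = g.mean()  # stationary mean of equation A.1 is c_e * R_e * tau_e
```

Because the Poisson pulse count enters the update directly, no gaussian (diffusion) approximation is made at the level of the input.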
The reversal potentials used were E_L = −80 mV for the leak and E_e = 0 mV for synaptic excitation. Simulations were performed using the Euler method, with the Poissonian synaptic shot noise implemented by integrating the conductance equation 2.2 to yield

$$g_e(t+dt) = g_e(t) - \frac{dt}{\tau_e}\,g_e(t) + c_e\,P(R_e\,dt),$$
(A.1)

where c_e is the postsynaptic conductance amplitude for a single pulse and R_e is the total rate of incoming pulses. The quantity P(R_e dt) is the random
number of incoming pulses that arrive within the time step dt. The number is drawn from a Poisson distribution with mean R_e dt.

Appendix B: Filtered Poissonian Shot Noise

The method for expanding higher-order gaussian correlators is first reviewed. The first- and second-order correlators are given in equation set 2.5. All higher odd-order correlators vanish, and higher even-order correlators (of order 2n) factorize into

$$\frac{(2n)!}{2^n n!}$$
(B.1)
permutations of products of n second-order correlators. As an example, and writing ξ(t_1) = ξ_1 for simplicity, the fourth-order correlator is

$$\langle\xi_1\xi_2\xi_3\xi_4\rangle = \langle\xi_1\xi_2\rangle\langle\xi_3\xi_4\rangle + \langle\xi_1\xi_3\rangle\langle\xi_2\xi_4\rangle + \langle\xi_1\xi_4\rangle\langle\xi_2\xi_3\rangle.$$
(B.2)
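The combinatorial content of equations B.1 and B.2 is easy to verify mechanically: the number of ways to partition 2n labeled time points into unordered pairs is (2n)!/(2^n n!). A small brute-force sketch:

```python
from math import factorial

def count_pairings(elems):
    """Count the ways to partition a list of labels into unordered pairs."""
    if not elems:
        return 1
    first, rest = elems[0], elems[1:]
    total = 0
    for k in range(len(rest)):
        # pair `first` with rest[k], then pair up whatever remains
        total += count_pairings(rest[:k] + rest[k + 1:])
    return total

for n in range(1, 6):
    formula = factorial(2 * n) // (2**n * factorial(n))   # equation B.1
    assert count_pairings(list(range(2 * n))) == formula
# n = 1..5 gives 1, 3, 15, 105, 945 pairings; n = 2 reproduces the three
# terms of the fourth-order correlator in equation B.2.
```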
A fluctuating quantity ζ_e(t) is now introduced, with statistics constructed so as to capture the effects of shot noise at a higher order than gaussian white noise ξ(t). The factorization properties of high-order correlators of ζ_e(t) can be derived from its first-, second-, and third-order correlators, defined in equation set 3.2 and equation 3.3. These rules can be derived by expanding the Poissonian distribution of the shot-noise Ornstein-Uhlenbeck equation 2.2 and keeping terms beyond the usual diffusion approximation (see, e.g., Rodriguez, Pesquera, San Miguel, & Sancho, 1985). To order σ_e/g_{e0}, higher even-order correlators obey the usual gaussian factorization rules, and higher odd-order correlators can be decomposed into permutations of a product of a single third-order correlator and an appropriate number of second-order correlators. As an example, and using the shorthand ζ_e(t_1) = ζ_1, the seventh-order correlator is factorized as follows:

$$\langle\zeta_1\zeta_2\zeta_3\zeta_4\zeta_5\zeta_6\zeta_7\rangle = \langle\zeta_1\zeta_2\zeta_3\rangle\langle\zeta_4\zeta_5\rangle\langle\zeta_6\zeta_7\rangle + \text{permutations},$$
(B.3)
where for this case there are 7 · 6 · 5 permutations of the indices of the third-order correlator. Each remaining fourth-order correlator can then be decomposed using the usual gaussian rules (see equation B.2). It is important to note that no further third-order correlators are extracted out of the remaining even-order product; otherwise, this would produce terms that go beyond the σ_e/g_{e0} correction. Hence, for a (2n + 3)-order correlator, there are

$$(2n+3)(2n+2)(2n+1)\cdot\frac{(2n)!}{2^n n!}$$
(B.4)
permutations. The first set of three factors comes from the different ways of arranging the single third-order correlator, and the final factor comes from the gaussian statistics of the reduction of the remaining even-order correlator.

B.1 The Conductance Distribution and Correlators. The normalized conductance variable h_e = (g_e − g_{e0})/σ_e is introduced to simplify the following analysis. It obeys the equation

$$\tau_e\frac{dh_e}{dt} = -h_e + \sqrt{2}\,\zeta_e(t),$$
(B.5)

which can be integrated to yield

$$h_e(t) = \sqrt{2}\int_{-\infty}^{t}\frac{ds}{\tau_e}\,e^{-(t-s)/\tau_e}\,\zeta_e(s).$$
(B.6)

From this, the correlators of the conductance are found to be

$$\langle h_e(t)\rangle = 0, \qquad \langle h_e(t)h_e(t')\rangle = \exp(-|t-t'|/\tau_e),$$
$$\langle h_e(t)h_e(t')h_e(t'')\rangle = \frac{4}{3}\frac{\sigma_e}{g_{e0}}\exp(-|t-t'|/\tau_e)\exp(-|t-t''|/\tau_e),$$
(B.7)
with higher-order correlators derivable from these using the underlying factorization rules for ζ_e(t). The steady-state distribution of the variable h_e(t) can be obtained by calculating the probability density that h_e(t) is found near a value h_e:

$$p(h_e) = \langle\delta(h_e - h_e(t))\rangle = \int_{-\infty}^{\infty}\frac{dq}{2\pi}\,e^{-iq h_e}\,\langle e^{iq h_e(t)}\rangle.$$
(B.8)

The exponential is expanded to give

$$\langle e^{iq h_e(t)}\rangle = \sum_{m=0}^{\infty}\frac{(iq)^{2m}}{(2m)!}\langle h_e^{2m}(t)\rangle + \sum_{m=0}^{\infty}\frac{(iq)^{3+2m}}{(3+2m)!}\langle h_e^{3+2m}(t)\rangle.$$
(B.9)
The structure of the correlators allows this to be rewritten as

$$\langle e^{iq h_e(t)}\rangle = \left(1 + (iq)^3\frac{4\sigma_e}{3g_{e0}}\right)\sum_{m=0}^{\infty}\frac{1}{m!}\left(\frac{-q^2}{2}\right)^m = \left(1 + (iq)^3\frac{4\sigma_e}{3g_{e0}}\right)e^{-q^2/2},$$
(B.10)
which can be inserted into equation B.8,

$$p(h_e) = \left(1 - \frac{4\sigma_e}{3g_{e0}}\frac{d^3}{dh_e^3}\right)\int_{-\infty}^{\infty}\frac{dq}{2\pi}\,e^{-iq h_e}\,e^{-q^2/2},$$
(B.11)
to yield the distribution given in equation 3.4.

Appendix C: The Membrane Distribution

The statistics of the conductance fluctuations (given in equation 3.1) are now incorporated into a model of a passive membrane (see equation 2.1). For the following analysis, it is convenient to use the shifted voltage v = V − E_0, with the normalized conductances h_e, h_i defined in equations B.5 and B.6:

$$\tau_0\dot v + v(1 + x_e h_e + x_i h_i) = x_e\mathcal{E}_e h_e + x_i\mathcal{E}_i h_i,$$
(C.1)
where τ_0 = C/g_0 and E_0 are defined by equation 2.12, $\mathcal{E}_e = E_e - E_0$, and x_e = σ_e/g_0 provides the small parameter used for the perturbative analysis of the voltage (with a similar definition of x_i). Because these small parameters are linearly related to those used for the conductance perturbation theory, corrections due to shot noise and conductance fluctuations will be simultaneously accounted for. Equation C.1 can be integrated to give

$$v(t) = \int_{-\infty}^{t}\frac{ds}{\tau_0}\,e^{-(t-s)/\tau_0}\,\alpha(s)\,\exp\left(-\int_{s}^{t}\frac{dr}{\tau_0}\,\beta(r)\right),$$
(C.2)

where the terms α(s) generate corrections to voltage-like quantities and β(r) generates corrections to the effective time constant:

$$\alpha(s) = x_e\mathcal{E}_e h_e(s) + x_i\mathcal{E}_i h_i(s), \qquad \beta(r) = x_e h_e(r) + x_i h_i(r).$$
(C.3)
The voltage distribution can now be obtained by evaluating the expectation

$$p(v) = \langle\delta(v - v(t))\rangle = \int_{-\infty}^{\infty}\frac{dq}{2\pi}\,e^{-iq v}\,\langle e^{iq v(t)}\rangle,$$
(C.4)
to the appropriate order in xe and xi . No correlations are assumed to exist between excitation and inhibition. This simplifying assumption can be relaxed, and the method used here easily extended to account for such correlations.
C.1 The Leading-Order Voltage Distribution. The derivation (Richardson, 2004) of the leading-order contribution to the voltage distribution of equation set 2.1 to 2.3 is first reviewed. The fluctuations of the voltage from its mean value are written as v(t) = σ(t) + O(x_e², x_i²), where

$$\sigma(t) = \int_{-\infty}^{t}\frac{ds}{\tau_0}\,e^{-(t-s)/\tau_0}\left(x_e\mathcal{E}_e h_e(s) + x_i\mathcal{E}_i h_i(s)\right).$$
(C.5)
In this approximation, the leading-order probability density is a gaussian, as can be seen by examining

$$p_0(v) = \int_{-\infty}^{\infty}\frac{dq}{2\pi}\,e^{-iq v}\,\langle e^{iq\sigma(t)}\rangle,$$
(C.6)
(C.6)
where the expectation
$$\langle e^{iq\sigma(t)}\rangle = 1 - \frac{q^2}{2}\langle\sigma(t)^2\rangle + \frac{q^4}{4!}\langle\sigma(t)^4\rangle - \cdots = e^{-\frac{1}{2}q^2\sigma_V^2}$$
(C.7)
is evaluated using the gaussian relation $\langle\sigma(t)^{2n}\rangle = (2n)!\,\langle\sigma(t)^2\rangle^n/2^n n!$ At this order, there are no contributions from the shot noise. From equation C.5, the expectation ⟨σ²(t)⟩ = σ_V² is time independent and takes the value

$$\sigma_V^2 = x_e^2\mathcal{E}_e^2\,\frac{\tau_e}{\tau_e+\tau_0} + x_i^2\mathcal{E}_i^2\,\frac{\tau_i}{\tau_i+\tau_0}.$$
(C.8)
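Equation C.8 can be evaluated directly; the sketch below uses purely illustrative values for the dimensionless noise strengths x_e, x_i, the reversal distances script-E, and the time constants (none of these numbers are taken from the paper):

```python
def sigma_V_sq(x_e, E_e, tau_e, x_i, E_i, tau_i, tau_0):
    """Gaussian-level voltage variance of equation C.8.

    E_e and E_i here stand for the reversal distances script-E_e = E_e - E_0
    and script-E_i = E_i - E_0. All example values are illustrative only.
    """
    return (x_e**2 * E_e**2 * tau_e / (tau_e + tau_0)
            + x_i**2 * E_i**2 * tau_i / (tau_i + tau_0))

# example: x_e = 0.1, script-E_e = 80 mV, tau_e = 3 ms,
#          x_i = 0.05, script-E_i = -10 mV, tau_i = 10 ms, tau_0 = 2 ms
var = sigma_V_sq(0.1, 80.0, 3.0, 0.05, -10.0, 10.0, 2.0)
```

Note that each synaptic pathway contributes a term filtered by the factor τ/(τ + τ_0), so faster membranes (smaller τ_0) transmit more of the conductance fluctuations into voltage variance.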
Reinserting the result, equation C.7, into the probability density,

$$p_0(v) = \int_{-\infty}^{\infty}\frac{dq}{2\pi}\exp\left(-\frac{\sigma_V^2}{2}\left(q + i\frac{v}{\sigma_V^2}\right)^2\right)\exp\left(-\frac{v^2}{2\sigma_V^2}\right),$$
(C.9)

and evaluating the integral gives a gaussian voltage distribution:

$$p_0(V) = \frac{1}{\sqrt{2\pi\sigma_V^2}}\exp\left(-\frac{(V-E_0)^2}{2\sigma_V^2}\right).$$
(C.10)
C.2 The Next-Order Correction to the Distribution. From the previous section, it is seen that the typical difference between the voltage and its mean scales with x_e, x_i. To develop a systematic expansion, the dimensionless variable y = v/σ_V is therefore introduced. At the next order, the expansion
can be written y(t) = σ_y(t) − φ_y²(t) + O(x_e², x_i²) + ···, where σ_y(t) = σ(t)/σ_V and φ_y² takes the form

$$\phi_y^2(t) = \frac{1}{\sigma_V}\int_{-\infty}^{t}\frac{ds}{\tau_0}\,e^{-(t-s)/\tau_0}\,\alpha(s)\int_{s}^{t}\frac{dr}{\tau_0}\,\beta(r).$$
(C.11)
This gives the probability density, correct to order x_e, x_i, as

$$p_0(y) + p_1(y) = \int_{-\infty}^{\infty}\frac{dq}{2\pi}\,e^{-iq y}\,\langle e^{iq(\sigma_y(t)-\phi_y^2(t))}\rangle.$$
Again the exponential within the expectation will be expanded and then evaluated to first order in φ_y²:

$$\langle e^{iq(\sigma_y(t)-\phi_y^2(t))}\rangle = \sum_{m=0}^{\infty}\frac{(iq)^{2m}}{(2m)!}\langle\sigma_y^{2m}\rangle + \sum_{m=0}^{\infty}\frac{(iq)^{3+2m}}{(3+2m)!}\langle\sigma_y^{3+2m}\rangle - \sum_{m=0}^{\infty}\frac{(iq)^{1+2m}}{(2m)!}\langle\sigma_y^{2m}\phi_y^2\rangle + O(x_e^2, x_i^2).$$
(C.12)
The first term on the right-hand side of equation C.12 is the zero-order gaussian component treated above, the second term is the correction due to the Poissonian nature of the noise, and the third term is the correction due to the conductance-based drive. The second term is straightforward to analyze. Using the rules for the permutation of correlators, this term can be expanded out to give

$$\sum_{m=0}^{\infty}\frac{(iq)^{3+2m}}{(3+2m)!}\langle\sigma_y^{3+2m}\rangle = (iq)^3\langle\sigma_y^3\rangle\sum_{m=0}^{\infty}\frac{1}{m!}\left(\frac{-q^2}{2}\right)^m,$$
(C.13)
which takes the form of a gaussian with a prefactor. To obtain the third term of equation C.12, expectations of the form ⟨σ_y^{2m}φ_y²⟩ need to be calculated. An examination of the structure of the integrals comprising this term shows that they can be written as

$$\langle\sigma_y^{2m}\phi_y^2\rangle = \langle\sigma_y^{2m}\rangle\langle\phi_y^2\rangle + 2m(2m-1)\,\langle\psi_y^4\rangle\langle\sigma_y^{2m-2}\rangle.$$
(C.14)
The expectation ⟨φ_y²⟩ can be calculated from the form given above, and ψ_y⁴ is defined by

$$\psi_y^4 = \frac{1}{\sigma_V^3}\int_{-\infty}^{t}\frac{ds_1}{\tau_0}\frac{ds_2}{\tau_0}\frac{ds_3}{\tau_0}\,e^{-(3t-s_1-s_2-s_3)/\tau_0}\int_{s_3}^{t}\frac{dr_3}{\tau_0}\,\alpha(s_1)\alpha(s_2)\alpha(s_3)\beta(r_3),$$
(C.15)
where ψ_y⁴ ~ O(x_e, x_i). An explicit form for this quantity will be given in appendix D. Substitution of the form C.14 into the third term of the expansion C.12 gives

$$\sum_{m=0}^{\infty}\frac{(iq)^{1+2m}}{(2m)!}\langle\sigma_y^{2m}\phi_y^2\rangle = \left(iq\,\langle\phi_y^2\rangle + (iq)^3\langle\psi_y^4\rangle\right)\sum_{m=0}^{\infty}\frac{1}{m!}\left(\frac{-q^2}{2}\right)^m.$$
(C.16)
Inserting the results of equations C.13 and C.16 into the expansion C.12 gives

$$\langle e^{iq(\sigma_y(t)-\phi_y^2(t))}\rangle = \left(1 + (iq)^3\langle\sigma_y^3\rangle - iq\,\langle\phi_y^2\rangle - (iq)^3\langle\psi_y^4\rangle\right)e^{-\frac{1}{2}q^2},$$
(C.17)
where the fact that ⟨σ_y²⟩ = 1 has been used. Inserting this into the leading-order correction to the distribution,

$$p_1(y) = \int_{-\infty}^{\infty}\frac{dq}{2\pi}\,e^{-iq y}\left((iq)^3\langle\sigma_y^3\rangle - iq\,\langle\phi_y^2\rangle - (iq)^3\langle\psi_y^4\rangle\right)e^{-\frac{1}{2}q^2}$$
$$= \left(\langle\phi_y^2\rangle\frac{d}{dy} + \left(\langle\psi_y^4\rangle - \langle\sigma_y^3\rangle\right)\frac{d^3}{dy^3}\right)\int_{-\infty}^{\infty}\frac{dq}{2\pi}\,e^{-iq y}\,e^{-\frac{1}{2}q^2}$$
$$= \left(-\langle\phi_y^2\rangle\,y + \left(\langle\sigma_y^3\rangle - \langle\psi_y^4\rangle\right)(y^3-3y)\right)\frac{1}{\sqrt{2\pi}}\,e^{-y^2/2},$$
(C.18)

yields the perturbatively generated distribution, correct to order x_e, x_i, with the following functional form,

$$p(y) = \frac{1}{\sqrt{2\pi}}\left(1 + y\left(\mu_y - \frac{S}{2!}\right) + y^3\,\frac{S}{3!}\right)\exp\left(-\frac{y^2}{2}\right),$$
(C.19)

where μ_y is the correction to the mean voltage and S is the skew,

$$\mu_y = \langle y\rangle = -\langle\phi_y^2\rangle \qquad\text{and}\qquad S = \langle(y-\langle y\rangle)^3\rangle = 6\left(\langle\sigma_y^3\rangle - \langle\psi_y^4\rangle\right),$$
(C.20)

and the equalities hold to first order in the perturbation theory. The first correction to the mean of y is affected only by the synaptic conductance. However, there are two components of the skew, S = S_SN + S_CF: a contribution
S_SN = 6⟨σ_y³⟩ from the Poissonian nature of the drive and a contribution S_CF = −6⟨ψ_y⁴⟩ from synaptic conductance fluctuations. The functional forms of μ_y and S will be evaluated via the quantities ⟨φ_y²⟩, ⟨σ_y³⟩, and ⟨ψ_y⁴⟩ in the next section.

Appendix D: The Voltage Mean and Skew

At this order in perturbation theory, any of the higher-order cumulants of the voltage distribution can be expressed in terms of the mean μ_y and the skew S,

$$\langle y^{2m}\rangle = \frac{(2m)!}{2^m m!} \qquad\text{and}\qquad \langle y^{2m+1}\rangle = \frac{(2m+2)!}{2^{m+1}(m+1)!}\left(\mu_y + \frac{m}{3}\,S\right),$$
(D.1)
where m = 0, 1, 2, .... Only the odd correlators differ from the gaussian approximation at this order.

D.1 Voltage Mean. The first quantity to be evaluated is the correction to the mean. Because of equation C.20, the integral

$$\langle\phi_y^2\rangle = \frac{1}{\sigma_V}\int_{-\infty}^{t}\frac{ds}{\tau_0}\,e^{-(t-s)/\tau_0}\int_{s}^{t}\frac{dr}{\tau_0}\,\langle\alpha(s)\beta(r)\rangle$$
(D.2)

must be evaluated. These integrals can be performed using the equation set B.7 and yield, for μ_V = ⟨v⟩,

$$\mu_V = -\left(x_e^2\mathcal{E}_e\,\frac{\tau_e}{\tau_e+\tau_0} + x_i^2\mathcal{E}_i\,\frac{\tau_i}{\tau_i+\tau_0}\right).$$
(D.3)
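As a numerical cross-check of equations C.19, C.20, and D.1, the low-order moments of the perturbative density can be computed by direct quadrature. The values of μ_y and S below are illustrative small numbers (as the expansion assumes), not quantities derived from the model:

```python
import numpy as np

# illustrative, small first-order corrections (both assumed O(x_e, x_i))
mu_y, S = 0.05, 0.12

y = np.linspace(-12.0, 12.0, 200_001)
h = y[1] - y[0]
phi = np.exp(-y**2 / 2) / np.sqrt(2 * np.pi)
p = (1 + y * (mu_y - S / 2) + y**3 * S / 6) * phi   # equation C.19

norm  = (p * h).sum()           # should be 1
mean  = (y * p * h).sum()       # should equal mu_y (equation C.20)
third = (y**3 * p * h).sum()    # should equal 3*mu_y + S (equation D.1, m = 1)
skew  = third - 3 * mean * (y**2 * p * h).sum() + 2 * mean**3
# skew equals S up to the second-order remainder 2*mu_y**3
```

The check confirms that the cubic and linear terms in equation C.19 conspire to leave the normalization and variance untouched while shifting the mean by μ_y and the third moment to 3μ_y + S.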
D.2 Voltage Skew: The Poissonian Contribution. Due to equation C.20, this requires the evaluation of

$$\langle\sigma_y^3\rangle = \frac{1}{\sigma_V^3}\int_{-\infty}^{t}\frac{ds\,ds'\,ds''}{\tau_0^3}\,e^{-(3t-s-s'-s'')/\tau_0}\,\langle\alpha(s)\alpha(s')\alpha(s'')\rangle,$$
(D.4)

which can be performed using the result for the third-order correlator given in equation set B.7. This yields, for S_SN = 6⟨σ_y³⟩,

$$S_{SN} = \frac{1}{\sigma_V^3}\left(\frac{8\mathcal{E}_e^3\tau_e^3 x_e^4\,(g_0/g_{e0})}{3(\tau_e+2\tau_0)(\tau_0+2\tau_e)} + \frac{8\mathcal{E}_i^3\tau_i^3 x_i^4\,(g_0/g_{i0})}{3(\tau_i+2\tau_0)(\tau_0+2\tau_i)}\right).$$
(D.5)
D.3 Voltage Skew: The Conductance Contribution. This is given by −6⟨ψ_y⁴⟩ and therefore requires the evaluation of the integral given in equation C.15. After some algebraic effort, the result can be written in the form

$$S_{CF} = -\frac{4x_e^4\mathcal{E}_e^3}{\sigma_V^3}\left(\frac{\tau_e}{\tau_e+\tau_0}\right)^2\frac{3\tau_e^2+6\tau_e\tau_0+2\tau_0^2}{(\tau_e+2\tau_0)(2\tau_e+\tau_0)} - \frac{4x_i^4\mathcal{E}_i^3}{\sigma_V^3}\left(\frac{\tau_i}{\tau_i+\tau_0}\right)^2\frac{3\tau_i^2+6\tau_i\tau_0+2\tau_0^2}{(\tau_i+2\tau_0)(2\tau_i+\tau_0)}$$
$$\quad - \frac{2x_e^2x_i^2\mathcal{E}_e^2\mathcal{E}_i}{\sigma_V^3}\,\frac{\tau_e\tau_i}{(\tau_e+\tau_0)(\tau_i+\tau_0)}\left(2 + \frac{(2\tau_e\tau_i+\tau_0(\tau_e+\tau_i))\,(2\tau_e(\tau_i+\tau_0)-\tau_i\tau_0)}{(2\tau_e+\tau_0)(2\tau_i+\tau_0)(\tau_e\tau_i+\tau_e\tau_0+\tau_i\tau_0)}\right)$$
$$\quad - \frac{2x_i^2x_e^2\mathcal{E}_i^2\mathcal{E}_e}{\sigma_V^3}\,\frac{\tau_i\tau_e}{(\tau_i+\tau_0)(\tau_e+\tau_0)}\left(2 + \frac{(2\tau_i\tau_e+\tau_0(\tau_i+\tau_e))\,(2\tau_i(\tau_e+\tau_0)-\tau_e\tau_0)}{(2\tau_i+\tau_0)(2\tau_e+\tau_0)(\tau_i\tau_e+\tau_i\tau_0+\tau_e\tau_0)}\right).$$
(D.6)

If only one synaptic input type is present, or if the average voltage is near the reversal of inhibition such that $\mathcal{E}_i = E_i - E_0 \simeq 0$, this result greatly simplifies. This case is given in equation 4.8 and compared to simulations of the full model in Figure 2.

References

Brunel, N., Chance, F. S., Fourcaud, N., & Abbott, L. F. (2001). Effects of synaptic noise and filtering on the frequency response of spiking neurons. Phys. Rev. Lett., 86, 2186–2189.
Burkitt, A. N. (2001). Balanced neurons: Analysis of leaky integrate-and-fire neurons with reversal potentials. Biol. Cybern., 85, 247–255.
Burkitt, A. N., & Clark, G. M. (1999). New technique for analyzing integrate and fire neurons. Neurocomputing, 26–27, 93–99.
Burkitt, A. N., Meffin, H., & Grayden, D. B. (2003). Study of neuronal gain in a conductance-based leaky integrate-and-fire neuron model with balanced excitatory and inhibitory synaptic input. Biol. Cybern., 89, 119–125.
Chance, F. S., Abbott, L. F., & Reyes, A. D. (2002). Gain modulation from background synaptic input. Neuron, 35, 773–782.
Destexhe, A., & Paré, D. (1999). Impact of network activity on the integrative properties of neocortical pyramidal neurons in vivo. J. Neurophysiol., 81, 1531–1547.
Destexhe, A., Rudolph, M., Fellous, J.-M., & Sejnowski, T. J. (2001). Fluctuating synaptic conductances recreate in vivo–like activity in neocortical neurons. Neuroscience, 107, 13–24.
Destexhe, A., Rudolph, M., & Paré, D. (2003). The high-conductance state of neocortical neurons in vivo. Nature Rev. Neurosci., 4, 739–751.
Fellous, J.-M., Rudolph, M., Destexhe, A., & Sejnowski, T. J. (2003). Synaptic background noise controls the input/output characteristics of single cells in an in vitro model of in vivo activity. Neuroscience, 122, 811–829.
Fourcaud, N., & Brunel, N. (2002). Dynamics of the firing probability of noisy integrate-and-fire neurons. Neural Comput., 14, 2057–2110.
Gilbert, E. N., & Pollak, H. O. (1960). Amplitude distributions of shot noise. Bell Syst. Tech. J., 39, 333–350.
Grande, L. A., Kinney, G. A., Miracle, G. L., & Spain, W. J. (2004). Dynamic influences on coincidence detection in neocortical pyramidal neurons. J. Neurosci., 24, 1839–1851.
Hahnloser, R. H. R. (2003). Stationary transmission distribution of random spike trains by dynamical synapses. Phys. Rev. E, 67, 022901.
Hohn, N., & Burkitt, A. N. (2001). Shot noise in the leaky integrate-and-fire neuron. Phys. Rev. E, 63, 031902.
Holmgren, C., Harkany, T., Svennenfors, B., & Zilberter, Y. (2003). Pyramidal cell communication within local networks in layer 2/3 of rat neocortex. J. Physiol. (London), 551, 139–153.
Johannesma, P. I. M. (1968). Diffusion models for the stochastic activity of neurons. In E. R. Caianello (Ed.), Neural networks (pp. 116–144). New York: Springer.
Jolivet, R., Lewis, T. J., & Gerstner, W. (2004). Generalized integrate-and-fire models of neuronal activity approximate spike trains of a detailed model to a high degree of accuracy. J. Neurophysiol., 92, 959–976.
Kamondi, A., Acsady, L., Wang, X.-J., & Buzsaki, G. (1998). Theta oscillations in somata and dendrites of hippocampal pyramidal cells in vivo: Activity-dependent phase-precession of action potentials. Hippocampus, 8, 244–261.
Kuhn, A., Aertsen, A., & Rotter, S. (2003). Higher-order statistics of input ensembles and the response of simple model neurons. Neural Comp., 15, 67–101.
Kuhn, A., Aertsen, A., & Rotter, S. (2004). Neuronal integration of synaptic input in the fluctuation-driven regime. J. Neurosci., 24, 2345–2356.
La Camera, G., Senn, W., & Fusi, S. (2004). Comparison between networks of conductance- and current-driven neurons: Stationary spike rates and subthreshold depolarization. Neurocomputing, 58–60, 253–258.
Lansky, P., & Lanska, V. (1987). Diffusion approximation of the neuronal model with synaptic reversal potentials. Biol. Cybern., 56, 19–26.
Manwani, A., & Koch, C. (1999). Detecting and estimating signals in noisy cable structures, I: Neuronal noise sources. Neural Comp., 11, 1797–1829.
Meffin, H., Burkitt, A. N., & Grayden, D. B. (2004). An analytical model for the "large, fluctuating synaptic conductance state" typical of neocortical neurons in vivo. J. Comput. Neurosci., 16, 159–175.
Monier, C., Chavane, F., Baudot, P., Graham, L. J., & Frégnac, Y. (2003). Orientation and direction selectivity of synaptic inputs in visual cortical neurons: A diversity of combinations produces spike tuning. Neuron, 37, 663–680.
Prescott, S. A., & De Koninck, Y. (2003). Gain control of firing rate by shunting inhibition: Roles of synaptic noise and dendritic saturation. P. Natl. Acad. Sci., 100, 2076–2081.
Rauch, A., La Camera, G., Lüscher, H.-R., Senn, W., & Fusi, S. (2003). Neocortical pyramidal cells respond as integrate-and-fire neurons to in vivo–like input currents. J. Neurophysiol., 90, 1598–1612.
Richardson, M. J. E. (2004). Effects of synaptic conductance on the voltage distribution and firing rate of spiking neurons. Phys. Rev. E, 69, 051918.
Risken, H. (1996). The Fokker-Planck equation. New York: Springer-Verlag.
Rodriguez, M. A., Pesquera, L., San Miguel, M., & Sancho, J. M. (1985). Master equation description of external Poisson white noise in finite systems. J. Stat. Phys., 40, 669–724.
Rubin, J., Lee, D. D., & Sompolinsky, H. (2001). Equilibrium properties of temporally asymmetric Hebbian plasticity. Phys. Rev. Lett., 86, 364–367.
Rudolph, M., & Destexhe, A. (2003). Characterization of subthreshold voltage fluctuations in neuronal membranes. Neural Comput., 15, 2577–2618.
Rudolph, M., Piwkowska, Z., Badoual, M., Bal, T., & Destexhe, A. (2004). A method to estimate synaptic conductances from membrane potential fluctuations. J. Neurophysiol., 91, 2884–2896.
Sanchez-Vives, M. V., & McCormick, D. A. (2000). Cellular and network mechanisms of rhythmic recurrent activity in neocortex. Nat. Neurosci., 3, 1027–1034.
Silberberg, G., Wu, C. Z., & Markram, H. (2004). Synaptic dynamics control the timing of neuronal excitation in the activated neocortical microcircuit. J. Physiol. (London), 556, 19–27.
Stein, R. B. (1965). A theoretical analysis of neuronal activity. Biophys. J., 5, 173–193.
Stein, R. B. (1967). Some models of neuronal variability. Biophys. J., 7, 37–68.
Stroeve, S., & Gielen, S. (2001). Correlation between uncoupled conductance-based integrate-and-fire neurons due to common and synchronous presynaptic firing. Neural. Comp., 13, 2005–2029.
Tiesinga, P. H. E., José, J. V., & Sejnowski, T. J. (2000). Comparison of current-driven and conductance-driven neocortical model neurons with Hodgkin-Huxley voltage-gated currents. Phys. Rev. E, 62, 8413–8419.
Tuckwell, H. C. (1979). Synaptic transmission in a model for stochastic neural activity. J. Theor. Biol., 77, 65–81.
Tuckwell, H. C. (1989). Stochastic processes in the neurosciences. Philadelphia: SIAM.
van Rossum, M. C. W., Bi, G. Q., & Turrigiano, G. C. (2000). Stable Hebbian learning from spike timing-dependent plasticity. J. Neurosci., 20, 8812–8821.
Wan, F. Y. M., & Tuckwell, H. C. (1979). The response of a spatially distributed neuron to white noise current injection. Biol. Cybern., 33, 39–55.
Wilbur, W. J., & Rinzel, J. (1983). A theoretical basis for large coefficient of variation and bimodality in neuronal interspike distribution. J. Theor. Biol., 105, 345–368.
Received August 4, 2004; accepted October 1, 2004.
LETTER
Communicated by Michael Cohen
Dynamical Behaviors of a Large Class of General Delayed Neural Networks Tianping Chen
[email protected] Wenlian Lu
[email protected] Laboratory of Nonlinear Mathematics Science, Institute of Mathematics, Fudan University, Shanghai, 200433, China
Guanrong Chen
[email protected] Electronic Engineering Department, City University of Hong Kong, Kowloon, Hong Kong, China
Research of delayed neural networks with varying self-inhibitions, interconnection weights, and inputs is an important issue. In the real world, self-inhibitions, interconnection weights, and inputs should vary as time varies. In this letter, we discuss a large class of delayed neural networks with periodic self-inhibitions, interconnection weights, and inputs. We prove that if the activation functions are of Lipschitz type and some set of inequalities, for example, the set of inequalities 3.1 in theorem 1, is satisfied, the delayed system has a unique periodic solution, and any solution will converge to this periodic solution. We also prove that if either the set of inequalities 3.20 in theorem 2 or 3.23 in theorem 3 is satisfied, then the system is globally exponentially stable. This class of delayed dynamical systems provides a general framework for many delayed dynamical systems. As special cases, it includes delayed Hopfield neural networks and cellular neural networks as well as distributed delayed neural networks with periodic self-inhibitions, interconnection weights, and inputs. Moreover, the entire discussion applies to delayed systems with constant self-inhibitions, interconnection weights, and inputs.

1 Introduction

Recurrently connected neural networks, sometimes called Grossberg-Hopfield neural networks, have been extensively studied, and many applications in different areas have been found. It is well known that many applications depend heavily on the dynamical behaviors of the networks. Therefore, analysis of these dynamic behaviors is a necessary step toward the practical design of these neural networks.

Neural Computation 17, 949–968 (2005)
© 2005 Massachusetts Institute of Technology
950
T. Chen, W. Lu, and G. Chen
A recurrently connected neural network is described by the following differential equations:

$$\frac{du_i(t)}{dt} = -d_i u_i(t) + \sum_{j=1}^{n} a_{ij}\,g_j(u_j(t)) + I_i, \qquad i = 1, \dots, n,$$
(1.1)
where the g_j(x) are activation functions, d_i, a_ij are constants, and I_i are constant inputs. The dynamical behaviors of recurrently connected neural networks have been studied since the early period of research on neural networks. For example, multistable and oscillatory behaviors were studied by Amari (1971, 1972) and Wilson and Cowan (1972). Chaotic behaviors were studied by Sompolinsky and Crisanti (1988). Hopfield and Tank (1984, 1986) studied the dynamical stability of symmetrically connected networks and showed their practical applicability to optimization problems. Cohen and Grossberg (1983) gave more rigorous results on the global stability of the neural networks. A number of researchers have studied local and global stability of asymmetrically connected networks (e.g., Hirsch, 1989; Matsuoka, 1992; Michel & Gray, 1990; Chen & Amari, 2001; Chen, Lu, & Amari, 2002). Wang and Zou (2002) and Lu and Chen (2003) discussed the global stability of Cohen-Grossberg neural networks.

In practice, the system is unsynchronized. Therefore, one often needs to investigate the following delayed dynamical systems with fixed delays or time-varying delays:

$$\frac{du_i(t)}{dt} = -d_i u_i(t) + \sum_{j=1}^{n} a_{ij}\,g_j(u_j(t)) + \sum_{j=1}^{n} b_{ij}\,f_j(u_j(t-\tau_{ij})) + I_i, \qquad i = 1, \dots, n.$$
(1.2)
There are also many articles discussing the dynamical behaviors of delayed neural networks. The delayed system (see equation 1.2) was investigated by Belair (1993), Belair, Campbell, and van den Driessche (1996), Cao and Wu (1996), Cao and Zhou (1998), Cao (2000), Chen and Amari (2001), Chen (2001), Gopalsamy and He (1991), Marcus and Westervelt (1989), Zhang and Jin (2000), and others. In the literature, various conditions ensuring global and exponential stability have been given. For more general delayed functional-differential equations, see the early work by Hale (1965) and Miller (1965).
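A system of the form of equation 1.2 is straightforward to integrate numerically. The following sketch simulates a hypothetical two-neuron network with tanh activations; all weights, delays, and inputs are arbitrary illustrative choices (not taken from this letter), picked so that the self-inhibitions d_i dominate and the trajectory settles to an equilibrium:

```python
import numpy as np

# Hypothetical 2-neuron instance of equation 1.2; all values are
# illustrative choices, not parameters from this letter.
n = 2
d   = np.array([2.0, 2.0])                  # self-inhibitions d_i
A   = np.array([[0.2, -0.3], [0.1, 0.2]])   # instantaneous weights a_ij
B   = np.array([[0.1, 0.05], [-0.1, 0.1]])  # delayed weights b_ij
I   = np.array([0.5, -0.2])                 # constant inputs I_i
tau = np.array([[0.5, 0.3], [0.2, 0.4]])    # delays tau_ij
g = f = np.tanh                             # activation functions

dt, T = 0.001, 20.0
steps = int(T / dt)
lag = np.rint(tau / dt).astype(int)         # delays measured in time steps
max_lag = lag.max()

u = np.zeros((steps + max_lag + 1, n))      # zero initial history
for k in range(steps):
    t = k + max_lag
    delayed = np.array([
        sum(B[i, j] * f(u[t - lag[i, j], j]) for j in range(n))
        for i in range(n)
    ])
    u[t + 1] = u[t] + dt * (-d * u[t] + A @ g(u[t]) + delayed + I)

u_final = u[-1]  # the trajectory settles near an equilibrium here
```

The history buffer holding the last max(τ_ij)/dt states is the discrete analogue of the initial function a delay differential equation requires; with constant coefficients, the equilibrium reached is the constant "periodic" solution of the theory below.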
On a Large Class of Delayed Neural Networks
951
In this letter, we discuss a large class of delayed dynamical systems, which can be represented as

$$\frac{du_i}{dt} = -d_i(t)u_i(t) + \sum_{j=1}^{n} a_{ij}(t)\,g_j(u_j(t)) + \sum_{j=1}^{n}\int_0^\infty f_j(u_j(t-\tau_{ij}-s))\,d_sK_{ij}(t,s) + I_i(t)$$
(1.3)

or

$$\frac{du_i}{dt} = -d_i(t)u_i(t) + \sum_{j=1}^{n} a_{ij}(t)\,g_j(u_j(t)) + \sum_{j=1}^{n}\int_0^\infty f_j(u_j(t-\tau_{ij}(t)-s))\,d_sK_{ij}(t,s) + I_i(t),$$
(1.4)
where, for any fixed t, the $d_sK_{ij}(t,s)$ are Lebesgue-Stieltjes measures, which satisfy $d_sK_{ij}(t+\omega,s) = d_sK_{ij}(t,s)$, and $d_i(t), a_{ij}(t), b_{ij}(t), I_i(t), \tau_{ij}(t): R^+ \to R$ are continuous periodic functions with period ω > 0; that is, $d_i(t+\omega) = d_i(t)$, $a_{ij}(t) = a_{ij}(t+\omega)$, $b_{ij}(t) = b_{ij}(t+\omega)$, $I_i(t) = I_i(t+\omega)$, and $\tau_{ij}(t+\omega) = \tau_{ij}(t)$ for all t > 0 and i, j = 1, 2, ..., n. In section 3, we investigate the existence of a periodic solution for systems 1.3 and 1.4 and its global stability and exponential stability.

2 Some Definitions and Assumptions

In this section, we give some definitions and assumptions that are used throughout the letter.

Definition 1. Lip(G) denotes the class of functions with Lipschitz constant G > 0.

Definition 2. The following two norms are defined for every t ∈ (−∞, ∞):

1. {ξ, 1}-norm: $\|u(t)\|_{\{\xi,1\}} = \sum_i |\xi_i u_i(t)|$, where $\xi_i > 0$, i = 1, ..., n.
2. {ξ, ∞}-norm: $\|u(t)\|_{\{\xi,\infty\}} = \max_i |\xi_i^{-1} u_i(t)|$, where $\xi_i > 0$, i = 1, ..., n.

Definition 3. For any real number a, denote $a^+ = \max\{a, 0\}$.

Throughout this article, we assume $|d_sK_{ij}(t,s)| \le |d\bar K_{ij}(s)|$, where $d\bar K_{ij}(s)$ is a Lebesgue-Stieltjes measure independent of t that satisfies $\int_0^\infty s\,|d\bar K_{ij}(s)| < +\infty$, and that $d_i(t) > d_i > 0$ with the $d_i$ constants and $|a^*_{ij}| = \sup_{0\le t\le\omega}|a_{ij}(t)|$.

For any $t_1 > 0$, $t_2 > 0$, we have
$$\lim_{m\to\infty}\int_{t_1}^{t_2}|g(u_j(t+m\omega)) - g(v_j(t))|\,dt \le G_j\lim_{m\to\infty}\int_{t_1}^{t_2}|u_j(t+m\omega) - v_j(t)|\,dt = 0,$$
(3.12)

$$\lim_{m\to\infty}\int_{t_1}^{t_2}|f(u_j(t+m\omega)) - f(v_j(t))|\,dt \le F_j\lim_{m\to\infty}\int_{t_1}^{t_2}|u_j(t+m\omega) - v_j(t)|\,dt = 0,$$
(3.13)
$$\lim_{m\to\infty}\int_0^\infty|d_s\bar K_{ij}(s)|\int_{t_1}^{t_2}|f(u_j(t+m\omega-\tau_{ij}-s)) - f(v_j(t-\tau_{ij}-s))|\,dt$$
$$\le \int_0^\infty|d_s\bar K_{ij}(s)|\lim_{m\to\infty}\int_{t_1}^{t_2}|f(u_j(t+m\omega-\tau_{ij}-s)) - f(v_j(t-\tau_{ij}-s))|\,dt = 0.$$
(3.14)
Let $t_1 > 0$, $t_2 > 0$ be two points where equation 3.11 holds. Then,

$$v_i(t_1) - v_i(t_2) = \lim_{m_k\to\infty}\{u_i(t_1+m_k\omega) - u_i(t_2+m_k\omega)\}$$
$$= \lim_{m_k\to\infty}\int_{t_2}^{t_1}\Big[-d_i(t)u_i(t+m_k\omega) + \sum_{j=1}^{n} a_{ij}(t)\,g_j(u_j(t+m_k\omega)) + \sum_{j=1}^{n}\int_0^\infty f_j(u_j(t+m_k\omega-\tau_{ij}-s))\,d_sK_{ij}(t,s) + I_i(t)\Big]dt$$
$$= \int_{t_2}^{t_1}\Big[-d_i(t)v_i(t) + \sum_{j=1}^{n} a_{ij}(t)\,g_j(v_j(t)) + \sum_{j=1}^{n}\int_0^\infty f_j(v_j(t-\tau_{ij}-s))\,d_sK_{ij}(t,s) + I_i(t)\Big]dt.$$
(3.15)

Therefore, $v_i(t)$ is absolutely continuous on the set where equation 3.11 is valid. Modifying $v_i(t)$ at those points t where equation 3.11 might be invalid (a set of Lebesgue measure 0) by continuity, we conclude that every $v_i(t)$ is absolutely continuous and

$$\frac{dv_i(t)}{dt} = -d_i(t)v_i(t) + \sum_{j=1}^{n} a_{ij}(t)\,g_j(v_j(t)) + \sum_{j=1}^{n}\int_0^\infty f_j(v_j(t-\tau_{ij}-s))\,d_sK_{ij}(t,s) + I_i(t).$$
(3.16)
In summary, $v(t) = [v_1(t), \dots, v_n(t)]$ is a periodic solution of the dynamical system 1.3. Now we will prove that for every solution $u(t) = [u_1(t), \dots, u_n(t)]$, we have

$$\lim_{t\to\infty} u_i(t) = v_i(t).$$
(3.17)
In fact, define $u_i^{**}(t) = u_i(t) - v_i(t)$, $g_j^{**}(u_j(t)) = g_j(u_j(t)) - g_j(v_j(t))$, and $f_j^{**}(u_j(t)) = f_j(u_j(t)) - f_j(v_j(t))$, i = 1, ..., n, and construct a Lyapunov function,

$$L_1^*(t) = \sum_{i=1}^{n}\xi_i\,|u_i^{**}(t)| + \sum_{i,j=1}^{n}\xi_i\int_0^\infty\int_{t-\tau_{ij}-s}^{t}|f_j^{**}(u_j(y))|\,dy\,|d\bar K_{ij}(s)|.$$
(3.18)

Replacing $u_i(t+\omega) - u_i(t)$ by $u_i^{**}(t)$, $g_j^*(u_j(t))$ by $g_j^{**}(u_j(t))$, and $f_j^*(u_j(t))$ by $f_j^{**}(u_j(t))$, and using the same argument as before, we can prove

$$\lim_{t\to\infty}\,[u_i(t) - v_i(t)] = 0.$$
(3.19)
In the following, we investigate the exponential stability of the delayed networks 1.3 and 1.4. We prove two theorems.

Theorem 2. Suppose that $\alpha > 0$, $G_i, F_i > 0$, $g_i(\cdot) \in \mathrm{Lip}(G_i)$ and is nondecreasing, $f_i(\cdot) \in \mathrm{Lip}(F_i)$, and $\int_0^\infty e^{\alpha s}|d\bar K_{ij}(s)| < \infty$, i, j = 1, 2, ..., n. If there are positive constants $\xi_1, \dots, \xi_n$ such that

$$\xi_j(-d_j+\alpha) + \Big(\xi_j a_{jj}(t) + \sum_{i\ne j}\xi_i|a^*_{ij}|\Big)G_j + \sum_{i=1}^{n}\xi_i F_j\int_0^\infty e^{\alpha(s+\tau_{ij})}|d\bar K_{ij}(s)| \le 0, \qquad j = 1, \dots, n,$$
(3.20)

then the dynamical system 1.3 has a unique periodic solution $v(t) = [v_1(t), \dots, v_n(t)]$, and for any solution $u(t)$ of equation 1.3, there holds

$$\|u(t+j\omega) - v(t)\|_{\{\xi,1\}} = O(e^{-j\omega\alpha}).$$
(3.21)

Proof. Let $u_i^*(t) = u_i(t+\omega) - u_i(t)$, $g_i^*(u_i(t)) = g_i(u_i(t+\omega)) - g_i(u_i(t))$, and $f_i^*(u_i(t)) = f_i(u_i(t+\omega)) - f_i(u_i(t))$. Direct calculations lead to the following equations:

$$\frac{d}{dt}\{e^{\alpha t}u_i^*(t)\} = e^{\alpha t}\Big[-d_i(t)u_i^*(t) + \alpha u_i^*(t) + \sum_{j=1}^{n} a_{ij}(t)\,g_j^*(u_j(t)) + \sum_{j=1}^{n}\int_0^\infty f_j^*(u_j(t-\tau_{ij}-s))\,d_sK_{ij}(t,s)\Big], \qquad i = 1, \dots, n.$$
Define a Lyapunov function by

$$L_2(t) = \sum_{i=1}^{n}\xi_i\,|e^{\alpha t}u_i^*(t)| + \sum_{i,j=1}^{n}\xi_i\int_0^\infty\int_{t-\tau_{ij}-s}^{t} e^{\alpha(s+y+\tau_{ij})}\,|f_j^*(u_j(y))|\,dy\,|d\bar K_{ij}(s)|.$$
Differentiating L_2(t), we have
\[
\begin{aligned}
\dot L_2(t) &= \sum_{i=1}^n \xi_i\,\mathrm{sign}\{u_i^*(t)\}\, e^{\alpha t}\Big[-d_i(t)u_i^*(t) + \alpha u_i^*(t) + \sum_{j=1}^n a_{ij}(t)\, g_j^*(u_j(t)) + \sum_{j=1}^n \int_0^\infty f_j^*(u_j(t-\tau_{ij}-s))\,d_s K_{ij}(t,s)\Big]\\
&\quad + \sum_{i,j=1}^n \xi_i \int_0^\infty e^{\alpha(s+t+\tau_{ij})} |f_j^*(u_j(t))|\,|d\bar K_{ij}(s)| - \sum_{i,j=1}^n \xi_i \int_0^\infty e^{\alpha t} |f_j^*(u_j(t-\tau_{ij}-s))|\,|d\bar K_{ij}(s)|\\
&\le \sum_{i=1}^n \xi_i e^{\alpha t}\Big[(-d_i+\alpha)|u_i^*(t)| + a_{ii}(t)\,\mathrm{sign}\{u_i^*(t)\}\, g_i^*(u_i(t)) + \mathrm{sign}\{u_i^*(t)\}\sum_{j\neq i} |a_{ij}^*|\, g_j^*(u_j(t))\Big]\\
&\quad + \sum_{i,j=1}^n \xi_i |f_j^*(u_j(t))|\, e^{\alpha t} \int_0^\infty e^{\alpha(s+\tau_{ij})} |d\bar K_{ij}(s)|\\
&= \sum_{j=1}^n e^{\alpha t}\Big[\xi_j(-d_j+\alpha)|u_j^*(t)| + \Big(\xi_j a_{jj}(t) + \sum_{i\neq j}\xi_i|a_{ij}^*|\Big)\mathrm{sign}\{u_j^*(t)\}\, g_j^*(u_j(t)) + \sum_{i=1}^n \xi_i |f_j^*(u_j(t))| \int_0^\infty e^{\alpha(s+\tau_{ij})} |d\bar K_{ij}(s)|\Big]\\
&\le \sum_{j=1}^n \Big[\xi_j(-d_j+\alpha) + \Big(\xi_j a_{jj}(t) + \sum_{i\neq j}\xi_i|a_{ij}^*|\Big)G_j + \sum_{i=1}^n \xi_i F_j \int_0^\infty e^{\alpha(s+\tau_{ij})} |d\bar K_{ij}(s)|\Big] e^{\alpha t} |u_j^*(t)| \le 0.
\end{aligned}
\]
Therefore, L_2(t) is bounded, and
\[
\sum_{i=1}^n \xi_i |u_i(t+\omega) - u_i(t)| = O(e^{-\alpha t}). \tag{3.22}
\]
Now define a function v(t) = [v_1(t), \dots, v_n(t)]^T by v_i(t) = \lim_{j\to\infty} u_i(t+j\omega). Because of
\[
u_i(t+j\omega) = u_i(t) + \sum_{k=1}^{j} \{u_i(t+k\omega) - u_i(t+(k-1)\omega)\}
\]
and equation 3.22, v(t) is well defined and is a periodic function with period \omega. Moreover,
\[
\|u(t+j\omega) - v(t)\|_{\{\xi,1\}} = O(e^{-j\omega\alpha}) \quad \text{as } j \to \infty.
\]
Theorem 3. Suppose that G_i, F_i > 0, g_i(\cdot) \in Lip(G_i) and is nondecreasing, f_i(\cdot) \in Lip(F_i), and \int_0^\infty e^{\alpha s}|d\bar K_{ij}(s)| < \infty, i, j = 1, 2, \dots, n. If there are positive constants \xi_1, \dots, \xi_n and \alpha > 0 such that, for i = 1, 2, \dots, n,
\[
-(d_i-\alpha)\xi_i + \xi_i \{a_{ii}(t)\}^+ + \sum_{j\neq i} |a_{ij}^*| G_j \xi_j + \sum_{j=1}^n \xi_j F_j \int_0^\infty e^{\alpha(s+\tau_{ij}^*)} |d\bar K_{ij}(s)| \le 0, \tag{3.23}
\]
then the dynamical system 1.4 has a unique \omega-periodic solution v(t) = [v_1(t), \dots, v_n(t)], and for any solution u(t) = [u_1(t), \dots, u_n(t)] of equation 1.4, there holds
\[
\|u(t+j\omega) - v(t)\|_{\{\xi,\infty\}} = O(e^{-j\omega\alpha}) \quad \text{as } j \to \infty. \tag{3.24}
\]
Proof. Let C = C((-\infty, 0], R^n) be the Banach space of continuous functions mapping (-\infty, 0] into R^n with norm
\[
\|\varphi\| = \sup_{-\infty < \theta \le 0} \|\varphi(\theta)\|_{\{\xi,\infty\}}. \tag{3.25}
\]
Define z_i(t) = u_i(t+\omega) - u_i(t), g_i^*(z_i(t)) = g_i(u_i(t+\omega)) - g_i(u_i(t)), f_i^*(z_i(t)) = f_i(u_i(t+\omega)) - f_i(u_i(t)), and
\[
M_1(t) = \sup_{0 \le s < \infty} e^{\alpha(t-s)} \|z(t-s)\|_{\{\xi,\infty\}}.
\]
It is clear that e^{\alpha t}\|z(t)\|_{\{\xi,\infty\}} \le M_1(t) and that M_1(t) is nondecreasing. In the following, we will prove that M_1(t) = M_1(0) for all t > 0. In fact, for every t_0 > 0, there are two possible cases:

Case 1. e^{\alpha t_0}\|z(t_0)\|_{\{\xi,\infty\}} < M_1(t_0). In this case, M_1(t) is not increasing at t = t_0.

Case 2. e^{\alpha t_0}\|z(t_0)\|_{\{\xi,\infty\}} = M_1(t_0). In this case, let i_{t_0} be an index such that
\[
\xi_{i_{t_0}}^{-1} e^{\alpha t_0} |z_{i_{t_0}}(t_0)| = e^{\alpha t_0} \|z(t_0)\|_{\{\xi,\infty\}} = M_1(t_0).
\]
Then we have
\[
\begin{aligned}
\frac{d|e^{\alpha t} z_{i_{t_0}}(t)|}{dt}\Big|_{t=t_0} &= \mathrm{sign}(z_{i_{t_0}}(t_0))\, e^{\alpha t_0}\Big[\alpha z_{i_{t_0}}(t_0) - d_{i_{t_0}}(t_0)\big(u_{i_{t_0}}(t_0+\omega) - u_{i_{t_0}}(t_0)\big)\\
&\qquad + \sum_{j=1}^n a_{i_{t_0} j}(t_0)\, g_j^*(z_j(t_0)) + \sum_{j=1}^n \int_0^\infty f_j^*\big(z_j(t_0 - \tau_{i_{t_0} j}(t_0) - s)\big)\, d_s K_{i_{t_0} j}(t_0, s)\Big]\\
&\le e^{\alpha t_0}\Big[\big(-(d_{i_{t_0}} - \alpha) + a^+_{i_{t_0} i_{t_0}}(t_0)\big)|z_{i_{t_0}}(t_0)| + \sum_{j\neq i_{t_0}} |a^*_{i_{t_0} j}| G_j |z_j(t_0)| + \sum_{j=1}^n F_j \int_0^\infty \big|z_j(t_0 - \tau_{i_{t_0} j}(t_0) - s)\big|\,|d\bar K_{i_{t_0} j}(s)|\Big]\\
&\le \Big[-(d_{i_{t_0}} - \alpha)\xi_{i_{t_0}} + \xi_{i_{t_0}} a^+_{i_{t_0} i_{t_0}}(t_0) + \sum_{j\neq i_{t_0}} \xi_j |a^*_{i_{t_0} j}| G_j\Big] e^{\alpha t_0} \|z(t_0)\|_{\{\xi,\infty\}}\\
&\qquad + \sum_{j=1}^n \xi_j F_j \int_0^\infty e^{\alpha(s+\tau_{i_{t_0} j}(t_0))} |d\bar K_{i_{t_0} j}(s)|\; e^{\alpha(t_0 - \tau_{i_{t_0} j}(t_0) - s)} \|z(t_0 - \tau_{i_{t_0} j}(t_0) - s)\|_{\{\xi,\infty\}}\\
&\le \Big[-(d_{i_{t_0}} - \alpha)\xi_{i_{t_0}} + \xi_{i_{t_0}} a^+_{i_{t_0} i_{t_0}}(t_0) + \sum_{j\neq i_{t_0}} \xi_j |a^*_{i_{t_0} j}| G_j + \sum_{j=1}^n \xi_j F_j e^{\alpha \tau^*_{i_{t_0} j}} \int_0^\infty e^{\alpha s} |d\bar K_{i_{t_0} j}(s)|\Big] M_1(t_0) \le 0.
\end{aligned}
\tag{3.26}
\]
Therefore, M_1(t) is not increasing at t = t_0. In summary, we have M_1(t) = M_1(0) for all t > 0. Therefore,
\[
\|u(t+(j+1)\omega) - u(t+j\omega)\|_{\{\xi,\infty\}} \le e^{-j\omega\alpha} \sup_{s\in(-\infty,0]} \|u(s+\omega) - u(s)\|_{\{\xi,\infty\}}. \tag{3.27}
\]
Similar to the proof of the previous theorem, define a periodic function v(t) = [v_1(t), \dots, v_n(t)]^T by v_i(t) = \lim_{j\to\infty} u_i(t+j\omega). We know that v(t) is well defined and
\[
\|u(t+j\omega) - v(t)\|_{\{\xi,\infty\}} = O(e^{-j\omega\alpha}) \quad \text{as } j \to \infty.
\]
As a special case, let d_s K_{ij}(t, 0) = b_{ij}(t) and d_s K_{ij}(t, s) = 0 for s \neq 0. Then dynamical systems 1.3 and 1.4 reduce to
\[
\frac{du_i(t)}{dt} = -d_i(t)u_i(t) + \sum_{j=1}^n a_{ij}(t)\, g_j(u_j(t)) + \sum_{j=1}^n b_{ij}(t)\, f_j(u_j(t-\tau_{ij})) + I_i(t), \quad i = 1, \dots, n, \tag{3.28}
\]
and
\[
\frac{du_i(t)}{dt} = -d_i(t)u_i(t) + \sum_{j=1}^n a_{ij}(t)\, g_j(u_j(t)) + \sum_{j=1}^n b_{ij}(t)\, f_j(u_j(t-\tau_{ij}(t))) + I_i(t), \quad i = 1, \dots, n, \tag{3.29}
\]
respectively. It should be noted that model 3.28 has been investigated in detail by Zhou, Liu, and Chen (2004) in terms of the L^2 norm and by Zheng and Chen (2004) in terms of the L^p norm. In this special case, d\bar K_{ij}(0) = |b_{ij}^*|, d\bar K_{ij}(s) = 0 for s \neq 0, and
\[
\int_0^\infty e^{\alpha(s+\tau_{ij}^*)} |d\bar K_{ij}(s)| = e^{\alpha \tau_{ij}^*} |b_{ij}^*|.
\]
Therefore, as consequences of theorems 2 and 3, we obtain the following two theorems.

Theorem 4. Suppose that G_i, F_i > 0, g_i(\cdot) \in Lip(G_i) and is nondecreasing, f_i(\cdot) \in Lip(F_i), i = 1, 2, \dots, n. If there are positive constants \xi_1, \dots, \xi_n such that
\[
-\xi_j(d_j-\alpha) + G_j\Big(\xi_j a_{jj}(t) + \sum_{i\neq j} \xi_i |a_{ij}^*|\Big) + F_j \sum_{i=1}^n \xi_i e^{\alpha\tau_{ij}} |b_{ij}^*| \le 0, \quad j = 1, \dots, n, \tag{3.30}
\]
or, in particular,
\[
-\xi_j(d_j-\alpha) + G_j \sum_{i=1}^n \xi_i |a_{ij}^*| + F_j \sum_{i=1}^n \xi_i e^{\alpha\tau_{ij}} |b_{ij}^*| \le 0, \quad j = 1, \dots, n, \tag{3.31}
\]
then the dynamical system 3.28 has a unique periodic solution v(t) = [v_1(t), \dots, v_n(t)], and for any solution u(t) = [u_1(t), \dots, u_n(t)] of that equation,
\[
u_i(t) - v_i(t) = O(e^{-\alpha t}), \quad i = 1, \dots, n. \tag{3.32}
\]
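Criterion 3.30 is easy to check mechanically when the coefficients are constant. The sketch below tests whether a given \alpha is admissible; the two-neuron parameter values are hypothetical illustrations, not taken from the text.

```python
import math

def theorem4_condition(d, a, a_star, b_star, tau, G, F, xi, alpha):
    """Check criterion 3.30 for every column j (constant coefficients).

    d, G, F, xi : length-n sequences; a, a_star, b_star, tau : n x n matrices.
    a[j][j] enters with its sign; off-diagonal entries enter as |a_star[i][j]|.
    """
    n = len(d)
    ok = True
    for j in range(n):
        lhs = -xi[j] * (d[j] - alpha)
        lhs += G[j] * (xi[j] * a[j][j]
                       + sum(xi[i] * abs(a_star[i][j]) for i in range(n) if i != j))
        lhs += F[j] * sum(xi[i] * math.exp(alpha * tau[i][j]) * abs(b_star[i][j])
                          for i in range(n))
        ok = ok and lhs <= 0.0
    return ok

# Hypothetical two-neuron parameters (for illustration only).
d = [2.0, 1.0]
a = [[-2.0, 0.0], [0.0, -1.0]]
a_star = [[0.0, 1.0], [2.0, 0.0]]
b_star = [[1.0, 0.5], [2.0 / 3.0, 1.0 / 3.0]]
tau = [[1.0, 2.0], [1.0, 2.0]]
G = [1.0, 1.0]
F = [1.0, 1.0]
xi = [1.0, 1.0]

print(theorem4_condition(d, a, a_star, b_star, tau, G, F, xi, alpha=0.05))
```

For these numbers a small rate such as \alpha = 0.05 satisfies the criterion, while larger rates eventually violate it, which mirrors the trade-off between the decay rate and the delayed-gain terms in 3.30.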
Theorem 5. Suppose that G_i, F_i > 0 for all i = 1, 2, \dots, n, g(x) = (g_1(x), \dots, g_n(x))^T, where g_i(\cdot) \in Lip(G_i) is nondecreasing, and f(x) = (f_1(x), \dots, f_n(x))^T, where f_i(\cdot) \in Lip(F_i). If there are positive constants \xi_1, \dots, \xi_n and \alpha > 0 such that
\[
-(d_i-\alpha)\xi_i + \xi_i a_{ii}^+(t) + \sum_{j\neq i} \xi_j |a_{ij}^*| G_j + \sum_{j=1}^n \xi_j e^{\alpha\tau_{ij}^*} |b_{ij}^*| F_j \le 0, \quad i = 1, 2, \dots, n, \tag{3.33}
\]
or, in particular,
\[
-(d_i-\alpha)\xi_i + \sum_{j=1}^n |a_{ij}^*| G_j \xi_j + \sum_{j=1}^n \xi_j e^{\alpha\tau_{ij}^*} |b_{ij}^*| F_j \le 0, \quad i = 1, 2, \dots, n, \tag{3.34}
\]
then the dynamical system 3.29 has a unique \omega-periodic solution v(t) = [v_1(t), \dots, v_n(t)], and for any solution u(t) = [u_1(t), \dots, u_n(t)] of that equation,
\[
u_i(t) - v_i(t) = O(e^{-\alpha t}), \quad i = 1, \dots, n. \tag{3.35}
\]
Suppose that d_s K_{ij}(t,s) = b_{ij}(t)\, k_{ij}(s)\,ds. Then dynamical systems 1.3 and 1.4 reduce to the following dynamical systems with distributed delays:
\[
\frac{du_i(t)}{dt} = -d_i(t)u_i(t) + \sum_{j=1}^n a_{ij}(t)\, g_j(u_j(t)) + \sum_{j=1}^n b_{ij}(t) \int_0^\infty k_{ij}(s)\, f_j(u_j(t-\tau_{ij}-s))\,ds + I_i(t) \tag{3.36}
\]
and
\[
\frac{du_i(t)}{dt} = -d_i(t)u_i(t) + \sum_{j=1}^n a_{ij}(t)\, g_j(u_j(t)) + \sum_{j=1}^n b_{ij}(t) \int_0^\infty k_{ij}(s)\, f_j(u_j(t-\tau_{ij}(t)-s))\,ds + I_i(t), \tag{3.37}
\]
respectively.
In this special case, d\bar K_{ij}(s) = |b_{ij}^*|\,|k_{ij}(s)|\,ds, and
\[
\int_0^\infty e^{\alpha(s+\tau_{ij}^*)} |d\bar K_{ij}(s)| = |b_{ij}^*| \int_0^\infty e^{\alpha(s+\tau_{ij}^*)} |k_{ij}(s)|\,ds.
\]
Therefore, as a consequence of theorem 1, we have:

Theorem 6. Suppose that G_i, F_i > 0 for all i = 1, 2, \dots, n, g(x) = (g_1(x), \dots, g_n(x))^T, where g_i(\cdot) \in Lip(G_i) is nondecreasing, f(x) = (f_1(x), \dots, f_n(x))^T, where f_i(\cdot) \in Lip(F_i), and \int_0^\infty s|k_{ij}(s)|\,ds < \infty. If there are positive constants \xi_1, \dots, \xi_n such that
\[
-\xi_j d_j + \Big(\xi_j a_{jj}(t) + \sum_{i\neq j} \xi_i |a_{ij}^*|\Big) G_j + F_j \sum_{i=1}^n \xi_i |b_{ij}^*| \int_0^\infty |k_{ij}(s)|\,ds < 0, \quad j = 1, \dots, n, \tag{3.38}
\]
then the dynamical system 3.36 has a unique periodic solution v(t) = [v_1(t), \dots, v_n(t)], and for any solution u(t) = [u_1(t), \dots, u_n(t)] of that equation,
\[
\lim_{t\to\infty} u_i(t) = v_i(t), \quad i = 1, \dots, n. \tag{3.39}
\]
Similarly, the following two theorems can be proven as consequences of theorems 2 and 3, respectively.

Theorem 7. Suppose that G_i, F_i > 0, g_i(\cdot) \in Lip(G_i) and is nondecreasing, f_i(\cdot) \in Lip(F_i), i = 1, 2, \dots, n, and \int_0^\infty s|k_{ij}(s)|\,ds < \infty. If there are positive constants \xi_1, \dots, \xi_n such that
\[
-\xi_j(d_j-\alpha) + \Big(\xi_j a_{jj}(t) + \sum_{i\neq j} \xi_i |a_{ij}^*|\Big) G_j + F_j \sum_{i=1}^n \xi_i |b_{ij}^*| \int_0^\infty e^{\alpha(s+\tau_{ij})} |k_{ij}(s)|\,ds < 0, \quad j = 1, \dots, n, \tag{3.40}
\]
then the dynamical system 3.36 has a unique periodic solution v(t) = [v_1(t), \dots, v_n(t)], and for any solution u(t) = [u_1(t), \dots, u_n(t)] of that equation, there holds
\[
u_i(t) - v_i(t) = O(e^{-\alpha t}), \quad i = 1, \dots, n. \tag{3.41}
\]
Theorem 8. Suppose that G_i, F_i > 0, g_i(\cdot) \in Lip(G_i) and is nondecreasing, f_i(\cdot) \in Lip(F_i), i = 1, 2, \dots, n, and \int_0^\infty s|k_{ij}(s)|\,ds < \infty. If there are positive constants \xi_1, \dots, \xi_n such that
\[
-(d_i-\alpha)\xi_i + \xi_i a_{ii}^+(t) + \sum_{j\neq i} \xi_j |a_{ij}^*| G_j + \sum_{j=1}^n \xi_j F_j |b_{ij}^*| \int_0^\infty e^{\alpha(s+\tau_{ij}^*)} |k_{ij}(s)|\,ds \le 0, \quad i = 1, 2, \dots, n, \tag{3.42}
\]
then the dynamical system 3.37 has a unique periodic solution v(t) = [v_1(t), \dots, v_n(t)], and for any solution u(t) = [u_1(t), \dots, u_n(t)] of that equation, there holds
\[
u_i(t) - v_i(t) = O(e^{-\alpha t}), \quad i = 1, \dots, n. \tag{3.43}
\]
Remark. Because constants can be regarded as \omega-periodic functions with any period \omega > 0, all the theorems apply to dynamical systems in which d_i(t), a_{ij}(t), b_{ij}(t), and I_i(t) are constants. The details are omitted.

It is natural to ask whether the conditions
\[
-\xi_j d_j + \Big(\xi_j a_{jj} + \sum_{i\neq j} \xi_i |a_{ij}|\Big) G_j + \sum_{i=1}^n \xi_i \int_0^\infty |d\bar K_{ij}(s)|\, F_j < 0, \quad j = 1, \dots, n, \tag{3.44}
\]
can guarantee exponential convergence of the dynamical system 1.3. The answer is no. Let us consider the following delayed differential equation,
\[
\dot u(t) = -u(t) + \int_0^\infty u(t-s)\,dK(s), \tag{3.45}
\]
where dK(\tau) = \tfrac{1}{2} and dK(t) = 0 if t \neq \tau. The differential equation can be rewritten as
\[
\dot u(t) = -u(t) + \tfrac{1}{2}\, u(t-\tau). \tag{3.46}
\]
The solution of the differential equation is
\[
u(t) = e^{-\alpha t}, \tag{3.47}
\]
where \alpha satisfies
\[
e^{\alpha\tau} = 2(1-\alpha), \tag{3.48}
\]
or
\[
\tau = \log\{2(1-\alpha)\}/\alpha. \tag{3.49}
\]
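The dependence of \alpha on \tau can be checked numerically; the sketch below, which assumes nothing beyond equation 3.48 itself, solves e^{\alpha\tau} = 2(1-\alpha) by bisection for several delays.

```python
import math

def decay_rate(tau, tol=1e-12):
    """Solve e^{alpha*tau} = 2(1 - alpha) for alpha in (0, 1/2) by bisection.

    f(alpha) = e^{alpha*tau} - 2(1 - alpha) satisfies f(0) = -1 < 0 and
    f(1/2) = e^{tau/2} - 1 > 0, so a root exists in between.
    """
    lo, hi = 0.0, 0.5
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if math.exp(mid * tau) - 2.0 * (1.0 - mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

for tau in (1.0, 10.0, 100.0):
    print(tau, decay_rate(tau))  # the rate alpha shrinks as the delay tau grows
```

As \tau grows, the root \alpha approaches zero, which is exactly why no delay-independent exponential rate \beta can exist.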
If \tau \to \infty, then \alpha \to 0. This means that there exists no fixed \beta such that u(t) = O(e^{-\beta t}) holds for every delay \tau.

4 A Numerical Experiment

In this section, we present a numerical experiment to verify our main results. Consider the following delayed neural network with two neurons:
\[
\begin{aligned}
\dot u_1 &= -2u_1(t) - 2\tanh(u_1(t)) + \sin^2(t)\tanh(u_2(t)) + |\sin(2t)|\arctan(u_1(t-1))\\
&\quad + \tfrac{1}{2}\cos^2(t)\arctan(u_2(t-2)) + \sin(2t) + 3,\\
\dot u_2 &= -u_2(t) + 2\cos^2(t)\tanh(u_1(t)) - \tanh(u_2(t)) + \tfrac{2}{3}\sin^2(t)\arctan(u_1(t-1))\\
&\quad + \tfrac{1}{3}|\cos(t)|\arctan(u_2(t-2)) + \cos(t) + 4,
\end{aligned}
\tag{4.1}
\]
where
\[
d_1 = 2,\quad d_2 = 1,\quad a_{11} = -2,\quad a_{22} = -1,\quad a_{12}^* = 1,\quad a_{21}^* = 2,\quad b_{11}^* = 1,\quad b_{12}^* = \tfrac{1}{2},\quad b_{21}^* = \tfrac{2}{3},\quad b_{22}^* = \tfrac{1}{3},\quad G_1 = G_2 = F_1 = F_2 = 1.
\]
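The predicted convergence of all solutions of equation 4.1 to a common periodic orbit can be illustrated with a crude forward-Euler integration. This is only a sketch: the step size, horizon, and constant initial histories are arbitrary choices of ours, not values taken from the text.

```python
import math

def simulate(history_value, t_end=150.0, dt=0.01):
    """Forward-Euler integration of system 4.1 with a constant initial history."""
    n1 = int(round(1.0 / dt))   # delay tau = 1 in steps
    n2 = int(round(2.0 / dt))   # delay tau = 2 in steps
    steps = int(round(t_end / dt))
    # Index k holds u(t_k); the first n2 entries store the history on [-2, 0].
    u1 = [history_value] * (steps + n2 + 1)
    u2 = [history_value] * (steps + n2 + 1)
    for k in range(n2, n2 + steps):
        t = (k - n2) * dt
        u1d = u1[k - n1]   # u1(t - 1)
        u2d = u2[k - n2]   # u2(t - 2)
        du1 = (-2 * u1[k] - 2 * math.tanh(u1[k]) + math.sin(t) ** 2 * math.tanh(u2[k])
               + abs(math.sin(2 * t)) * math.atan(u1d)
               + 0.5 * math.cos(t) ** 2 * math.atan(u2d) + math.sin(2 * t) + 3)
        du2 = (-u2[k] + 2 * math.cos(t) ** 2 * math.tanh(u1[k]) - math.tanh(u2[k])
               + (2.0 / 3.0) * math.sin(t) ** 2 * math.atan(u1d)
               + (1.0 / 3.0) * abs(math.cos(t)) * math.atan(u2d) + math.cos(t) + 4)
        u1[k + 1] = u1[k] + dt * du1
        u2[k + 1] = u2[k] + dt * du2
    return u1[-1], u2[-1]

# Two solutions started from different constant histories approach the same
# periodic solution, so their difference at a late common time is small.
a1, a2 = simulate(0.0)
b1, b2 = simulate(3.0)
print(abs(a1 - b1), abs(a2 - b2))
```

The two trajectories end up indistinguishable at the resolution of the integration, consistent with the exponential estimates above.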
With \xi_1 = \xi_2 = 1, we have

D(q^i, q^j) = \langle q^i - q^j, q^i - q^j \rangle. Once an orthonormal basis is given, this may be written as D(q^i, q^j) = (q^i - q^j)^t M (q^i - q^j), where M is a Hermitian, positive definite matrix representing the scalar product. Condition 4 imposes symmetry among the components of the vectors, which means that M must be proportional to the unit matrix. Therefore, out of all the distances that have a scalar product associated with them, the only one that fulfills condition 4 is the Euclidean distance, apart from a scale factor fixing the units.

What should be the meaning of the maximum subjective distance? The maximum distance should be reserved for those pairs of objects that the subject distinguishes unambiguously from one another: that is, for those s_i and s_j that are never confounded with a common stimulus. Mathematically, this means that for each k, either Q(s_k|s_i) or Q(s_k|s_j) (or both) must vanish; that is, whenever Q(s_k|s_i) \neq 0, then Q(s_k|s_j) = 0 (and vice versa). In this case, whatever the response of the subject to stimulus s_i, it never coincides with his or her response to stimulus s_j. This situation corresponds to the intuitive notion of unambiguous segregation: the response of the subject to stimulus i is enough to ensure that the stimulus was not j, and, vice versa, the response to stimulus j is enough to discard stimulus i. The fifth requirement, hence, reads:

5. D(q^i, q^j) is maximal if and only if s_i and s_j are unambiguously segregated. Conversely, if D(q^i, q^j) is not maximal, then s_i and s_j are not unambiguously segregated.

Imposing requirement 5 ensures that the stimuli that are unambiguously segregated are all at the same distance, no matter any other particular characteristic of the stimuli. And conversely, if two stimuli do not elicit segregated responses, they are not allowed to be at the maximum distance.
Condition 5 establishes the cases that correspond to the maximum distance, in the same way that condition 1 does for the minimum distance. Adding the triangular inequality 3 ensures that those pairs of stimuli whose distance lies between the minimal and the maximal one are consistently ordered. Condition 5 has two important consequences. In the first place, it ensures that the clustering structure present in Q is also reflected in the matrix of distances D. In D, cross-terms between different blocks are equal not to zero but to the maximum distance. Conversely, if D shows a block structure, it may be shown that Q has the same block structure. In the second place, if the maximum distance is finite, a similarity matrix S may be defined, S = d_M I - D, where d_M is the maximum distance and I the unit matrix. The matrix S also inherits, if present, the clustering structure of Q. This correspondence between the clustering structure of D (and of S) with that of Q cannot be ensured if condition 5 is not fulfilled.
974
D. Oliva, I. Samengo, S. Leutgeb, and S. Mizumori
The Euclidean distance D_E(q^i, q^j) = \sqrt{\sum_k (q_k^i - q_k^j)^2} does not fulfill condition 5. Taking into account that the q vectors are normalized (see equation 1.1), the maximum value of the Euclidean distance between two stimuli is \sqrt{2}. It can be attained, for example, for (q^1)^t = (1, 0, 0, 0) and (q^2)^t = (0, 1, 0, 0). In this example, in fact, stimulus 1 shares no common response with stimulus 2. However, not all stimuli with no common responses lie at the maximum Euclidean distance. Consider, for example, (q^3)^t = (1/2, 1/2, 0, 0) and (q^4)^t = (0, 0, 1/2, 1/2). Though showing disjoint response sets, their Euclidean distance is equal to 1, which is less than the maximum distance. This means that none of the distances that can be associated with a scalar product is useful as a measure of subjective dissimilarity.

3 Choosing a Subjective Distance

There are still many distances fulfilling requirements 1 through 5. In what follows, a single one is selected on the basis of a maximum likelihood decoding. Imagine that someone observing the subject's responses to either stimulus i or j has to guess which of the two has been presented. For the moment, we assume for simplicity that both stimuli appear with the same frequency; this requirement will be abandoned later. We assume the observer is familiar with the confusion matrix of the subject. There are several ways in which he or she can decide between stimuli i and j given the subject's response. Here, a maximum likelihood strategy is considered, since this is the algorithm that maximizes the fraction of stimuli correctly identified. It consists of taking the choice of the subject (say, stimulus k) and deciding whether the actual stimulus was i or j on the basis of which of them has the larger q_k component. If Q(s_k|s_i) > Q(s_k|s_j), the observer chooses stimulus i; if the opposite holds, the observer chooses stimulus j.
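The decision rule just described is a one-line comparison of the two conditional distributions; a minimal sketch (the function name and tie-breaking helper are our own):

```python
import random

def ml_choice(k, qi, qj, rng=random):
    """Maximum likelihood decision between stimuli i and j given response k.

    qi[k], qj[k] are the conditional probabilities Q(s_k|s_i), Q(s_k|s_j).
    """
    if qi[k] > qj[k]:
        return "i"
    if qj[k] > qi[k]:
        return "j"
    return rng.choice(["i", "j"])   # ties are broken at random

print(ml_choice(0, (0.7, 0.3), (0.2, 0.8)))  # "i": response 0 is likelier under i
```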
If both conditional probabilities are equal, then the observer chooses either of the two stimuli, with equal probability. Under this scheme, the fraction of times the observer correctly identifies stimulus i is
\[
P(s_i|s_i) = \sum_{k:\, q_k^i > q_k^j} q_k^i + \frac{1}{2} \sum_{k:\, q_k^i = q_k^j} q_k^i. \tag{3.1}
\]
The fraction of times stimulus j is presented but the observer chooses stimulus i is
\[
P(s_i|s_j) = \sum_{k:\, q_k^i > q_k^j} q_k^j + \frac{1}{2} \sum_{k:\, q_k^i = q_k^j} q_k^j. \tag{3.2}
\]
A Subjective Distance Between Stimuli
975
Correspondingly, when the observer decides for stimulus j,
\[
P(s_j|s_j) = \sum_{k:\, q_k^j > q_k^i} q_k^j + \frac{1}{2} \sum_{k:\, q_k^i = q_k^j} q_k^j, \tag{3.3}
\]
\[
P(s_j|s_i) = \sum_{k:\, q_k^j > q_k^i} q_k^i + \frac{1}{2} \sum_{k:\, q_k^i = q_k^j} q_k^i. \tag{3.4}
\]
The distance D(q^i, q^j) is defined as the difference between the fraction of correct and incorrect maximum likelihood choices, namely,
\[
D(q^i, q^j) = \frac{1}{2}\big[P(s_i|s_i) - P(s_i|s_j)\big] + \frac{1}{2}\big[P(s_j|s_j) - P(s_j|s_i)\big] = \frac{1}{2} \sum_{k=1}^N |q_k^i - q_k^j|. \tag{3.5}
\]
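Equation 3.5 reduces to half the L1 distance between the two rows of the confusion matrix; a minimal sketch:

```python
def subjective_distance(qi, qj):
    """Equation 3.5: half the L1 distance between two conditional distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(qi, qj))

# Identical response distributions are at distance 0 (condition 1) ...
print(subjective_distance((0.5, 0.5, 0.0), (0.5, 0.5, 0.0)))  # 0.0
# ... and stimuli with disjoint response sets reach the maximum distance 1
# (condition 5), including the pair for which the Euclidean distance is not maximal:
q3 = (0.5, 0.5, 0.0, 0.0)
q4 = (0.0, 0.0, 0.5, 0.5)
print(subjective_distance(q3, q4))  # 1.0
```

The second pair is exactly the counterexample used against the Euclidean distance: D_E(q^3, q^4) = 1 < \sqrt{2}, while D(q^3, q^4) attains its maximum.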
In other words, the distance between stimulus i and stimulus j is defined in terms of the performance of the maximum likelihood decoding, assuming that the response statistics of the subject are known. This definition is easily shown to fulfill all five conditions. A distance equal to zero means that the observer is deciding at chance between the two stimuli; given that he or she uses a maximum likelihood strategy, this means that the two underlying vectors are equal. A distance equal to 1 implies that the observer always makes the right choice. In what follows, some mathematical properties of the distance D are analyzed. In order to get a geometrical flavor of D, Figure 2A shows a contour plot of the distance of all the vectors in D to the vector (1/3, 1/3, 1/3), and Figure 2B to the vector (0, 0, 1). The subjective distance D is translation invariant. That is, if the vectors q^i and q^j are displaced by a fixed vector q, the distance between them remains unchanged. Mathematically,
\[
D(q^i, q^j) = D(q^i + q, q^j + q).
\]
(3.6)
Equation 3.6 is valid for any displacement q. However, in the context here, the displacement should be such that both q^i + q and q^j + q fall inside the domain D. This means that the components of q must sum to zero, and its magnitude must be bounded (the value of the bound depends on the location of q^i and q^j). As a further characterization, the distance D between a stimulus that is perfectly identified by the subject (say, stimulus i) and another stimulus j is given. We take (q^i)^t = (1, 0, 0, \dots, 0). In this case, D(q^i, q^j) = \sum_{k=2}^N q_k^j = 1 - q_1^j. The distance between q^i and q^j is fully determined by q_1^j. It does not matter whether the probability of not selecting stimulus 1 is spread out over the last N-1 components of q^j or is entirely concentrated in a single one. This result generalizes to any vector q^i having two or more null components: the distance depends on the sum of those same components of q^j, not on their individual values. Finally, consider the case where the subject has a probability \alpha of identifying any of the stimuli correctly (Q(s_i|s_i) = \alpha), and whenever he or she makes a mistake, the error is equally distributed among all other stimuli (Q(s_i|s_j) = \beta, for all i \neq j). The normalization condition, equation 1.1, implies \alpha + (N-1)\beta = 1. In this case, D(s_i, s_j) = |\alpha - \beta|.

Figure 2: Contour plot of the distance D to a fixed vector q^i in D. (A) q^i = (1/3, 1/3, 1/3). (B) q^i = (1, 0, 0).

3.1 Extension to Continuous Stimuli. Consider the case where there is a continuous parameter x labeling the stimuli, such that stimulus i corresponds to an interval of x values ranging from i\Delta to (i+1)\Delta, with \Delta = 1/N. It is now convenient to vary i between 0 and N-1. The scale of x is chosen in such a way that its maximum value is 1. For large N, the
confusion matrix can be written as Q(s_j|s_i) = u(j\Delta|s_i)\Delta, where u(x|s_i) is a piecewise continuous probability density. The distance between stimuli i and j reads
\[
D(q^i, q^j) = \frac{1}{2} \sum_k \big|u(k\Delta|s_i) - u(k\Delta|s_j)\big|\,\Delta \;\to\; \frac{1}{2}\int_0^1 |u(x|s_i) - u(x|s_j)|\,dx, \tag{3.7}
\]
where the right-hand limit corresponds to letting the number of stimuli N tend to infinity. In other words, the distance between stimuli i and j is equal to the area between the two corresponding densities. The normalization condition for u(x|s) ensures that D lies between 0 and 1. One can easily show that its value remains invariant when a different parameterization of the variable x is used, as long as the new variable is in a one-to-one relation to x.

3.2 Extension to Stimuli with Nonuniform Prior Probabilities. One could ask whether the measure D can be extended to stimuli that are not all presented with the same probability. Let Q(s_i) denote the probability of presenting stimulus s_i. In this case, the maximum likelihood algorithm has to be replaced by a maximum a posteriori one. That is, the observer chooses stimulus s_i upon response s_k from the subject whenever p_k^i = Q(s_k|s_i)Q(s_i) is larger than p_k^j = Q(s_k|s_j)Q(s_j), and vice versa. Whenever p_k^i = p_k^j, the choice between s_i and s_j is made in proportion to the corresponding priors. In this case, the difference between the correct and incorrect fractions of maximum a posteriori estimations is
\[
D_0(q^i, q^j) = \frac{1}{Q(s_i) + Q(s_j)} \sum_k |p_k^i - p_k^j|. \tag{3.8}
\]
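A sketch of equation 3.8, with p_k^i = Q(s_k|s_i)Q(s_i) (the function name is ours):

```python
def map_distance(q_i, q_j, prior_i, prior_j):
    """Equation 3.8: distance derived from maximum a posteriori decoding."""
    num = sum(abs(qk_i * prior_i - qk_j * prior_j)
              for qk_i, qk_j in zip(q_i, q_j))
    return num / (prior_i + prior_j)

# With equal priors, D0 reduces to the equal-frequency distance D of eq. 3.5:
print(map_distance((0.2, 0.8), (0.6, 0.4), 0.5, 0.5))  # 0.4, same as D
# Disjoint response sets still reach the maximum value 1, whatever the priors:
print(map_distance((1.0, 0.0), (0.0, 1.0), 0.3, 0.7))  # 1.0
```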
This is, of course, also a valid distance between stimuli, since it fulfills requirements 1 through 5. Its maximum value is also 1, and it still carries the same block structure as Q. Notice that in this case, the distance depends not only on the subject's perception, characterized by Q(s_j|s_i), but also on the statistics of the stimuli (described by Q(s_i)).

4 Comparison with Other Measures of Dissimilarity

The distance D_0 is no more than a geometrical view of the matrix Q(s_i, s_j). It has the advantage of being a true distance, that is, of obeying conditions 1 through 3. In addition, it fulfills the symmetry constraints imposed by condition 4 and preserves the block structure in Q, as ensured by condition 5. However, there are also other distances that still obey requirements 1 through 5; the angle between the vectors q^i and q^j can be taken
as an example. The advantage of D_0 is that it has a simple interpretation in terms of maximum likelihood decoding performance. There have been previous attempts to quantify how differently two stimuli are perceived. Perhaps the most similar to ours was proposed by Green and Swets (1966) in their definition of the discriminability d' between two stimuli. Strictly speaking, their approach can be used only when the response to different stimuli is described by gaussian functions whose mean depends on the stimulus but whose variance remains fixed. They defined the discriminability d' between two stimuli as the ratio of the difference between the two corresponding mean values to the standard deviation. Can the concept of d' be extended to more general response distributions? In the gaussian case, one can make a correspondence between a given d' value and the expected fraction of errors when estimating the stimulus from the response of the subject using a maximum likelihood decoding. In terms of this correspondence, d' not only represents the distance at which the gaussian functions sit from one another but, more generally, the fraction of decoding mistakes, which is something that does not depend on the shape of the probability distribution of the responses of the subject. Large discriminability is associated with a small probability of making a mistake. The scale of the measure, however, since it is inherited from the gaussian case, has a nonlinear relationship with the fraction of mistakes. With this idea in mind, d' can be extended to nongaussian stimuli (Rieke, Warland, de Ruyter van Steveninck, & Bialek, 1997). Whatever the shape of P(s_k|s_i) and P(s_k|s_j), given the response s_k, an observer can decide in favor of stimulus s_i or s_j depending on which of them has a higher probability of eliciting response s_k. For each pair of stimuli s_i and s_j, there will be, on average, a certain fraction f_e of errors.

One could extend the definition of discriminability to be the d' value that would give, in the gaussian case, the same fraction f_e of errors. This extension, being intimately related to the performance of a maximum likelihood decoding, is grounded on the same rationale as our subjective distance D. The problem is that if one is constrained to choose the d' scale to match the equivalent gaussian case, then the definition of discriminability does not obey the triangular inequality. It is easy to construct an example where P(s_k|s_i) overlaps in a certain region with P(s_k|s_j), which in turn overlaps in a different region with P(s_k|s_\ell), but such that P(s_k|s_i) and P(s_k|s_\ell) are never simultaneously different from zero. In this case, d'(s_i, s_j) and d'(s_j, s_\ell) are both finite, whereas d'(s_i, s_\ell) is infinite. Hence, though both D and d' can be defined in terms of maximum likelihood performance, D is a proper distance, whereas d' is not. Another well-known notion of distance can be defined (for continuous stimuli) in terms of the Fisher information metric tensor J. There is a natural scalar product associated with J and also a notion of distance in the space of stimuli (see Amari & Nagaoka, 2000). However, the entire Fisher geometry becomes meaningless for discrete stimuli. Our aim is to show that even in the discrete case, a definition of distance is possible.
The distance defined in terms of the Fisher metric tensor is a bilinear form. Its matrix elements are not constant but depend on the point q that one is interested in. Hence, distances are defined in terms of a curvilinear integral, which may actually involve very difficult calculations. When two stimuli have disjoint response sets, the Fisher distance between them diverges. However, a Fisher distance between two stimuli equal to infinity does not necessarily imply that the responses to those two stimuli form disjoint sets. The Fisher distance between two stimuli may diverge, for example, when the probability density u(x|s) is discontinuous. This implies a discrepancy with requirement 5. The Kullback-Leibler divergence (Cover & Thomas, 1991) between the vectors q^i and q^j can also be used to measure how differently two stimuli are perceived. It has the appealing property of being intimately related to many concepts in information theory, and as such, it has an information-based intuitive interpretation: it is a measure (in number of additional bits of the mean code length) of the inefficiency of assuming that the distribution of a given variable is q^i when its true distribution is q^j. It is not a distance, however, since it does not fulfill requirements 2 and 3. Its symmetrized version is sometimes called the Jensen-Shannon measure, which is still not a distance because it does not obey the triangular inequality 3. For this measure, condition 4 is always true. The maximum value of the Jensen-Shannon divergence is infinite. This value, however, is not only reached for pairs of stimuli whose response sets are disjoint, but also whenever there is a component k such that q_k^i = 0 and q_k^j \neq 0. This means requirement 5 is not in general fulfilled. There has been another previous proposal of a pseudo-distance (Treves, 1997), which was also defined in terms of the confusion matrix.
As opposed to D0 , and also to the Jensen-Shannon measure, the distance between stimuli s i and s j depends only on Q(s i |s j ), Q(s j |s i ), Q(s i |s i ) and Q(s j |s j ) (other stimuli do not appear). Just like the Jensen-Shannon divergence, however, it does not fulfill requirements 3 and 5. We claim that our measure allows a geometrical picture of a set of stimuli. Multidimensional scaling (Young & Hamer, 1994; Cox & Cox, 2000) also aims at this goal. The two approaches, however, bear certain differences. The use of D0 aims at a definition of a distance. It is particularly useful when one can vary a certain parameter in the experiment (in the example below, the number of cells being sampled) and wants to know the effect of that parameter in the structure of confusions. Nevertheless, D0 in itself does not provide a way of visualizing the stimuli in a particular space. Since the structure of confusions may depend on a very large number of factors (e.g., the semantic knowledge of the subject), there may be actually no small-dimensional space where the stimuli can be placed. Multidimensional scaling, in contrast, starts with a given matrix of distances, or dissimilarities. The algorithm is designed to place the stimuli in a finite dimensional space producing the minimum possible distortion of the
pairwise distances. Except for very particular sets of stimuli, it is an approximate method. What is gained is the optimal set of coordinates of the stimuli out of the matrix of distances. The subjective distance could be, in certain applications, a good starting point with which to feed the multidimensional scaling algorithm. Finally, we point out that our definition of subjective distance emphasizes the probability of confounding the stimuli, as opposed to other possible measures, that stress the differences in the elicited neural representations. In this sense, our approach is, broadly speaking, complementary to Victor and Purpura’s (1997) proposal of constructing a metric in the set of responses. There, different distances between spike trains were considered. Each proposed distance captured specific aspects of the neural response. For example, the distance between two spike trains was defined in terms of how different the timing of individual spikes was, or how different the interspike intervals were, and so forth. Their aim was to decide which of those distances could better cluster the neural responses corresponding to the different stimuli in their set and thus make inferences about the way stimuli were encoded into spike trains. In the back of this reasoning is the assumption that the stimuli themselves are all sufficiently different from one another. Here, in contrast, we are interested in the perceived distances and similarities of the stimuli, as can be deduced from the structure of confusions.
5 From Neural Representations to the Subjective Distance

In order to use the distances defined in section 3, the conditional probabilities Q(s_i|s_j) are needed. Such probabilities may be extracted, as described in section 1, from an experiment where the subject is asked to identify the stimuli. However, this is not the only way a matrix Q can be obtained. In the case of nonhuman subjects, the stimuli are often presented while the activity of one or several neurons is being recorded by microelectrodes implanted in the animal's brain. From these experiments, the conditional probability \rho(r^k|s_j) of recording response r^k during the presentation of stimulus s_j may be calculated. One way of deriving Q from \rho is to use a decoding procedure, that is, to define a mapping going from the set of neural responses to the set of stimuli (Rolls & Treves, 1998; Dayan & Abbott, 2001; Rieke et al., 1997). Different rules defining such a transformation give rise to different decoding procedures. Among them, one can point out the maximum a posteriori decoding, where r^k is associated with the stimulus s_j that maximizes \rho(s_j|r^k) = \rho(r^k|s_j)P(s_j)/\sum_\ell \rho(r^k|s_\ell)P(s_\ell), that is, the conditional probability of having presented stimulus s_j when response r^k was measured. This rule is, among all possible decoding rules, the one that maximizes the fraction of correct decodings.
Another widely used option is the Bayesian approach, where
\[
Q(s_j|s_i) = \sum_k \rho(s_j|r^k)\,\rho(r^k|s_i). \tag{5.1}
\]
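Equation 5.1 amounts to averaging the posterior over responses via Bayes' rule; a sketch with a made-up response matrix \rho(r^k|s_i) and uniform priors (both purely illustrative):

```python
def bayesian_confusion(rho, prior):
    """Equation 5.1: Q(s_j|s_i) = sum_k rho(s_j|r_k) rho(r_k|s_i).

    rho[i][k] = probability of response r_k given stimulus s_i.
    prior[i]  = probability of presenting stimulus s_i.
    """
    n_stim, n_resp = len(rho), len(rho[0])
    Q = [[0.0] * n_stim for _ in range(n_stim)]
    for i in range(n_stim):
        for j in range(n_stim):
            for k in range(n_resp):
                evidence = sum(prior[l] * rho[l][k] for l in range(n_stim))
                if evidence > 0.0:
                    post_j = prior[j] * rho[j][k] / evidence   # rho(s_j | r_k)
                    Q[i][j] += post_j * rho[i][k]
    return Q

# Hypothetical 2-stimulus, 3-response example with uniform priors:
rho = [[0.8, 0.2, 0.0],
       [0.0, 0.3, 0.7]]
Q = bayesian_confusion(rho, prior=[0.5, 0.5])
# Each row of Q is a probability distribution over decoded stimuli.
```

Note that each row of Q sums to 1, as it must for a confusion matrix, and the off-diagonal entries reflect the overlap of the two response distributions at r_1.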
Strictly speaking, in this case there is no decoding, since one does not choose a single stimulus for each response. One rather keeps all the probability distribution for each stimulus, given the neural response. The only assumption in the back of equation 5.1 is that the decoded stimulus is conditionally independent of the actual stimulus, that is, P(s j |r k , s i ) = P(s j |r k )P(r k |s i ). Passing from matrix ρ to matrix Q always means a loss of information. The amount of information that is lost can sometimes be estimated (Samengo, 2001). A maximum a posteriori decoding maximizes the fraction of correct identifications (that is, the trace f of the resulting Q) but probably loses some of the structure of ρ whenever the response r j is not the most probable response elicited by a given stimulus. A procedure like equation 5.1 is intended to preserve the statistical regularities of mistakes, but typically implies a lower fraction of correct choices f . Therefore, whenever neural responses are used to construct matrix Q, one must bear in mind that the results depend, at least partially, on how Q is calculated. The dependence of the Q matrix on the decoding may seem dangerous. Could one define a distance in terms of ρ without having to pass through matrix Q? Of course, this is possible in general terms. One option, for example, would be to calculate the mean response mi to stimulus i, and then define the distance between stimuli i and j as the distance Dn between the corresponding means.1 This is a distance between vectors living in the response set, not in D. The subindex n, hence, stands for neural. The distance Dn may be a sensible approach when one is interested in the neural representations of the stimuli. It is not very convenient, however, if the distance is meant to reflect the structure of confusions. As is shown below, Dn does not in general fulfill condition 5. Confusion is defined at the behavioral level; the subject is asked to identify the stimuli. 
At the neural level, one can talk about confusion between two stimuli only when a decoding procedure is introduced. In this sense, the process of decoding should not be viewed as an arbitrary step: one shifts the question of how distinguishable two stimuli are to the problem of inferring the stimulus from the neural response. To test whether a distance defined in the response set fulfills condition 5, we point out that for any reasonable decoding procedure, two stimuli that

¹ Of course, even more sophisticated distances are possible, for example, one taking into account the covariance matrix, or even higher moments of the distributions ρ(r_k|s_i). Here, we use only a very simple definition of a distance Dn in the response space, since we want to point out only that there is a qualitative difference between defining a distance in terms of the Q matrix and in terms of the ρ matrix.
982
D. Oliva, I. Samengo, S. Leutgeb, and S. Mizumori
elicit responses occupying convex, nonoverlapping portions of the response space are never confounded with one another. Is this enough to place them at the maximum Dn? In general, no. Since Dn is defined in terms of a distance in the response set, the more separated the responses to the two stimuli are, the larger Dn is. Hence, neural-based distances do not in general fulfill condition 5.
6 Application to Place Cells: Comparing the Actual and the Subjective Distance Between Two Locations in Space

It is known that many of the pyramidal cells in the rodent hippocampus fire selectively when the animal is in a particular location of its environment (O'Keefe & Dostrovsky, 1971). Such a response profile allows one to define, for each cell, a place field, that is, the region in space to which the neuron is responsive. A classic experiment in the study of rat hippocampal neurophysiology consists of letting the awake animal wander in a given environment while the activity of one or several of its hippocampal pyramidal cells is recorded (for an overview, see, e.g., O'Keefe & Dostrovsky, 1971, or Redish, 1999). In the experiment we analyze next, cells in the lateral septum were also recorded. The lateral septum receives a massive projection from hippocampal pyramidal neurons, and the activity of septal cells has also been shown to be informative about the location of the animal, though not as sharply as that of hippocampal neurons. The set of stimuli in this case consists of all the possible locations where the animal can be placed. This set is endowed with a natural metric: we know what the distance between two locations is. Other sets of stimuli, for example, a set of pictures of human faces, lack an obvious, natural distance. In order to test our definition of subjective distance, it is convenient to work with a metric set of stimuli, which allows the comparison of D0 with the natural physical distance in the set. A description of the experiment follows. Nine young adult Long-Evans rats were tested while moving in an eight-arm radial maze, as shown in Figure 3A. Each arm contained a small amount of chocolate milk in its distal part (for details, see Leutgeb & Mizumori, 2002). In a given trial, a rat initially placed in the center of the maze visited the eight arms in a random order (see Figure 3B), taking the food reward.
Several trials were recorded per animal, sometimes in darkness and sometimes in light conditions. Septal and hippocampal cells were registered simultaneously while the animal moved in the maze. Single units were separated using online and off-line spike-separation software and were then classified according to their anatomical location and the characteristics of their spikes. Hippocampal pyramidal cells and lateral septal cells were identified. In what follows, each arm of the maze is taken as a different stimulus and is labeled by its angle, as in Figure 3A. The center of the maze was excluded so as to have eight similar
A Subjective Distance Between Stimuli
983
Figure 3: (A) Eight-arm maze where the animals move. (B) Path traveled by a rat in a given trial, as measured by the diode on its head. The rat goes straight to the end of the arm and drinks the chocolate milk. When turning around, it typically sweeps its head in a circular movement, peeping outside the maze.
and evenly visited stimuli. Thus, in the experiment, the physical distance between any two stimuli is the (actual) angle between the corresponding arms. The subjective distance is deduced from the responses. As an example, in Figure 4 the location of the animal is shown whenever a lateral septal (see Figure 4A) and a hippocampal place cell (see Figure 4B) fire a spike. It is clear that both cells fire selectively when the animal is in a particular location on the maze, the septal cell having a somewhat more distributed response.
Figure 4: Location of the animal when a (A) given lateral septal and (B) hippocampal pyramidal neuron fired a spike. The dots that fall outside the maze indicate that the nose of the animal is peeping outside the walls of the labyrinth.
The decoding procedure and the calculation of the Q matrix were carried out first for single neurons and then for pairs, triplets, and sets of four cells measured simultaneously. The response of the animal was, correspondingly, a scalar or a two-, three-, or four-dimensional vector r, where the component r_c stands for the firing rate of cell c. Upon entrance to arm s_j, r_c was calculated as the ratio of the number of spikes fired by cell c to the time spent in arm s_j in that particular trial. To decode the arm corresponding to a given response r, the M stored responses most similar to r were taken into account. Those M responses included M_0 responses recorded with the animal in the arm at 0 degrees, M_45 responses recorded in the arm at 45 degrees, and so forth; that is, M = Σ_i M_i, with the response r of the present trial not included. The probability ρ(s_j|r) for the rat to be in arm s_j when response r was observed was set to M_j/M. The
Figure 5: Matrix of distances D0 for the (A) lateral septal cell of Figure 4A, and (B) hippocampal pyramidal cell of Figure 4B.
decoded arm was the one maximizing ρ(s_j|r). If there was a draw between two arms, the decoded arm was chosen at random between the two, with probabilities proportional to the corresponding priors. With this procedure, Q(s_i|s_j) is defined as the fraction of times s_i was decoded whenever s_j was the actual arm.
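This nearest-neighbor decoding scheme can be sketched as follows. The function and variable names, and the toy two-cell firing-rate data, are invented for illustration; the logic (take the M most similar stored responses, set ρ(s_j|r) = M_j/M, break draws with prior-proportional odds) follows the procedure described above:

```python
import numpy as np

def decode_arm(r, train_rates, train_arms, M, rng=np.random.default_rng(0)):
    """Decode the arm for response r from the M most similar stored responses.

    train_rates: (n_trials, n_cells) firing-rate vectors from the other trials.
    train_arms:  (n_trials,) arm label (angle in degrees) of each stored response.
    """
    # M responses most similar to r (Euclidean distance in firing-rate space)
    nearest = np.argsort(np.linalg.norm(train_rates - r, axis=1))[:M]
    arms, counts = np.unique(train_arms[nearest], return_counts=True)
    rho = counts / M                       # rho(s_j | r) = M_j / M
    best = arms[rho == rho.max()]
    if len(best) > 1:                      # draw: prior-proportional random choice
        priors = np.array([(train_arms == a).mean() for a in best])
        best = [rng.choice(best, p=priors / priors.sum())]
    return best[0]

# Toy data: 2 cells, 8 arms, 10 trials per arm, strongly arm-dependent rates.
rng = np.random.default_rng(1)
arms = np.repeat(np.arange(0, 360, 45), 10)
rates = np.column_stack([np.cos(np.radians(arms)), np.sin(np.radians(arms))])
rates = rates + 0.05 * rng.standard_normal(rates.shape)

# Leave-one-out decoding of the first trial (its true arm is 0 degrees)
decoded = decode_arm(rates[0], rates[1:], arms[1:], M=5)
print("decoded arm:", decoded)
```

Tallying, for each actual arm, the fraction of trials decoded as each candidate arm then yields the Q matrix.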
The procedure was carried out for several values of M ranging from 2 to 20. Each M gave rise to a different Q matrix. We observed that the fraction of correct decodings (the trace of Q) typically showed a maximum as a function of M; the M for which tr(Q) was maximal was taken as the final one. With this procedure, we obtained a Q matrix for each neuron, each pair of neurons, each triplet, and each quadruplet. In what follows, the results for the subjective distance as derived from the chosen Q matrices are shown. As examples of single-neuron behavior, we show in Figure 5 the matrix D0(q_i, q_j) for the same cells as in Figure 4. In each plot, the arm i is kept fixed while the arm j is varied (it does not really matter which is i and which is j, since the distance is symmetric). In all graphs, D0 = 0 corresponds to i = j. The plot in Figure 5A corresponds to a cell in the lateral septum. It may be seen that most arms are far away from the one at 135 degrees. Not surprisingly, this is precisely the arm where the cell fires most. If a strong, reliable response is obtained for a given stimulus, and if this response differs from the responses to all other stimuli, then there is little probability of misidentifying it. Among the remaining arms, the one closest to the arm at 135 degrees is the one at 0 degrees, and it corresponds to the second-largest firing rate. In the case of the hippocampal cell of Figures 4B and 5B, the arm that is farthest away from all others is the one at 0 degrees, where the place field is located. Surprisingly, however, its distinction from all other arms is not as clear as in the septal cell of Figures 4A and 5A, even though Figure 4B indicates that this cell is much more selective to the location of the animal. The fact is that Figure 4 alone is not enough to depict the selectivity of the cell, because it does not show the statistics.
The rat in Figures 4B and 5B entered the arm at 0 degrees 18 times, but in only half of those trials did the cell fire a burst of spikes; in the other half, there was no response. Therefore, a burst of spikes most probably means the animal is in the arm at 0 degrees, but a silent response does not rule out this arm. In all those trials in which there was no activity upon entrance to the arm at 0 degrees, this arm can well be confounded with any other arm. If all the silent entrances to the arm at 0 degrees are discarded, then the matrix of distances changes drastically, as shown in Figure 6. There, the distance from the arm at 0 degrees to any other arm is equal to one, implying no mistakes at all. This shows that trial-to-trial variability has an important influence on the distance between any two stimuli, even for stimuli that may elicit very strong responses. Every cell has a different spatial distribution of responses and, hence, a different matrix of subjective distances. In what follows, therefore, instead of analyzing the specific characteristics of individual cells, the average behavior is studied. In Figure 7, the average distance of a given arm to all the others is shown. The numbers on the x-axis indicate the angle separating the two arms under consideration. Thus, the average D0 between all pairs of arms
Figure 6: Matrix of distances D0 for the cell of Figure 4B, when all the trials with silent entrances to the arm at 0 degree have been removed.
that lie at 45 degrees from one another is shown as a single point in the plot. Each data point represents an average among cells (62 units were considered: 29 pyramidal cells in the hippocampus and 33 lateral septal cells). The error bars show the standard deviation of the cell average. In the lower curve, the arm is decoded from the activity of a single cell. As the curves rise, the decoding makes use of more cells (from 1 to 4) recorded simultaneously. The first thing to notice is that as the number of neurons increases, all the subjective distances grow. This means that the representations of the different arms become more and more distinct, and therefore the fraction of mistakes goes down. One can also observe that the single-neuron curve is essentially flat: there is no evident structure in the set of stimuli, since each arm has roughly the same probability of being confounded with any other arm. An analysis of variance (ANOVA) of the distances obtained for pairs of arms at 45 and 180 degrees shows that they are not significantly different (p > 0.3). In contrast, as the number of neurons increases, the curves begin to bend, showing larger distances for pairs of arms that lie farther apart. The ANOVA test shows that the distances obtained for pairs of arms at 45 degrees are significantly different from the ones at 180 degrees (p < 0.0001). This means that if the response of four neurons is considered, then it is more probable
Figure 7: Average distance of a given arm with all the others, numbered counterclockwise, as calculated from 62 units (29 hippocampal pyramidal neurons and 33 lateral septal cells). The error bars are the standard deviations from the averages. Stars, triangles, squares, and circles correspond to 1, 2, 3, and 4 cells, respectively.
to confound a given location with a nearby location than with a distal one. Topography, hence, seems to emerge from population coding, not from the single-cell response. This example shows that the subjective distance, when read from four simultaneously recorded cells, is a monotonic function of the true distance. This indicates that near in the actual world corresponds to near in the subjective perception, and far in the actual world corresponds to far in the subjective perception. In this sense, one can assert that there is a continuous mapping between the actual and subjective spaces. If the subjective distance were strictly equal to the actual distance, however, the upper curve of Figure 7 would show a linear rise from 0 to 180 degrees and a similar linear decrease from 180 to 360 degrees. The curved shape of the upper trace of Figure 7 indicates that the scale of the subjective representation somehow shrinks as the actual stimulus moves farther away. This, in turn, shows that the system is better suited to making fine distinctions between nearby stimuli than between distal ones. The single-cell response instead makes no distinction at all between near and far. It
only recognizes whether the two arms are the same or not. All these nontrivial characteristics of the coding properties of hippocampal and septal cells have been visualized in terms of our definition of the subjective distance.
7 Summary

The subjective distance as introduced here is a way of measuring how differently stimuli are perceived. The distance between any two elements may be interpreted in terms of the average performance when trying to infer the actual stimulus if only the response of the subject is known. In this inference, the trial-to-trial variability of the responses to each stimulus is as important as the mean responses. It should be noted that the probability of confusion depends not only on the characteristics of the two stimuli under consideration but also on all the other stimuli in the set S. Hence, the distance between two given items may vary when the remaining stimuli in S are modified. Here, as in many other information or discrimination analyses, the choice of the set of stimuli is a highly relevant (and sometimes difficult) issue in itself, which should not be neglected. The distances D and D0 may have several applications; for example, it may be of interest to compare the subjective distance with some other, objective measure of distance. Here, as shown in section 6, D0 has proven useful in showing that an increasingly topographic encoding of spatial location arises as the number of cells grows: as the population of neurons increases, the topography of real space seems to emerge. There are other cases, though, where the subjective perception of certain objects erects clear boundaries between stimuli that are actually near in the so-called physical space. For example, it is known that during the first year of life, the exposure of infants to their mother tongue builds up a very particular way of perceiving phonetic information. Two sounds that are physically very similar but correspond to two different phonemes in the child's language are easily discriminated.
Yet the researcher can design two sound waves that differ even more in their physical characteristics but that the infant cannot distinguish, simply because they are not distinct building blocks (phonemes) of his or her own language (see, e.g., Kuhl, 1994). Another example is the perception of facial expression (Young et al., 1997): although the researcher can continuously morph the picture of a happy face into that of an angry one, human observers tend to categorize the morphs into distinct emotions (full happiness or full anger). Our subjective distance would be a good way to quantify these effects. Another possible application would be to use the subjective distance as the input to a multidimensional scaling algorithm. This would allow placing
the stimuli in a finite-dimensional space and gaining further geometrical intuition about them.

Acknowledgments

We thank Alessandro Treves for his very useful suggestions. This work has been supported by a fellowship of the Alexander von Humboldt Foundation, a grant of Fundación Antorchas, Human Frontier Science Program grant No. RG 01101998B, and NIMH grant 58755.

References

Amari, S., & Nagaoka, H. (2000). Methods of information geometry. New York: Oxford University Press and American Mathematical Society.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
Cox, T. F., & Cox, M. A. (2000). Multidimensional scaling (2nd ed.). London: Chapman and Hall.
Dayan, P., & Abbott, L. F. (2001). Theoretical neuroscience: Computational and mathematical modeling of neural systems. Cambridge, MA: MIT Press.
Green, D. M., & Swets, J. A. (1966). Signal detection theory and psychophysics. New York: Wiley.
Kuhl, P. K. (1994). Learning and representation in speech and language. Curr. Op. Neurobiol., 4, 812–822.
Leutgeb, S., & Mizumori, S. J. (2002). Context-specific spatial representations by lateral septal cells. J. Neurosci., 112(3), 655–663.
O'Keefe, J., & Dostrovsky, J. (1971). The hippocampus as a spatial map: Preliminary evidence from unit activity in the freely moving rat. Brain Res., 34, 171–175.
Redish, A. D. (1999). Beyond the cognitive map: From place cells to episodic memory. Cambridge, MA: MIT Press.
Rieke, F., Warland, D., de Ruyter van Steveninck, R., & Bialek, W. (1997). Spikes: Exploring the neural code. Cambridge, MA: MIT Press.
Rolls, E. T., & Treves, A. (1998). Neural networks and brain function. New York: Oxford University Press.
Samengo, I. (2001). The information loss in an optimal maximum likelihood decoding. Neural Comput., 14, 771–779.
Treves, A. (1997). On the perceptual structure of face space. Biosystems, 40, 189–196.
Victor, J. D., & Purpura, K. P. (1997).
Metric-space analysis of spike trains: Theory, algorithms and application. Network: Comput. Neural Syst., 8, 127–164.
Young, F. W., & Hamer, R. M. (1994). Theory and applications of multidimensional scaling. Hillsdale, NJ: Erlbaum.
Young, A. W., Rowland, D., Calder, A. J., Etcoff, N. L., Seth, A., & Perrett, D. I. (1997). Facial expression megamix: Tests of dimensional and category accounts of emotion recognition. Cognition, 63, 271–313.
Received April 30, 2003; accepted August 10, 2004.
NOTE
Communicated by Emilio Salinas
Gain Control by Concerted Changes in IA and IH Conductances

Denis Burdakov
[email protected] Neurosciences Division, Biological Sciences, University of Manchester, Manchester M13 9PT, U.K.
Stability of intrinsic electrical activity and modulation of input-output gain are both important for neuronal information processing. It is therefore of interest to define biologically plausible parameters that allow these two features to coexist. Recent experiments indicate that in some biological neurons, the stability of spontaneous firing can arise from coregulated expression of the electrophysiologically opposing IA and IH currents. Here, I show that such balanced changes in IA and IH dramatically alter the slope of the relationship between firing rate and driving current in a Hodgkin-Huxley-type model neuron. Concerted changes in IA and IH can thus control neuronal gain while preserving intrinsic activity.

1 Introduction

Maintaining stable intrinsic activity patterns in the absence of relevant stimuli is thought to be important for proper information processing in neurons and neural circuits (Turrigiano & Nelson, 2000; Davis & Bezprozvanny, 2001; Marder & Prinz, 2002). But the ability of neurons to change gain, that is, to alter the slope of the relationship between driving stimulus and firing response, is also critical for computation in diverse neural systems (Salinas & Thier, 2000; Salinas & Sejnowski, 2001). It is therefore of interest to define biologically plausible parameters that allow the gain of a neuron to be modulated without changes in its intrinsic (stimulus-independent) activity. It was recently demonstrated, using biological and model neurons, that one way to achieve such "silent" modulation of gain is by covarying the frequencies of excitatory and inhibitory background synaptic input currents (Chance, Abbott, & Reyes, 2002).
Another recent report indicated that biological neurons can intrinsically covary the magnitudes of certain excitatory and inhibitory ionic currents: increasing the expression of the transient outward current (IA) led to proportional increases in the hyperpolarization-activated inward current (IH) in lobster stomatogastric neurons (MacLean, Zhang, Johnson, & Harris-Warrick, 2003). This biological coregulation of IA and IH did not result in significant changes in intrinsic firing properties (MacLean et al., 2003). Neural Computation 17, 991–995 (2005)
© 2005 Massachusetts Institute of Technology
992
D. Burdakov
Therefore, this note addresses the question of whether concerted changes in IA and IH can act as a cellular mechanism for "silent" gain modulation.

2 Results and Discussion

To explore the effects of concerted changes in IA and IH on neuronal gain, a recently published model of lobster stomatogastric neurons (Prinz, Thirumalai, & Marder, 2003) was used. This is a Hodgkin-Huxley-type, single-compartment model that comprises seven membrane currents (INa, ICaS, IA, IKCa, IKd, IH, and Ileak) and an intracellular calcium buffer, and exhibits tonic intrinsic firing (Prinz et al., 2003). All maximal conductances (except those for IA and IH, which were varied as indicated) were as in Prinz et al., and simulations were performed using MATLAB's stiff-system integrator ode15s at a time resolution of 25 µs. The maximal conductances (g) of IA and IH were changed from the values in Prinz et al. (10 and 0.05 mS/cm², respectively) so as to keep the intrinsic firing characteristics (tonic pattern and firing frequency) unaltered. The relationship between the two conductances under this condition is shown in Figure 1C. Changes in gA and gH markedly altered the firing responses of the model neuron to a tonic excitatory current: balanced increases in gA and gH decreased the firing response (see Figures 1A and 1B). This implies that the concerted changes in gA and gH selectively altered the gain, since the tonic nature of firing and the firing frequency remained unaffected. To quantify the changes in gain in a simplified but plausible way, the firing rate (f) was plotted against the driving excitatory current (I) for
Figure 1: (A, B) Examples of simulations illustrating that concerted changes in IA and IH affect neuronal gain without changes in unstimulated firing rate or pattern. (A) Neuron with a small IA and IH (gA and gH are 10 and 0.05 mS/cm2 , respectively) responds to a tonic excitatory current with a large increase in firing rate. (B) Increasing IA and IH in a balanced manner that leaves unstimulated firing unaltered (here, gA and gH are 50 and 1 mS/cm2 , respectively) leads to a marked reduction of the firing response to the same excitatory current. Simulation protocols used to elicit IA , IH , and firing responses in A and B are shown schematically below the corresponding traces. (C) The relationship between gA and gH that satisfies the condition of unaltered intrinsic firing characteristics. (D) Examples of f-I relationships of neurons with different balanced combinations of gA and gH (respectively, in mS/cm2 : black circles = 10 and 0.05, white squares = 20 and 0.18, black triangles = 50 and 1, white diamonds = 100 and 6). (E) Gain values (the slopes of f-I relationships such as those shown in D) plotted against the sum of gA and gH for combinations of gA and gH that do not alter intrinsic firing characteristics.
a range of combinations of gA and gH that resulted in the same intrinsic firing rate. Balanced increases in gA and gH led to progressive decreases in the gain (slope) of the f-I relationship (see Figure 1D). The gain was altered most steeply when the sum of gA and gH was below 100 mS/cm² (see Figure 1E); this conductance range is consistent with physiological gA and gH values in lobster stomatogastric neurons (MacLean et al., 2003; Prinz et al., 2003). Decreases in gain saturated when the sum of gA and gH exceeded about 100 mS/cm² (see Figure 1E). These results indicate that concerted changes in IA and IH may allow neurons to vary gain while preserving intrinsic activity patterns. To the best of my knowledge, such a cellular mechanism of gain control has not been reported previously. The mechanism is biologically plausible (MacLean et al., 2003) and could be of general physiological importance, considering that IA and IH are expressed together in many types of biological neurons and are under the control of a variety of neuromodulators (Yang et al., 2001; Ramakers & Storm, 2002; Schweitzer, Madamba, & Siggins, 2003; Frere & Luthi, 2004).
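The gain measurement itself reduces to a linear fit of the f-I relationship. A minimal sketch: the firing-rate numbers below are made up for illustration (in the note they come from the Prinz et al. model at different balanced gA/gH settings), but the slope extraction is the same:

```python
import numpy as np

def fi_gain(currents, rates):
    """Gain = slope of the firing-rate vs. driving-current (f-I) relationship."""
    slope, _intercept = np.polyfit(currents, rates, 1)
    return slope

# Hypothetical f-I data (spikes/s vs. nA) for two gA/gH settings.
I = np.array([0.0, 0.1, 0.2, 0.3, 0.4])
f_low_g = np.array([10.0, 18.0, 26.0, 34.0, 42.0])   # small gA, gH: steep f-I
f_high_g = np.array([10.0, 12.0, 14.0, 16.0, 18.0])  # large gA, gH: shallow f-I

print("gain (small gA, gH):", fi_gain(I, f_low_g))
print("gain (large gA, gH):", fi_gain(I, f_high_g))
```

Note that both toy curves start from the same unstimulated rate (10 spikes/s at I = 0), mirroring the "silent" character of the modulation: only the slope changes.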
References

Chance, F. S., Abbott, L. F., & Reyes, A. D. (2002). Gain modulation from background synaptic input. Neuron, 35(4), 773–782.
Davis, G. W., & Bezprozvanny, I. (2001). Maintaining the stability of neural function: A homeostatic hypothesis. Annu. Rev. Physiol., 63, 847–869.
Frere, S. G., & Luthi, A. (2004). Pacemaker channels in mouse thalamocortical neurons are regulated by distinct pathways of cAMP synthesis. J. Physiol., 554, 111–125.
MacLean, J. N., Zhang, Y., Johnson, B. R., & Harris-Warrick, R. M. (2003). Activity-independent homeostasis in rhythmically active neurons. Neuron, 37(1), 109–120.
Marder, E., & Prinz, A. A. (2002). Modeling stability in neuron and network function: The role of activity in homeostasis. Bioessays, 24(12), 1145–1154.
Prinz, A. A., Thirumalai, V., & Marder, E. (2003). The functional consequences of changes in the strength and duration of synaptic inputs to oscillatory neurons. J. Neurosci., 23(3), 943–954.
Ramakers, G. M., & Storm, J. F. (2002). A postsynaptic transient K(+) current modulated by arachidonic acid regulates synaptic integration and threshold for LTP induction in hippocampal pyramidal cells. Proc. Natl. Acad. Sci. U.S.A., 99(15), 10144–10149.
Salinas, E., & Sejnowski, T. J. (2001). Gain modulation in the central nervous system: Where behavior, neurophysiology, and computation meet. Neuroscientist, 7(5), 430–440.
Salinas, E., & Thier, P. (2000). Gain modulation: A major computational principle of the central nervous system. Neuron, 27(1), 15–21.
Schweitzer, P., Madamba, S. G., & Siggins, G. R. (2003). The sleep-modulating peptide cortistatin augments the H-current in hippocampal neurons. J. Neurosci., 23(34), 10884–10891.
Turrigiano, G. G., & Nelson, S. B. (2000). Hebb and homeostasis in neuronal plasticity. Curr. Opin. Neurobiol., 10(3), 358–364. Yang, F., Feng, L., Zheng, F., Johnson, S. W., Du, J., Shen, L., Wu, C. P., & Lu, B. (2001). GDNF acutely modulates excitability and A-type K(+) channels in midbrain dopaminergic neurons. Nat. Neurosci., 4(11), 1071–1078.
Received August 23, 2004; accepted October 1, 2004.
NOTE
Communicated by Klaus Obermayer
Winner-Relaxing Self-Organizing Maps

Jens Christian Claussen
[email protected]
Institut für Theoretische Physik und Astrophysik, Christian-Albrechts-Universität zu Kiel, 24098 Kiel, Germany
A new family of self-organizing maps, the winner-relaxing Kohonen algorithm, is introduced as a generalization of a variant given by Kohonen in 1991. The magnification behavior is calculated analytically. For the original variant, a magnification exponent of 4/7 is derived; the generalized version allows steering the magnification over the wide range from exponent 1/2 to 1 in the one-dimensional case, thus providing optimal mapping in the sense of information theory. The winner-relaxing algorithm requires minimal extra computation per learning step and is conveniently easy to implement.

1 Introduction

The self-organizing map (SOM) algorithm (Kohonen, 1982) has served both as a model for topology-preserving primary sensory processing in the cortex (Obermayer, Blasdel, & Schulten, 1992) and as a tool for technical applications (Ritter, Martinetz, & Schulten, 1992). Self-organizing feature maps map an input space, such as the retina or skin receptor fields, onto a neural layer by feedforward structures with lateral inhibition. Defining properties are topology preservation, error tolerance, plasticity, and self-organized formation by a local process. Compared to other clustering algorithms and vector quantizers, its apparent advantage for data visualization and exploration is its approximate topology preservation. In contrast to the elastic net (Durbin & Willshaw, 1987) and the Linsker (1989) algorithm, which perform gradient descent in a certain energy landscape, the Kohonen algorithm lacks an energy function in the general case of a continuous input distribution. Although the learning process can be described in terms of a Fokker-Planck equation (Ritter & Schulten, 1988), the expectation value of the learning step is a nonconservative force (Obermayer et al., 1992) driving the process, so that it has no associated energy function. Despite much research, the relationship of the Kohonen model and its variants to general principles remains an open field (Kohonen, 1991).
Neural Computation 17, 996–1009 (2005)

1.1 Kohonen's Self-Organizing Feature Map. Kohonen's self-organizing map is defined as follows. Every stimulus v of an input space V
is mapped to a "center of excitation," or winner,

s = argmin_{r∈R} |w_r − v|,   (1.1)

where |·| denotes the Euclidean distance in input space. In the Kohonen model, the learning rule for each synaptic weight vector w_r is given by

δw_r = η · g_{rs} · (v − w_r),   (1.2)
where g_{rs} defines the neighborhood relation in R; throughout this article, it will be a gaussian function of the Euclidean distance |r − s| in the neural layer. Topology preservation is enforced by the common update of all weight vectors whose neuron r is adjacent to the center of excitation s; the adjacency function g_{rs} prescribes the topology in the neural layer. The learning rate η is usually decreased during the process.

1.2 The Winner-Relaxing Kohonen Algorithm. We now consider an energy function V first proposed in Ritter et al. (1992). For a discrete input space, the potential function for the expectation value of the learning step is given by V({w}) =
(1/2) Σ_{r,s} g^γ_{rs} Σ_{µ: v^µ ∈ F_s({w})} p(v^µ) · |v^µ − w_r|²,   (1.3)
where F_s({w}) is the cell of the Voronoi tessellation (or Dirichlet tessellation) of input space defined by equation 1.1. For a discrete input space, where p(v) is a sum over delta peaks δ(v − v^µ), the first derivative with respect to w_r is not continuous at those weight configurations where a border of the Voronoi tessellation shifts over one of the input vectors (see Figure 1). Moreover, equation 1.3 rests on the assumption that none of the borders of the Voronoi tessellation shifts over a pattern vector v^µ, which may be fulfilled in the final convergence phase for discrete input spaces but becomes problematic if there are more receptor positions than neurons. If p(v) is continuous, the sum over µ becomes an integral, and with every stimulus-vector update the surrounding Voronoi borders slide over stimuli (which then become represented by another weight vector), so that there is no global energy function in the general case. We remark that replacing the crisp (or hard) winner selection, equation 1.1, by a soft winner s = argmin_r Σ_{r'} g_{rr'} |w_{r'} − v|² minimizes equation 1.3 even in the continuous case (Graepel, Burger, & Obermayer, 1997; Heskes, 1999). This is a formally elegant approach if one wants to ensure the existence of an energy function and accepts modifying the winner selection. However, to motivate winner-relaxing learning, we return to the hard winner-selection scheme, equation 1.1, and take up the learning rule given
J. Claussen
Figure 1: Shift of Voronoi borders as an effect of weight vector update.
by Kohonen (1991). Our use of this ansatz, however, is justified here only a posteriori by its use for adjusting the magnification. From the shift of the borders of the Voronoi tessellation F_s({w}) (see Figure 1) in evaluation of the gradient with respect to a weight vector w_r, Kohonen (1991) derived for the (approximated) gradient descent in V the additive term $-\frac{1}{2}\eta\,\delta_{rs}\sum_{r'\neq s} g_{r's}(v - w_{r'})$, extending equation 1.2 for the winning neuron. As it implied an additional elastic relaxation, it was straightforward to call it the “winner-relaxing” (WR) Kohonen algorithm (Claussen, 1992). In the remainder of this article, we study the (generalized) winner-relaxing Kohonen algorithm, or winner-relaxing self-organizing map (WRSOM), introduced first in Claussen (1992), as

$$\delta w_r = \eta \Big( (v - w_r)\, g^{\gamma}_{rs} \;-\; \lambda\, \delta_{rs} \sum_{r' \neq s} g^{\gamma}_{r's} (v - w_{r'}) \Big), \qquad (1.4)$$
where s is the center of excitation for incoming stimulus v, and g^γ_{rs} is a gaussian function of distance in the neural layer with characteristic length γ. Here, λ is a free parameter of the algorithm. The original algorithm (associated with the potential function) proposed by Kohonen in 1991 is obtained
Winner-Relaxing Self-Organizing Maps
Table 1: Magnification Laws for One-Dimensional Maps.

  Elastic net:    1/(1 + κ/(P σ² J))  (no power law)
  VQ, NG:         P^{1/3}
  WRK, λ = 1/2:   P^{4/7}
  SOM, λ = 0:     P^{2/3}
  WRK, λ = −1:    P^{1}
  Linsker:        P^{1}
for λ = +1/2, whereas the classical SOM algorithm is obtained for λ = 0. The influence of λ on the magnification behavior is the central issue of this article.

1.3 The Magnification Factor. The magnification factor is defined as the density of neurons r (i.e., the density of synaptic weight vectors w_r) per unit volume of input space, and therefore is given by the inverse Jacobi determinant of the mapping from input space to neuron layer: M = |J|^{−1} = |det(dw/dr)|^{−1}. We assume the input space to be continuous and of the same dimension as the neural layer, and the map to be noninverting (J > 0). The magnification factor quantifies the network's response to a given probability density of stimuli P(v). To evaluate M in higher dimensions, one in general has to compute the equilibrium state of the entire network and therefore needs complete global knowledge of P(v), except for separable cases. For one-dimensional mappings, the magnification factor can follow a universal magnification law; that is, M(w̄(r)) is a function of the local probability density P only, independent of both the location r in the neural layer and the location w̄(r) in input space. It is nontrivial whether there exists a power law; the elastic net obeys a universal magnification law that remarkably is not a power law (Claussen & Schuster, 2002), due to a nonvanishing elastic tension in regions of small input density. For the classical Kohonen algorithm, the magnification law is given by a power law M(w̄(r)) ∝ P(w̄(r))^ρ with exponent ρ = 2/3 (Ritter & Schulten, 1986). (See Table 1 for an overview.) For a discrete neural layer and different neighborhood kernels, corrections apply (Ritter, 1991; Ritter et al., 1992; Dersch & Tavan, 1995). As the brain is assumed to be optimized by evolution for information processing, one could conjecture that maximal mutual information can define an extremal principle governing the setup of neural structures.
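For concreteness, the learning rule of equation 1.4 can be transcribed directly into numpy for a 1D chain of neurons; the chain length, stimulus, and parameter values below are illustrative choices, not settings used in this article.

```python
import numpy as np

def wrsom_step(w, v, eta=0.1, gamma=1.0, lam=0.0):
    """One winner-relaxing SOM update (equation 1.4) on a 1D neural chain.

    w:   weight vector, one scalar weight w_r per neuron
    v:   scalar input stimulus
    lam: winner-relaxing parameter (lam = 0: classical SOM,
         lam = 1/2: Kohonen's 1991 rule, lam = -1: winner enhancing)
    """
    idx = np.arange(len(w))
    s = int(np.argmin(np.abs(w - v)))              # center of excitation
    g = np.exp(-0.5 * ((idx - s) / gamma) ** 2)    # gaussian neighborhood g^gamma_rs
    dw = eta * g * (v - w)                         # ordinary Kohonen term
    # winner-relaxing term: -eta * lam * sum_{r' != s} g^gamma_{r's} (v - w_{r'})
    dw[s] -= lam * (dw.sum() - dw[s])
    return w + dw

w0 = np.array([0.0, 0.5, 1.0])
w_som = wrsom_step(w0, 0.45, lam=0.0)   # winner (index 1) moves toward v: 0.5 -> 0.495
w_wrk = wrsom_step(w0, 0.45, lam=0.5)   # only the winner's update differs
```

For λ = 0 the relaxing term vanishes and the classical update is recovered; for λ ≠ 0 only the winning neuron receives the additional term.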
For feedforward neural structures with lateral inhibition, an algorithm of maximal mutual information has been defined by Linsker (1989) using the gradient descent in mutual information. It requires computationally costly integrations and has a highly nonlocal learning rule; therefore, it is neither favorable as a model for biological maps nor feasible for technical applications. Due to realization constraints, both technical applications and cortical networks (Plumbley, 1999) are not necessarily capable of reaching this
optimum. Even if one had experimental data on the magnification behavior, the question from what self-organizing dynamics neural structures emerge remains. Overall, it is desirable to find learning rules that maximize mutual information in a simpler way. An optimal map from the view of information theory would reproduce the input probability exactly (M ∼ P(v)^ρ with ρ = 1), being equivalent to the condition that all neurons in the layer are firing with the same probability. This defines an equiprobabilistic mapping (van Hulle, 2000). An exponent ρ = 0, on the other hand, corresponds to a uniform distribution of weight vectors, or no adaptation at all. So the magnification exponent is a direct indicator of how far an SOM algorithm is away from the optimum predicted by information theory.

2 Magnification Exponent of the Winner-Relaxing Kohonen Algorithm

We now derive the magnification law of the winner-relaxing Kohonen algorithm, equation 1.4, for the case of a 1D→1D map. Note that for higher dimensions, analytical results can be obtained only for special degenerate cases of the input probability density and therefore lack generality. The necessary condition for the final state of the algorithm is that the expectation value of the learning step vanishes for all neurons r:

$$0 = \int dv\; p(v)\, \delta \bar{w}_r(v) \qquad \forall\, r \in R. \qquad (2.1)$$
Since this expectation value is equal to the learning step of the pattern-parallel rule, equation 2.1 is the stationary state condition for both serial and parallel updating and also for batch updating. Thus, we can proceed with these variants simultaneously. (As synaptic plasticity is widely assumed to be based on integrative effects, one could claim that a parallel model is sufficient.) The update rule, equation 1.4, can be extended by an additional diagonal term controlled by µ:¹

$$\delta w_r = \eta \Big( (v - w_r)\, g^{\gamma}_{rs} \;+\; \mu\, (v - w_r)\, \delta_{rs} \;-\; \lambda\, \delta_{rs} \sum_{r' \neq s} g^{\gamma}_{r's} (v - w_{r'}) \Big). \qquad (2.2)$$

¹ Whereas the extra term controlled by the parameter µ has been introduced in Claussen (1992) for pure generality and will be kept within the derivation, it does not contribute to the magnification. In general, the setting µ = 0 is recommended (and probably most stable), and the winner-relaxing Kohonen algorithm thus has only one relevant control parameter, λ.
By insertion of the update rule, equation 2.2, one obtains

$$0 = \int ds\; P(\bar w(s))\, J(s)\, g^{\gamma}_{rs}\, (\bar w(s) - \bar w(r)) \;+\; (\mu + \lambda) \cdot \underbrace{\int ds\; P(\bar w(s))\, J(s)\, \delta_{rs}\, (\bar w(s) - \bar w(r))}_{\equiv\, 0} \;-\; \lambda \iint ds\, dr'\; P(\bar w(s))\, J(s)\, \delta_{rs}\, g^{\gamma}_{r's}\, (\bar w(s) - \bar w(r'))$$

$$\phantom{0} = \int ds\; P(\bar w(s))\, J(s)\, g^{\gamma}_{rs}\, (\bar w(s) - \bar w(r)) \;+\; \lambda\, P(\bar w(r))\, J(r) \int dr'\; g^{\gamma}_{rr'}\, (\bar w(r') - \bar w(r)). \qquad (2.3)$$
The derivation can be performed analogously to Ritter et al. (1992). In the continuum limit, there is always an exactly matching winning weight vector w̄_s = v. Further, the integration variable is substituted, dv = dw̄_s = J(s) ds, and we define the abbreviation P̄ := P(w̄(r)). In the first integrand, P̄J has to be expanded in powers of q := s − r. Within the second integral, P̄J is evaluated only at r. Thus, the integration yields, in leading order in q:
$$0 = \gamma^2 \left[ \frac{d(\bar P J)}{dr}\, \frac{d\bar w}{dr} + \frac{1}{2}\, \bar P J\, \frac{d^2 \bar w}{dr^2} \right] + \lambda\, \bar P J \int dq\; g^{\gamma}_{0q} \Big( \underbrace{q\, \frac{d\bar w}{dr}}_{\text{contribution } 0} + \frac{q^2}{2}\, \frac{d^2 \bar w}{dr^2} \Big)$$

$$0 = \gamma^2 \left[ \frac{d(\bar P J)}{dr}\, J + \frac{\bar P J}{2}\, \frac{dJ}{dr} + \lambda\, \frac{\bar P J}{2}\, \frac{dJ}{dr} \right]. \qquad (2.4)$$
Further, we have to require γ ≠ 0, P̄ ≠ 0, dP̄/dr ≠ 0. Then the ansatz of a universal local magnification law J(r) = J(P̄(r)) (that is, J depends only on the local value of P, which may be expected for the one-dimensional case only) requires J to fulfill the differential equation

$$0 = \frac{J}{\bar P} + \left( 1 + \frac{1}{2} + \frac{\lambda}{2} \right) \frac{dJ}{d\bar P} \qquad (2.5)$$

or

$$\frac{dJ}{d\bar P} = -\frac{2}{3 + \lambda}\, \frac{J}{\bar P}. \qquad (2.6)$$
Figure 2: Impact of the parameter λ on the magnification exponent. The cases λ = 1/2 (Kohonen, 1991), the SOM case λ = 0 (Kohonen, 1982), and the winner-enhancing choice λ = −1 are marked with filled circles.
It has a power law solution (provided that λ ≠ −3), which verifies the ansatz made above, J being a function of the local density only:

$$M = \frac{1}{J} \sim P(v)^{\frac{2}{3+\lambda}}. \qquad (2.7)$$
Thus, the magnification exponent is given by 2/(3 + λ) and can be tuned from 1/2 to 1 (see Figure 2) within the range of stability. For the λ = 1/2 choice of the winner-relaxing Kohonen algorithm, the magnification factor follows an exact power law with magnification exponent ρ = 4/7, which is smaller than ρ = 2/3 for the classical self-organizing feature map (Ritter & Schulten, 1986) but still much larger than ρ = 1/3 for vector quantization and neural gas. In any case, the maps resulting from the choices λ = 1/2 and λ = 0 are not optimal in terms of information theory.
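The tuning range follows directly from ρ(λ) = 2/(3 + λ); the few lines below reproduce the exponents quoted here and the theory row of Table 2.

```python
def rho(lam):
    """Magnification exponent of the winner-relaxing SOM, equation 2.7."""
    return 2.0 / (3.0 + lam)

assert rho(0.5) == 4 / 7      # winner-relaxing Kohonen (WRK)
assert rho(0.0) == 2 / 3      # classical SOM
assert rho(-1.0) == 1.0       # winner-enhancing choice: optimal exponent
# theory row of Table 2, rounded to two decimals
lams = [-1, -0.75, -0.5, -0.25, 0, 0.25, 0.5, 0.75, 1]
assert [round(rho(l), 2) for l in lams] == [1.0, 0.89, 0.8, 0.73, 0.67, 0.62, 0.57, 0.53, 0.5]
```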
3 Enhancing the Magnification

From this result, one would try to invert the relaxing effect by choosing negative values for λ, which means to “enforce” the winner. In fact, the choice λ = −1 leads to the magnification exponent 1. The magnification law, equation 2.7, is verified numerically, as shown in Table 2. Apart from the fact that the exponent can be varied between 1/2 and 1 by a priori parameter choice, the simulations show that our
Table 2: Magnification Exponent of the Winner-Relaxing Algorithms, Determined Numerically from a Sample Setup with 200 Neurons, 2 · 10^7 Update Steps, and a Learning Rate of 0.1.

  γ \ λ      −1        −3/4      −1/2      −1/4       0        1/4       1/2       3/4        1
  0.1      0.29±.04  0.29±.04  0.23±.04  0.29±.04  0.27±.05  0.25±.04  0.26±.04  0.27±.04  0.27±.05
  0.5      0.49±.02  0.46±.01  0.43±.02  0.45±.01  0.43±.01  0.40±.01  0.39±.02  0.37±.01  0.34±.01
  1.0      0.75±.04  0.77±.02  0.68±.02  0.67±.02  0.61±.01  0.58±.01  0.58±.01  0.53±.01  0.51±.01
  2.0      0.93±.03  0.86±.02  0.77±.02  0.71±.01  0.65±.01  0.61±.01  0.57±.01  0.53±.01  0.50±.01
  5.0      0.99±.05  0.88±.04  0.80±.03  0.72±.02  0.66±.02  0.61±.02  0.57±.02  0.53±.02  0.50±.02
  Theory   1.00      0.89      0.80      0.73      0.67      0.62      0.57      0.53      0.50
Notes: The input space was the unit interval; the stimulus probability density was chosen exponentially as exp(−βw) with β = 4. After an adaptation period of 5 · 10^7 learning steps, a further 10% of the learning steps were used to calculate the average slope, and its fluctuation, of log J as a function of log P. (The first and last 10% of neurons were excluded to eliminate boundary effects.) The small numbers denote the fluctuation of the exponent through the final 10% of the experiment. For small γ, the neighborhood interaction becomes too weak. If the gaussian neighborhood extends over several neurons (γ = 5), the exponent follows the predicted dependence on λ given by 2/(3 + λ). For |λ| > 1, the system is unstable; this is the case where the additional update term of the winner is larger than the sum over all other update terms in the network. Tuning of the parameter µ did not seem to extend the region of stability. As the relaxing effect is inverted for λ < 0, fluctuations are larger than in the Kohonen case.
winner-relaxing algorithm is able to establish information theoretically optimal SOMs in the “winner-enforcing” case (λ < 0).
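A drastically scaled-down version of this numerical experiment (far fewer neurons and update steps than the published setup, with otherwise arbitrary parameter choices) already shows the qualitative effect: with an exponential stimulus density, the converged weight vectors crowd into the high-density region.

```python
import numpy as np

rng = np.random.default_rng(1)
n, beta, gamma, eta, lam = 50, 4.0, 3.0, 0.05, 0.5   # lam = 1/2: WRK rule
idx = np.arange(n)
w = np.linspace(0.0, 1.0, n)                  # ordered equidistant start

for _ in range(100_000):
    # draw v from p(v) ~ exp(-beta*v) on [0, 1] by inverse transform sampling
    u = rng.random()
    v = -np.log(1.0 - u * (1.0 - np.exp(-beta))) / beta
    s = int(np.argmin(np.abs(w - v)))         # winner
    g = np.exp(-0.5 * ((idx - s) / gamma) ** 2)
    dw = eta * g * (v - w)
    dw[s] -= lam * (dw.sum() - dw[s])         # winner-relaxing term of eq. 1.4
    w += dw

spacing = np.diff(np.sort(w))
# weight vectors are denser where p(v) is large (near 0) than in the sparse tail
assert spacing[:10].mean() < spacing[-10:].mean()
```

Estimating the exponent itself to the precision of Table 2 would require a setup of the published size; the sketch only demonstrates the density-following behavior.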
4 Ordering Time and Stability Region

At least for the 2D→2D case, the winner-relaxing Kohonen algorithm was reported to be somewhat faster (Kohonen, 1991) in the initial ordering process. In a 1D→1D sample setup (Claussen, 2003), a marginally quicker ordering was observed for negative λ, at least at a relatively high learning rate η = 1. As many parameters, and the input distribution itself, influence the ordering time and the decay of fluctuations, different results may be obtained; for example, a small fraction of initial configurations containing topological kinks takes much longer to become ordered, so minimal, maximal, averaged, and inverse-averaged ordering times will deviate.
Figure 3: Time dependence (every tenth iterate shown) of the log root mean square fluctuations for different λ (−1.0, −0.5, 0.0, 0.5, 1.0). The same setup of a single run with γ = 1.0, η = 0.1, and 10 neurons is used. Each run starts with the same configuration, with random initial values between 0 and 1. For λ > 0, a quicker ordering is observed.
Figure 4: Fast learning using a simple switching strategy. Starting with λ = 1/2, ordering is achieved quickly. At iteration step 2000, λ is switched immediately to −1 (dotted). This speeds up the learning phase by two orders of magnitude compared to starting with λ = −1, and by a factor of 4 compared to λ = 0 (dashed, shown for comparison). If the duration of the initial ordering phase is underestimated, a long learning phase results (solid line; switch at step 200).
Figure 5: For µ ∈ [−1, +1], the common stability range is λ ∈ [−1, +1]. For λ < −1, the log root mean square of the weight vector differences w_r − w_{r−1} diverges, but extremely long, quiet transients are observed there. In the upper range λ > +1, making use of the diagonal term by setting µ ≠ 0 extends the stability range. The plots correspond to 10^7 (solid), 10^6 (dash-dotted), and 10^5 (dashed) iterations, respectively, for µ = 0. For 10^7 iterations, the cases µ = −1 (thin dots) and µ = +1 (thick dots) are shown. Parameters are γ = 1.0 and η = 0.1; 10 neurons are initialized near an equidistant chain with noise of amplitude 0.01 added.
If one instead investigates the time dependence of the fluctuations, a considerably quicker decay is observed for positive values of λ (see Figure 3), consistent with the observation by Kohonen (1991) mentioned above. These simulations indicate that for obtaining optimal magnification, the price of a longer learning phase may have to be paid. However, this drawback can be circumvented by combining the advantages of both λ ranges: using λ > 0 in the initial phase to speed up ordering, and switching to λ = −1 after a considerable decay of fluctuations (see Figure 4). No complicated time dependence of this parameter switch has been used, and neither the learning rate nor the neighborhood has been changed during the simulation. The last important issue to be addressed is the dependence of stability on the parameter λ, especially at the border λ = −1. Fortunately, the algorithm appears to be stable (in the 1D→1D case) in the whole range −1 ≤ λ ≤ +1, as shown in Figure 5. On both borders, the winner-relaxing learning remains
Figure 6: Entropy enhancement for the 2D→2D case for network geometries of 10 × 10 and 5 × 20 neurons. The data density was sin(πv₁) · sin(πv₂) within the unit square, γ = 5.0, and η was decreased from 0.01 to 0.001 during 10^6 learning steps. Alternatively, batch learning (over 100 steps) has been used; here, η was decreased from 0.05 to 0.001, and γ = 2.0 in the first 2 · 10^5 steps, when ordinary SOM learning was applied (γ = 5.0, λ = 0). In all cases, for λ = −1, the entropy is enlarged compared to the unmodified case λ = 0 and is close to the optimum (ln 100 = 4.605).
stable. Thus, the full range of magnification exponents between 1/2 and 1 can be achieved. In higher dimensions, no universal magnification law is expected, but one can evaluate the output entropy for a given input distribution and network. As shown in Figure 6, the enhancement of output entropy by winner-relaxing learning is effective also in the 2D case, although parameters have to be chosen more carefully.
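The output entropy reported in Figure 6 is computed from the neurons' relative winning frequencies. The sketch below uses made-up counts; it only illustrates that the equiprobabilistic case attains the optimum ln K (ln 100 ≈ 4.605 for a 10 × 10 map).

```python
import math

def output_entropy(counts):
    """Shannon entropy (in nats) of the neurons' relative winning frequencies."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

K = 100
uniform = [7] * K                  # every neuron fires equally often
skewed = [1] * (K - 1) + [901]     # one neuron dominates
assert abs(output_entropy(uniform) - math.log(K)) < 1e-9   # optimum ln K
assert output_entropy(skewed) < output_entropy(uniform)
```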
5 Discussion

After our first study (Claussen, 1992), Herrmann, Bauer, and Der (1995) introduced another modification of the learning process, which was also applied to the neural gas algorithm (Villmann & Herrmann, 1998). Their central idea is to use a learning rate η that is locally dependent on the
input probability density; thereby also an exponent of 1 can be obtained. As the input probability density should not be available to a neural map that self-organizes from stimuli drawn from that distribution, it is estimated from the actual local reconstruction mismatch (an estimate of the size of the Voronoi cell) and from the time elapsed since the neuron was last the winner. Both operations require additional memory and computation, and due to its estimating character, the learning rate has to be bounded in practical use. This localized learning was overall more easily applicable and overcame the stability problems of the early approach of conscience learning (DeSieno, 1988). Another systematic method, the extended maximum entropy learning rule, has been introduced by van Hulle (1997). It approximates a map of maximal output entropy for arbitrary dimension, although in higher dimensions the handling of the quantization regions becomes less practical (van Hulle, 1998). A quite different approach that is also capable of generating equiprobabilistic maps is kernel optimization (van Hulle, 1998, 2000, 2002); that is, the neighborhood kernel radii themselves become learning parameters, in addition to the weight vectors defining the kernel centers. Other approaches that also influence magnification consider the selection of the winner to be probabilistic, leading to elegant statistical approaches to potential functions, as given by Graepel et al. (1997) and Heskes (1999). As shown recently (Claussen & Villmann, in press), the winner-relaxing concept can also be transferred successfully to the neural gas, confirming the utility of this class of learning rules.
6 Conclusions

The Linsker, elastic net, and winner-relaxing Kohonen algorithms can be derived from an extremal principle, given by information theory, physical motivations, and reconstruction error, respectively. In this article, we have chosen the magnification law to indicate how closely an algorithm approaches the adaptation properties of a map of maximal mutual information. The magnification law is a quantitative property that is both accessible to neurobiological experiments and available as a quantitative control parameter of a neural map used as a vector quantizer in applications. A map of maximal mutual information uses all neurons with the same probability; that is, their firing rates will be equal. In this article, we have investigated the winner-relaxing approach to establish a new family of vector quantizers. The shift from the Kohonen (ρ = 2/3) to the winner-relaxing Kohonen algorithm (ρ = 4/7) seems marginal if the emphasis is laid on the existence of a potential function. If a large magnification exponent is desired, the winner-relaxing Kohonen algorithm (with λ = −1) combines simple computation with a magnification corresponding to maximal mutual information.
Acknowledgments

I thank H. G. Schuster for calling attention to the topic and for stimulating discussions.

References

Claussen, J. C. (1992). Selbstorganisation Neuronaler Karten. Diploma thesis, Kiel, Germany.
Claussen, J. C. (2003). Winner-relaxing and winner-enhancing Kohonen maps: Maximal mutual information from enhancing the winner. Complexity, 8(4), 15–22.
Claussen, J. C., & Schuster, H. G. (2002). Asymptotic level density of the elastic net self-organizing map. In J. R. Dorronsoro (Ed.), Artificial neural networks. Berlin: Springer.
Claussen, J. C., & Villmann, T. (in press). Magnification control in winner relaxing neural gas. Neurocomputing. Available at http://dx.doi.org/j.neucom.2004.01.191.
Dersch, D. R., & Tavan, P. (1995). Asymptotic level density in topological feature maps. IEEE Transactions on Neural Networks, 6, 230–236.
DeSieno, D. (1988). Adding a conscience to competitive learning. In Proc. ICNN'88, International Conference on Neural Networks (pp. 117–124). Piscataway, NJ: IEEE Service Center.
Durbin, R., & Willshaw, D. (1987). An analogue approach to the travelling salesman problem using an elastic net method. Nature, 326, 689–691.
Graepel, T., Burger, M., & Obermayer, K. (1997). Phase transitions in stochastic self-organizing maps. Phys. Rev. E, 56, 3876–3890.
Herrmann, M., Bauer, H.-U., & Der, R. (1995). Optimal magnification factors in self-organizing maps. In F. Fogelman-Soulie & P. Gallinari (Eds.), Proceedings of the International Conference on Artificial Neural Networks (pp. 75–80). Paris: EC2 et Cie.
Heskes, T. (1999). Energy functions for self-organizing maps. In E. Oja & S. Kaski (Eds.), Kohonen maps. Amsterdam: Elsevier.
Kohonen, T. (1982). Self-organized formation of topologically correct feature maps. Biological Cybernetics, 43, 59–69.
Kohonen, T. (1991). Self-organizing maps: Optimization approaches. In T. Kohonen, K. Makisara, O. Simula, & J. Kangas (Eds.), Artificial neural networks. Amsterdam: North-Holland.
Linsker, R. (1989). How to generate ordered maps by maximizing the mutual information between input and output signals. Neural Computation, 1, 402–411.
Obermayer, K., Blasdel, G. G., & Schulten, K. (1992). Statistical-mechanical analysis of self-organization and pattern formation during the development of visual maps. Phys. Rev. A, 45, 7568–7589.
Plumbley, M. D. (1999). Do cortical maps adapt to optimize information density? Network, 10, 41–58.
Ritter, H. (1991). Asymptotic level density for a class of vector quantization processes. IEEE Trans. Neural Networks, 2, 173–175.
Ritter, H., Martinetz, T., & Schulten, K. (1992). Neural computation and self-organizing maps: An introduction. Reading, MA: Addison-Wesley.
Ritter, H., & Schulten, K. (1986). On the stationary state of Kohonen's self-organizing sensory mapping. Biol. Cybernetics, 54, 99–106.
Ritter, H., & Schulten, K. (1988). Convergence properties of Kohonen's topology conserving maps: Fluctuations, stability and dimension selection. Biological Cybernetics, 60, 59–71.
van Hulle, M. M. (1997). Nonparametric density estimation and regression achieved with topographic maps maximizing the information-theoretic entropy of their outputs. Biological Cybernetics, 77, 49–61.
van Hulle, M. M. (1998). Kernel-based equiprobabilistic topographic map formation. Neural Computation, 10, 1847–1871.
van Hulle, M. M. (2000). Faithful representations and topographic maps. New York: Wiley.
van Hulle, M. M. (2002). Kernel-based topographic map formation by local density modeling. Neural Computation, 14, 1561–1573.
Villmann, T., & Herrmann, M. (1998). Magnification control in neural maps. In Proc. European Symposium on Artificial Neural Networks, 1998 (pp. 191–196). Bruges: D-Facto.
Received August 20, 2002; accepted November 1, 2004.
LETTER
Communicated by DeLiang Wang
Image Segmentation by Networks of Spiking Neurons

Joachim M. Buhmann
[email protected]
Tilman Lange
[email protected]
Swiss Federal Institute of Technology, CH-8092 Zurich, Switzerland
Ulrich Ramacher
[email protected]
Infineon Technologies, D-81739 München, Germany
A network of leaky integrate-and-fire (IAF) neurons is proposed to segment gray-scale images. The network architecture, with local competition between neurons that encode segment assignments of image blocks, is motivated by a histogram clustering approach to image segmentation. Lateral excitatory connections between neighboring image sites yield a local smoothing of segments. The mean firing rate of class membership neurons encodes the image segmentation. A weight modification scheme is proposed that estimates segment-specific prototypical histograms. The robustness properties of the network implementation make it amenable to an analog VLSI realization. Results on synthetic and real-world images demonstrate the effectiveness of the architecture.

1 Introduction

The task of segmenting a given image into disjoint (homogeneous) areas or segments is often considered an important yet difficult task in computer vision. Image segmentation provides the common basis for solving many subsequent computer vision problems such as object recognition and scene analysis. Segmented images are significantly reduced in complexity, which might be measured by coding lengths, but at the same time they retain a large amount of information on objects in the image and their mutual relations. The most competitive models for image segmentation are currently formulated as high-level probabilistic models of pixel assignments to groups (Puzicha, Hofmann, & Buhmann, 1999; Tu, Zhu, & Shum, 2002). The inference essentially amounts to finding minima of the corresponding energy functional. This strategy is advantageous, since stable states correspond to (local) minima of the objective function. The optimization, however, might be infeasible for practical applications depending on the model. Furthermore, the biological plausibility of these models is tentative since the

Neural Computation 17, 1010–1031 (2005)
© 2005 Massachusetts Institute of Technology
popular claim that biological neural networks are statistical inference machines involves a large amount of abstraction. Bridging the gap between biologically plausible models of low and intermediate vision and abstract probabilistic models for image analysis is required for two reasons: (1) a testable model in biological science has to be sufficiently close to physiological neurons so that theoretical predictions are amenable to experimental tests in living animals (Koch, 1999), and (2) biological neural networks might motivate new designs of hardware and software solutions for vision problems, for example, in analog VLSI. This article is motivated by both lines of thought, although our research effort was mainly driven by the second goal.

Our proposal employs a network of spiking neurons for image segmentation. Although a similar implementation in a rate model is also possible, the use of spiking neurons promises high relevance for biological systems and, furthermore, might be more flexible for computer vision applications. Starting with a highly competitive probabilistic model for image segmentation called the histogram clustering model (HCM), which has been successfully applied to color and texture analysis of images (Puzicha et al., 1999), we propose a neuronal version of this segmentation model. Section 2 introduces the model, shows how it can be applied to image segmentation, and extends it by a smoothness constraint based on a local neighborhood system. The introduction of these lateral relationships leads to significantly increased computational complexity compared to the original model. The basic architecture of the derived network implementation is presented in section 3, where the smoothness requirement is taken into account. Finally, section 4 provides segmentation results for artificial and real images and compares them to the deterministic annealing implementation of the unconstrained histogram clustering model.
Alternative neural network architectures that follow a similar line of thought are discussed in von der Malsburg and Buhmann (1992) and Wang and Terman (1997). Locally excitatory, globally inhibitory oscillator networks (LEGION), introduced in Wang and Terman (1997), consist of locally coupled integrate-and-fire (IAF) neurons. Here, pairwise correlations are used in order to achieve grouping. From a conceptual point of view, this network design is close to pairwise Markov random fields (MRF) used in image grouping. The LEGION network has been successfully applied to image segmentation problems. However, it appears to be generally applicable only if an algorithmic abstraction as developed in Campbell, Wang, and Jayaprakash (1999) is employed. LEGION combined with an MRF-based preprocessing has also been applied for the purpose of texture segmentation in Cesmeli and Wang (2001). In Wersing, Steil, and Ritter (2001), the competitive-layer model (CLM) has been employed for perceptual grouping. Using different layers, each representing a different class, pairwise interactions encode whether an object does or does not belong to a certain group. This model is closely related to the approach presented here and—with
lateral connection weights chosen accordingly—to pairwise data clustering (Hofmann & Buhmann, 1997).

2 The Histogram Clustering Model

The neural network for image segmentation (see section 3) is motivated by the histogram clustering model (HCM) as proposed in Pereira, Tishby, and Lee (1993) and Puzicha et al. (1999), which explains the observed data by prototypical distributions on the feature space X, one prototype distribution for each designated cluster. Experimental evaluations (e.g., in Puzicha et al., 1999) demonstrate the competitive performance of histogram clustering and its superiority to pure pairwise clustering approaches as well as other graph cut methods. This observation motivates the neural implementation of the model discussed next. For the HCM, the input data are composed of histograms h_{i,j} that count the co-occurrence of object i ∈ {1, ..., N} and feature j ∈ {1, ..., M}. Hence, for each object (each image site for the segmentation), a histogram h_i := (h_{i,j})_{1≤j≤M} is defined as a tuple h_i ∈ ℕ^M. M denotes the number of features, and n_i denotes the number of observations belonging to object i, that is, n_i := Σ_j h_{i,j}. Then, by normalizing h_{i,j}, one arrives at empirical conditional probabilities

$$\hat p(j|i) := \frac{h_{i,j}}{n_i}, \qquad (2.1)$$
so that p̂_i := (p̂(j|i))_{1≤j≤M} estimates the probability of observing feature j given object i. To segment an image, we decompose it into N not necessarily disjoint image patches (see Figure 1) of equal size, where n_i is the number of pixels in image patch i. The central pixel of a patch is called a site. The sites can be thought of as pixels in a subsampled version of the original image. These sites are the objects considered in the HCM. Histograms are formed from the gray values of each image patch to provide information about each site, where each histogram consists of M bins. These bins represent the features in the segmentation problem and reflect the composition of intensity values within an image patch: an intensity value x ∈ {0, 1, ..., 255} increases the count in the bth bin by one, where b = ⌊Mx/256⌋ + 1. The goal of image segmentation is to partition the set of image sites, based on the features obtained from histogramming, into K disjoint classes, called segments, where the number of groups K is supposed to be known a priori. In the HCM, segments are represented by prototypical distributions q_ν = (q(j|ν))_{j∈{1,...,M}}, where ν ∈ {1, ..., K}. Here, q(j|ν) can be interpreted as the probability of observing feature j with an object of class (segment) ν, and thus q_ν is a class-conditional distribution. The HCM can be formulated as
Figure 1: Schematic view of the image patches and the histogram generation process.
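A minimal sketch of this histogram generation step (the function name, patch size, and default number of bins M are illustrative choices, not taken from the article) could look as follows:

```python
import numpy as np

def patch_histogram(patch, M=32):
    """Empirical feature distribution p_hat(.|i) of one gray-value patch (eq. 2.1).

    patch: 2D array of intensities x in {0, ..., 255}
    M:     number of histogram bins (features)
    """
    x = np.asarray(patch).ravel()
    b = (M * x) // 256 + 1                       # bin index b = floor(M*x/256) + 1
    counts = np.bincount(b - 1, minlength=M)     # bin counts h_{i,j}, j = 1..M
    return counts / counts.sum()                 # normalize by n_i (eq. 2.1)

rng = np.random.default_rng(0)
patch = rng.integers(0, 256, size=(8, 8))
p = patch_histogram(patch)
assert p.shape == (32,) and abs(p.sum() - 1.0) < 1e-12
assert patch_histogram([[0]], M=32)[0] == 1.0    # x = 0 falls into bin b = 1
assert patch_histogram([[255]], M=32)[-1] == 1.0 # x = 255 falls into bin b = M
```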
the result of a maximum a posteriori (MAP) estimation for a generative model, which is characterized by the following sampling procedure:

1. Select object i with (a priori) probability Pr({i}).
2. Suppose i belongs to class ν. Then select feature j with probability q(j|ν).

For image segmentation, the prior probabilities of sites are uniform, that is, Pr({i}) := 1/N. Hence, the probability of observing feature j when object i belongs to class ν is q(j|ν)/N when generated according to the above sampling procedure. Here, the sets of unknown parameters are (1) the assignments of image sites to segments, denoted by c : {1, ..., N} → {1, ..., K}, and (2) the prototypical distributions q_1, ..., q_K, both of which have to be estimated from the data. Considering the log posterior of the unknown parameters given the data (in the form of the histograms p̂_i), we derive the negative log likelihood

$$-\mathcal{L}(c, q_1, \ldots, q_K \,|\, \hat p_1, \ldots, \hat p_N) = \mathrm{const} - \sum_{1 \leq i \leq N} \sum_{1 \leq j \leq M} \hat p(j|i)\, \log q(j|c(i)), \qquad (2.2)$$
which has to be minimized. Replacing the constant term with the negative empirical entropy Σ_i Pr({i}) Σ_j p̂(j|i) log p̂(j|i), we obtain the objective function

H(c, {q}) := Σ_{1≤i≤N} D_KL(p̂_i ‖ q_c(i)).   (2.3)
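As a concrete check of equations 2.2 to 2.4, the following sketch (with made-up toy histograms and prototypes; not the authors' code) evaluates the cost function H and confirms that it differs from the negative log likelihood only by the assignment-independent empirical entropy term:

```python
import numpy as np

def kl(p, q):
    """Discrete Kullback-Leibler divergence D_KL(p || q), equation 2.4."""
    return float(np.sum(p * np.log(p / q)))

def hcm_cost(P, Q, c):
    """H(c, {q}) = sum_i D_KL(p_hat_i || q_c(i)), equation 2.3."""
    return sum(kl(P[i], Q[c[i]]) for i in range(len(P)))

P = np.array([[0.8, 0.1, 0.1],            # p_hat_i: made-up site histograms
              [0.1, 0.8, 0.1],
              [0.7, 0.2, 0.1]])
Q = np.array([[0.75, 0.15, 0.10],         # q_nu: made-up prototypes
              [0.10, 0.80, 0.10]])
c = np.array([0, 1, 0])                   # site-to-segment assignments c(i)

H = hcm_cost(P, Q, c)
neg_log_lik = -np.sum(P * np.log(Q[c]))   # data term of equation 2.2
entropy = -np.sum(P * np.log(P))          # assignment-independent constant
# H equals the negative log likelihood minus the empirical entropy term,
# so minimizing H over (c, {q}) is equivalent to maximizing the likelihood
```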
The argument that minimizes the costs H(c, {q}) represents the desired segmentation of the N image sites. Here, D_KL denotes the Kullback-Leibler divergence between (discrete) probability distributions,

D_KL(p ‖ q) := Σ_{1≤j≤M} p_j log(p_j / q_j),   (2.4)
for p = (p_1, . . . , p_M) and q = (q_1, . . . , q_M). Note that this model can be extended to include a local smoothness constraint. Suppose there is a (local) neighborhood N_i defined for each site i. By adding the term λ Σ_i Σ_{j∈N_i} 1{c(i) ≠ c(j)} to the cost function defined in equation 2.3, one can penalize segmentations with high local variability. The multiplier λ ∈ R_+, which can be considered a Lagrange parameter in a constrained optimization problem, controls the amount of penalization. The modified cost function

H′(c, {q}) := H(c, {q}) + λ Σ_i Σ_{j∈N_i} 1{c(i) ≠ c(j)}   (2.5)
represents the model that we attempt to implement with spiking neurons. It usually results in improved segmentations (due to the smoothness) in comparison to those obtained by minimizing H. In Puzicha et al. (1999), a deterministic annealing approach to minimizing the cost function H has been proposed that employs the Gibbs distribution for the inference of robust solutions in every iteration. Note that the Gibbs distribution factorizes for H, thereby enabling relatively efficient operation. However, for the modified model H′, the Gibbs distribution no longer has a factorial form,1 which renders annealing approaches to exact optimization impractical, since the Gibbs distribution is hard to compute. The experimental section of Puzicha et al. (1999) provides experimental evidence for the increased computational complexity when couplings between assignments

1 To be precise, the Gibbs distribution is considered over the set of all possible clustering solutions {c}, that is, P(c) = (1/Z) exp(−βH(c)) for a cost function H. The crucial point here is the normalization constant Z := Σ_c exp(−βH(c)) (the partition function), which is a sum over exponentially many solutions. For the HCM, this sum has a simple factorial form. However, for the constrained problem, there is no simple factorization, thereby making the exact computation of the Gibbs distribution intractable.
are introduced. The network-based approach aims at overcoming this issue without resorting to independence assumptions as used, for example, in mean-field approximations (Opper & Saad, 2001). For the experiments in section 4, we have employed the deterministic annealing approach in conjunction with the unconstrained model in order to compare its results with those obtained with the network implementation, described next.

3 A Neuronal Implementation of the Histogram Clustering Model

This section describes the network of spiking neurons that is employed for the purpose of image segmentation. We start by introducing the underlying leaky integrate-and-fire neuron model (Koch, 1999; Gerstner, 1998, 2002).

3.1 The Neuron Model. At a fixed time t, each neuron k is characterized by its internal state s_k(t), its constant threshold θ_k, and its output y_k(t) ∈ R. The spike duration τ_k is assumed to be constant for all neurons (τ_k := τ := 1) and is measured in milliseconds. The time at which neuron k fires for the fth time is denoted by t_k^(f). Let t_k^(1), . . . , t_k^(n) be the firing times of neuron k if it has fired n times. The state of neuron k before firing, a fictitious neuron potential, is determined by the first-order ordinary differential equation (ODE)

ds_k(t)/dt = −s_k(t)/ρ + A_k,   (3.1)

for t > t_k^(n) + τ (i.e., t larger than the last time neuron k stopped firing) and s_k(t) < θ_k (i.e., the internal state is still below the threshold θ_k). Here, A_k = Σ_{l∈Γ_k} w_{lk}(t) y_l(t) denotes the activation of neuron k, where Γ_k is the set of all presynaptic neurons of k and w_{lk} ∈ R is the weight or strength of the connection between the presynaptic neuron l and the postsynaptic neuron k. When the internal state s_k(t) reaches the threshold, it is reset to s_k(t) = 0 and the neuron fires: y_k(t) describes a pulse of width τ for t ∈ [t_k^(f), t_k^(f) + τ], 1 ≤ f ≤ n. In our model, neurons do not accumulate incoming signals while they emit a pulse themselves, and the internal state s_k(t) of neuron k is reset to zero immediately after firing. Furthermore, the term −s_k(t)/ρ in equation 3.1 forces the neuron to forget prior excitation if the incoming stimulus (the activation A_k) is not strong enough. In the following, t measures the model time (in ms).

3.2 The Network Architecture. As in the HCM (see section 2), we consider image sites as the objects to be grouped. In the case of segmentation,2
We adopt here the usual clustering nomenclature. What we call objects here are often denoted as features in the neural networks literature where the clustering problem is also called feature binding problem.
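A discrete-time (Euler) simulation of the neuron model of equation 3.1 can be sketched as follows; the parameter values mirror those quoted in the experimental section, while the constant activation A is an illustrative simplification:

```python
def simulate_neuron(A, rho=3.5, theta=3.0, tau=1.0, dt=0.05, t_end=50.0):
    """Leaky integrate-and-fire neuron: ds/dt = -s/rho + A (equation 3.1).

    The state is reset to 0 at the threshold theta; during a pulse of
    width tau the neuron ignores its input. Returns the firing times."""
    s, t = 0.0, 0.0
    refractory_until = -1.0
    spikes = []
    while t < t_end:
        if t >= refractory_until:         # no accumulation while pulsing
            s += dt * (-s / rho + A)      # Euler step of the ODE
            if s >= theta:                # threshold reached: fire and reset
                spikes.append(t)
                s = 0.0
                refractory_until = t + tau
        t += dt
    return spikes

low = simulate_neuron(A=1.5)
high = simulate_neuron(A=3.0)             # stronger activation fires more often
```

Since the state saturates at s → ρA, an activation with ρA < θ never triggers a spike, which is the forgetting behavior described in the text.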
Figure 2: The basic architecture for an image site.
each object i is characterized by (p̂(j|i))_{1≤j≤M}, the empirical probabilities of observing a pixel of a certain gray value belonging to the jth bin in the image patch around site i. For the purpose of neural segmentation, we use for each site i a building block consisting of K + M neurons (see Figure 2). The M input neurons fire in each simulation step with frequency proportional to p̂(j|i). Each of the M input neurons is connected to all K segment-membership neurons. The idea is that the firing patterns should reflect the dissimilarity of the prototypical feature probabilities q(j|1), . . . , q(j|K) to the empirical feature frequencies p̂(j|i), as measured by D_KL(p̂_i ‖ q_ν), without explicitly relying on the prototypical distributions q_ν. The νth of the K segment-membership neurons is supposed to be active if image site i should belong to the νth segment. Suppose the weight between an input neuron j for site i and the ν-segment-membership neuron is chosen approximately as

w_{j,ν}^{(i)}(t) ≈ w_i^max + log q(j|ν)   (3.2)

for sufficiently large times t; then the weights w_{j,ν}^{(i)}(t) will reflect the KL divergence. The weight learning (see section 3.3) adapts the weights in this direction. The weights are time dependent because the weight learning employs the current state of the network. It should be noted that the prototypical distributions are not necessary for the computation: we only need weights and an adaptation scheme. w_i^max denotes a suitable constant that ensures positivity. In fact, it is the maximal synaptic strength (since log q(j|ν) ≤ 0), and it also determines the minimal interspike interval through the synaptic dynamics in equation 3.1. Technically, w_i^max can be
seen as a constant input current fed into each ν-segment-membership neuron. Clearly, w_i^max ≤ θ_{i,ν}, where θ_{i,ν} is the threshold of the ν-segment-membership neuron in site i, in order to avoid trivial activity at the maximal rate. Hence, the closer the input stimulus and a prototypical value are, the larger the input to the corresponding class-representing neuron will be. With an appropriate weight learning, the input to the segment-membership neurons reflects the per object costs (i.e., the per site costs) if site i is assigned to segment ν. This relation can be seen from equations 2.2 and 3.2. In particular, for appropriately chosen weights, the total input will reflect w_i^max − D_KL(p̂_i ‖ q_ν); hence, it will be the larger, the more similar p̂_i is to q_ν (i.e., the smaller D_KL(p̂_i ‖ q_ν) is), which is exactly the desired adaptation behavior. In addition to the excitatory stimuli, the segment-membership neurons are connected by inhibitory synapses that realize a (soft) winner-takes-all (WTA) mechanism: the segment-membership neuron ν with the largest input over several epochs is most likely to fire, while the other neurons are potentially kept from firing. Ideally, the winner neuron is the one for which the corresponding cluster prototype q(j|ν) has the highest similarity to p̂(j|i). This WTA mechanism is related to the inhibition scheme employed in Wersing et al. (2001), where lateral inhibition is also used for segment-coding neurons. The corresponding weights in our implementation are denoted by w_{µ,ν} (set to −2/3 for the experiments); they are time independent and not affected by the learning of weights. However, these weights can be made dependent on the current state of the network; experiments indicate that this dependence leads to slightly improved segmentations. A weight adaptation scheme similar to that for w_{j,ν}^{(i)} (see below) can be employed for this purpose. We refer to the local neuronal architecture of a single site i as a block.
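The claim that the total input reflects w_i^max − D_KL(p̂_i ‖ q_ν) can be verified numerically. In the sketch below (made-up distributions; input rates taken to equal p̂(j|i)), the input additionally differs by the entropy of p̂_i, which is constant across segments ν and therefore does not affect the comparison between segment neurons:

```python
import numpy as np

p_hat = np.array([0.4, 0.3, 0.2, 0.1])         # empirical histogram of site i
q_nu = np.array([0.35, 0.30, 0.25, 0.10])      # prototype of segment nu
w_max = 3.0

weights = w_max + np.log(q_nu)                 # relation 3.2
total_input = float(np.dot(weights, p_hat))    # sum_j w_j,nu^(i) * p_hat(j|i)

kl = float(np.sum(p_hat * np.log(p_hat / q_nu)))
entropy = float(-np.sum(p_hat * np.log(p_hat)))
# total_input = w_max - entropy - kl: up to the segment-independent entropy
# of p_hat_i, the input indeed reflects w_max - D_KL(p_hat_i || q_nu)
```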
There exist N blocks, one for each site. In order to smooth the neuronal segmentation, the architecture favors local cooperation between neurons coding for the same segment and competition between neurons of different segments. The standard four-neighborhood N_i of image sites around a site i is employed: neurons of the same segment layer are excitatorily connected, and neurons of different segment layers inhibit each other. Clearly, larger neighborhoods are possible, essentially enforcing smoother segmentations. The weights between segment-membership neurons in neighboring blocks are chosen to be time independent and are denoted by w_{µ,ν}^{(i,l)} (set to −(1/3)(1 − 2δ_{µ,ν}) in the experiments) for lateral connections between neuron µ from block i and neuron ν from block l. It should be noted that these connection strengths implicitly regulate the amount of penalization of nonsmooth segmentations and thereby determine the Lagrange parameter λ in equation 2.5. Technically, such connections are realized as follows: for each site, the neuron representing membership to segment ν receives inhibitory stimuli from all membership neurons for segments µ ≠ ν of each block in the four-neighborhood and excitatory stimuli from those for the same segment ν of each block in this four-neighborhood.
To sum up, the total activation exciting the ν-segment-membership neuron for site i is described by

A_ν^{(i)} = Σ_j w_{j,ν}^{(i)}(t) y_j^{(i)}(t) + Σ_{µ≠ν} w_{ν,µ} y_µ^{(i)}(t) + Σ_{l∈N_i} Σ_µ w_{µ,ν}^{(i,l)} y_µ^{(l)}(t).   (3.3)
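Equation 3.3 translates directly into code. This sketch (toy sizes, hypothetical neuron outputs) uses the lateral weights quoted above, w_{ν,µ} = −2/3 within a block and w_{µ,ν}^{(i,l)} = −(1/3)(1 − 2δ_{µ,ν}) between neighboring blocks:

```python
import numpy as np

K, M = 3, 4                                    # segments, features (toy sizes)

def activation(nu, w_in, y_in, y_seg, neighbors):
    """Total activation A_nu^(i) of equation 3.3 for one site i.

    w_in:  (M, K) input weights w_j,nu^(i)
    y_in:  (M,)   input-neuron outputs y_j^(i)
    y_seg: (K,)   segment-neuron outputs y_mu^(i) of the same block
    neighbors: list of (K,) segment-neuron outputs y^(l), l in N_i
    """
    a = float(y_in @ w_in[:, nu])              # excitatory input-layer term
    a += sum(-2.0 / 3.0 * y_seg[mu]            # within-block WTA inhibition
             for mu in range(K) if mu != nu)
    for y_l in neighbors:                      # lateral smoothing term
        for mu in range(K):
            w = -1.0 / 3.0 * (1.0 - 2.0 * (mu == nu))
            a += w * y_l[mu]
    return a

rng = np.random.default_rng(1)
w_in = rng.random((M, K))
y_in = rng.random(M)
y_seg = np.zeros(K)
neighbors = [np.array([1.0, 0.0, 0.0])] * 4    # all neighbors firing for seg 0
a0 = activation(0, w_in, y_in, y_seg, neighbors)
a1 = activation(1, w_in, y_in, y_seg, neighbors)
```

With all four neighbors firing for segment 0, the lateral term contributes +4/3 to A_0^{(i)} and −4/3 to A_1^{(i)}, biasing the site toward the segment of its neighbors.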
The resulting ODE for the ν-segment-membership neuron is ds_ν^{(i)}(t)/dt = −s_ν^{(i)}(t)/ρ + A_ν^{(i)}. In equation 3.3, the first sum adds up the excitatory stimulus from the input layer, y_j^{(i)}(t), 1 ≤ j ≤ M, weighted by w_{j,ν}^{(i)}(t) (see equation 3.2). The second sum in equation 3.3 represents the inhibitory stimuli from the other segment-membership neurons µ ≠ ν of block i. Finally, the last double sum arises from the excitatory and inhibitory links from the neighboring blocks. The ν-segment-membership neuron in site i fires when its internal state reaches the threshold θ_{i,ν}. A (hard) assignment of image sites i to segments ν is finally obtained by considering the firing rates (i.e., the number of firing times normalized by the maximally possible number of firing times within the simulation time) of the segment-membership neurons. Note that we determine the segment assignment by means of a rate code. At first glance, the timing of spikes does not play any role for the rate code that we employ. In fact, timing becomes crucial through the term −s_k(t)/ρ, which erases prior excitation by forgetting, and through the lateral connections that realize the soft WTA mechanism and the lateral smoothing. Furthermore, we observe synchrony of segment-membership neurons coding for the same class. This is investigated in the experimental section.

3.3 Learning of Weights. For the HCM, K · (M − 1) parameters have to be learned: the prototypical probabilities q(j|ν), ν ∈ {1, . . . , K}, j ∈ {1, . . . , M}. These probabilities can be constructed from a given assignment of image sites to image segments. In contrast to the explicit calculation of prototypical probabilities by Bayesian inference, we define a network dynamics in which these quantities are iteratively estimated on the basis of local input statistics. The network implementation has K · N · M free parameters: the weights between input neurons and segment-membership neurons. The weights w_{j,ν}^{(i)} are randomly initialized in most simulation experiments, and as a consequence, the initial assignments are random. The aim of the learning rule is to shift the weights in a direction such that Σ_j w_{j,ν}^{(i)}(t) y_j^{(i)}(t) reflects how well the ith site fits into segment ν. Based on the current firing rate of each segment-membership neuron within a block (∼ site), a winner neuron can be determined, yielding a temporary assignment of the site to the winner segment. Let m_{i,ν} = 1 iff segment ν is the winner segment in the block corresponding to site i, and m_{i,ν} = 0 otherwise. Denote by h_{j,ν} the arithmetic mean of the probabilities for feature j of all sites that are temporarily assigned to segment ν, that
is, h_{j,ν} = Σ_i p̂(j|i) m_{i,ν} / Σ_i m_{i,ν}. h_{j,ν} can be seen as the current estimate of q(j|ν). It should be noted that weighted sums can be computed in a network of spiking neurons (Ruf, 1998). Thus, considering h_{j,ν} does not pose a conceptual problem, as is also demonstrated in Wersing et al. (2001), where a similar averaging quantity is involved in the learning process. The idea underlying the proposed weight adaptation scheme originates from the observation that the prototypical distributions q(j|ν) in the HCM ideally represent the average probability of observing feature j given segment ν, which is encoded by the histogram h_{j,ν}. In the space of distributions, one therefore obtains

d q(j|ν)/dt = α (h_{j,ν} − q(j|ν))   (3.4)

as an update rule, where α is a suitable learning rate. For updating the weights between input neurons j and segment-membership neurons ν for site i, we propose the following rule, which can be derived from equations 3.2 and 3.4:

d w_{j,ν}^{(i)}/dt = α ( exp(w_i^max − w_{j,ν}^{(i)}(t)) h_{j,ν} − 1 ),   (3.5)
where α ∈ [0, 1] is a time-independent learning rate that is usually small (i.e., α ≈ 0.01); it determines how quickly the weights are adapted. Note that exp(w_i^max − w_{j,ν}^{(i)}(t)) ≈ 1/q(j|ν) holds due to relation 3.2. The learning rule depends on the ratio of the pre- and the postsynaptic activity of all neurons coding for the same segment. The mechanism employed here is similar to the learning approaches used in competitive learning (Intrator, 2002) and supports the following statistical interpretation: when the likelihood ratio h_{j,ν}/q(j|ν) between the measurements and the inferred prototypical feature distribution exceeds one, that is, the learned feature probabilities are too small to match the measurements, then the weights w_{j,ν}^{(i)} should be increased to raise q(j|ν). By using equation 3.5, the weight dynamics are regulated in such a way that the likelihood ratio fluctuates around one, which yields a self-stabilizing learning dynamics. The value h_{j,ν}, which stores a reference value for the average feature probability given segment ν, might raise questions about the biological plausibility of this quantity in the context of neural networks. In essence, it represents a population average (an average over the outputs of the input neurons, both in time and space). It is clear that globally distributed information has to be collected in the network to represent the segment statistics. Such information can be computed within the network (similar to Maass, 1998; see also Ruf, 1998) and thus can be read out, for example, by adding neurons for each segment ν and each feature j to the architecture.
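The self-stabilizing character of rule 3.5 can be illustrated by iterating it with a fixed, made-up estimate h_{j,ν} (Euler steps): the weights converge to w^max + log h_{j,ν}, that is, to relation 3.2 with q(j|ν) replaced by its current estimate:

```python
import numpy as np

w_max, alpha, dt = 3.0, 0.01, 1.0
h = np.array([0.5, 0.3, 0.2])              # fixed estimates h_j,nu (made up)
w = np.zeros(3)                            # initial weights

for _ in range(20000):
    # equation 3.5: dw/dt = alpha * (exp(w_max - w) * h - 1)
    w += dt * alpha * (np.exp(w_max - w) * h - 1.0)

# fixed point: exp(w_max - w) * h = 1  =>  w = w_max + log(h),
# so the likelihood ratio h / q fluctuates around one, as described
target = w_max + np.log(h)
```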
3.4 Localization of Information Processing. The process of learning involves information gathered from all sites belonging to the same segment. This can be considered disadvantageous in terms of biological plausibility as well as for hardware implementability. We have translated the system to a localized scheme without resorting to pure pairwise interactions; the latter would imply a CLM-like architecture as proposed in Wersing et al. (2001) and thereby would essentially amount to a pairwise clustering setting (Hofmann & Buhmann, 1997). We sketch the localization scheme: the set of neighbors of site i mentioned above is used for local smoothing. For the localization, a second set of neighbors attached to each site, denoted by M_i, is considered. Regarding the energy functions of section 2, the additional neighborhood system affects only the HCM part H of the cost function in equation 2.5. To be precise, we approximate H by

Σ_i D_KL(p̂_i ‖ q_c(i)) ≈ Σ_i D_KL(p̂_i ‖ q_{c(i)}^{(M_i)}),   (3.6)

where q_ν^{(M_i)} represents a local centroid that depends only on the elements of M_i and the given assignment c. Again, we want to stress that this provides only the intuition for the approach taken. In practice, q_ν^{(M_i)} is not used in the network during propagation. In fact, the introduction of the additional neighborhood system M_i affects only the weight learning rule: instead of computing h_{j,ν} as an average over all sites assigned to segment ν, only those in the vicinity M_i of site i are taken into consideration. This leads to a local estimate h_{j,ν}^{(M_i)}. Clearly, if M_i = {1, . . . , N} for all i, we recover the original cost function H. The neighborhood system therefore regulates the amount of localization. In Geman, Geman, Graffigne, and Dong (1990), a few (randomly chosen) long-range connections are used in order to break permutation symmetries, that is, to make the labeling unique. For our purpose, the inclusion of such long-range interactions in M_i also ensures that weights for classes that potentially do not occur in a small neighborhood around the site of interest can still be learned reasonably. Therefore, the neighborhood used (solely) for learning mainly contains local neighbors of the current site i, in particular N_i ⊂ M_i, and a few long-range connections. Experimentally, we have observed that the localized scheme leads to satisfying results, even for small neighborhoods M_i, as demonstrated in Figure 3. In this example, M_i consisted of the four-neighborhood N_i and randomly chosen connections to 10% of the remaining sites; a 32 × 32 grid of sites was chosen. The major drawback of the localized scheme now becomes apparent: the potentially decreased convergence speed. Note that this is a difficult example for the localized version, since information about the other segment might be available only for sites at the segment borders. Nevertheless, the localized scheme renders possible a
Figure 3: Localized versus nonlocalized information processing: The localized scheme achieves similar results but needs much more time to reach a stable state. This is, however, a difficult example for the localization (see the text). The time t refers to the model time (not to the number of simulation steps).
simple, robust, and essentially local learning rule, which is desirable for an analog VLSI implementation. We conclude that localized information processing is possible in principle but might significantly decrease convergence speed.

4 Experimental Results

We now provide some experimental results that shed light on the properties of the proposed architecture and its usefulness for image segmentation. We have implemented a discrete time simulation with time steps Δt := 0.05. Further parameters of the simulations are ρ := 3.5, τ := 1, θ := 3, and w^max := 3. States have been randomly initialized, and state updates are subject to small additive noise. It should be noted that this randomness reflects noisy environments as found, for example, in analog VLSI implementations. Learning starts after t = 2 in order to allow the network architecture to stabilize itself in its start configuration. The number of segments was chosen to be K := 5 in all experiments. For the histogramming, M := 12 bins were used. The grid used to generate image patches and sites consists of 64 × 64 nonoverlapping cells (as schematically depicted in Figure 1).

4.1 Application to Artificial Images. We apply the segmentation architecture to two artificial images (see Figure 4): an image consisting of simple geometric objects (a square and two disks) and a textured image composed of five textures. The network activity is monitored by measuring the average firing rate of the segment neurons for a particular site i and assigning this site to the segment represented by the neuron with the highest activity. The segment labels are gray value coded in the following images. For the first image (see Figure 4A), one would expect that all (four) objects are detected as segments of their own. In fact, the HCM recovers this structure from the data as expected and introduces a fifth segment defined as the border around the lower disk-like object. The line-like fifth segment became
Figure 4: Artificial (input) images used for the experiments.
possible since we have chosen five rather than four potential segments and the HCM does not invoke a model order selection procedure. The spiking neuron network finds a segmentation that is close to the ground truth (and thus to the HCM result; see Figure 5). For comparing results, we have plotted the HCM result together with the current network segmentation and a difference picture in which black codes for a difference between the segmentations. For this image, we have rerun the simulation with K = 2 and monitored the mean activity of the segment-membership neurons (see Figure 6A). One clearly recognizes class-specific activity patterns that are shifted in phase. Therefore, neurons coding for the same class appear to synchronize, while neurons coding for different classes desynchronize to a certain extent. Figures 7A and 7B show the firing rates and the spikes during simulation of the class-membership neurons in a specific site. Two observations are noteworthy: the winner class in fact suppresses the other class, supported by the learning of weights, and the firing rate can be considered an assignment probability. The second example, the image composed of five different textures (see Figure 4B), represents a much harder problem since the features (the histograms in this case) do not necessarily capture the difference between two textures. The notion of texture refers to the occurrence of repeated patterns in an image. The ability of the method to identify distinct textures based on local histograms largely depends on the kind of texture and on the locality of the histograms. Other features, such as Gabor filter responses, are more suitable for texture segmentation since they are more discriminative than gray value histograms (Hofmann, Puzicha, & Buhmann, 1998). For
Figure 5: Segmentation results for an artificial image. Assignments (left: HCM; middle: network; right: difference picture) are shown for several time steps. The HCM segmentation has been included for comparison. (A) At t ≈ 10, the segments are roughly recognized by the network. (B) At t ≈ 95, the network has reached a stable state. Here, the assignments do not change anymore.
Figure 6: (A) Mean activity of class membership neurons for the first toy image in Figure 4A. (B) The misclassification rate of the network with regard to the HCM segmentation for six simulation runs with the railway station image: The plot demonstrates how the weight learning adapts weights to capture the image statistics.
Figure 7: (A) Firing rates of class 1/class 2 membership neurons for a single site. (B) Spikes generated by the class 1/class 2 membership neurons for a single site. See the text.
this reason, neither the HCM nor the network can be expected to find the five different textures perfectly, and, in fact, this is not the case (see Figure 8). Nevertheless, both segmentations come rather close to the ground truth.

4.2 Segmentation of Real-World Images. The results of the last section have demonstrated that the network matches the quality of the segmentation results achieved by the HCM when applied to artificial data. We use two real-world images (see Figure 9) to show the segmentation abilities of the network under realistic conditions. On the first image, depicting a railway station, we again observe a segmentation similar to that of the HCM (see Figure 10). All major segments have been identified by both approaches. Figure 6B shows the misclassification rate with regard to the HCM segmentation for a simulation run with this image.3 It demonstrates how the network approximates the HCM segmentation via weight adaptation. After some fluctuations, the network starts learning the segments around t = 5–6 and then quickly finds a segmentation that is already very close to the one generated by the HCM. For the balloon image, the network segmentation and that of the HCM differ significantly (see Figure 11). We argue that the segmentation generated by the network is superior because the meadow in the foreground is split into two parts by the HCM, while the network recognizes and represents it as a homogeneous region. Similarly, the forest in the image background is divided into two segments in a stripe pattern by the HCM. By reconsidering
3 Since segment labels are unique only up to a permutation of the label indices, we consider the misclassification rate with respect to the permutation that minimizes the total number of misclassifications.
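The permutation-minimizing misclassification rate described in footnote 3 can be computed by brute force over the K! label permutations; this is a sketch for small K, not the authors' implementation:

```python
from itertools import permutations

def misclassification_rate(a, b, K):
    """Fraction of sites where assignments a and b disagree, minimized
    over all permutations of the K segment labels of b (footnote 3)."""
    best = len(a)
    for perm in permutations(range(K)):
        errors = sum(ai != perm[bi] for ai, bi in zip(a, b))
        best = min(best, errors)
    return best / len(a)

hcm = [0, 0, 1, 1, 2, 2]
net = [2, 2, 0, 0, 1, 1]                       # same segmentation, relabeled
rate = misclassification_rate(hcm, net, K=3)   # -> 0.0
```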
Figure 8: Segmentation results for a Mondrian image. Assignments (left: HCM; middle: network; right: difference picture) are shown for several time steps. The HCM segmentation has been included for comparison. (A) At t ≈ 4, the segments are roughly recognized by the network. (B) At t ≈ 50 the network has reached a stable state.
Figure 9: Real-world (input) images used for the experiments: (A) Railway station. (B) Balloon image.
Figure 10: Segmentation results for an image of a railway station. Assignments (left: HCM; middle: network; right: difference picture) are shown for several time steps. The HCM segmentation has been included for comparison. (A) At t ≈ 5, the segments are roughly recognized by the network. (B) At t ≈ 50, the network has reached a stable state.
the original image in Figure 9B, one can hardly detect this pattern, except in one case where the network has also been able to detect two different segments. Note that this behavior was essentially reproducible over 20 different simulation runs. Hence, we conclude that the proposed network leads to a visually more plausible result for the balloon image, although effectively not all segments are used by the network. In the experiments, we observed waves of activation sweeping through segments. This synchronization effect is caused by the lateral inhibitory and excitatory stimuli between neighboring image sites. We want to shed some more light on the network dynamics using the balloon image example. Figure 12A shows the values of the HCM cost function as a function of the model time (in ms), while Figure 12B shows the costs of the constrained model. For comparison, the HCM costs obtained by the deterministic annealing approach in Puzicha et al. (1999) are also depicted, as a constant function. Clearly, the deterministic annealing solution is superior in terms of HCM costs, which is not particularly surprising since the learning mechanism does not optimize the exact HCM cost function. The network, for example, has built-in local smoothness constraints by
Figure 11: Segmentation results for the balloon image. Activities (left: HCM; middle: network; right: difference picture) are shown for several time steps. The HCM segmentation has been included for comparison. (A) At t ≈ 5, the network starts recognizing segments. (B) At t ≈ 50, the network has reached a stable state.
its lateral excitatory and inhibitory connections. Nevertheless, the weight adaptation scheme still minimizes the HCM costs H, although in an oscillatory rather than steepest-descent fashion. Furthermore, we observe that the network finds a better solution than the HCM solution in terms of the constrained objective function H′. In Figure 12D, we have again plotted the mean activity per segment of the segment-membership neurons (i, ν) in a small time window. One can detect distinct periodic activity patterns (see also above), one characteristic pattern for each segment. In the time window shown, the network has almost reached a stable state. In general, we have observed these segment-specific periodic firing patterns; that is, synchrony within classes can be observed. The dynamics of the network can be nicely visualized by considering the prototypical histograms q(j|ν). Note that these are computed for diagnostic purposes from the current assignment of image sites to segments, that is, from the current network read-out. The time evolution of these prototypical histograms is shown in Figure 12C for the second segment in the segmentation of the balloon image. Each curve depicts q(j|ν), 1 ≤ j ≤ M, as a function of the simulation time. From the figure, the effects of the weight adaptation scheme can be traced, since
Figure 12: Demonstration of the network dynamics and learned statistics. (A) HCM costs for the balloon image in comparison with the costs achieved by the deterministic annealing approach. (B) HCM costs with smoothness penalty for the balloon image in comparison with the costs achieved by the deterministic annealing approach. (C) q ( j|2) as computed from the network site-to-segment assignment as a function of the time (see the text). (D) Activity pattern of segment membership neurons in a simulation time window.
the weights essentially determine the assignment of sites to segments, and the prototypes amount to a compact representation of the assignments here. One can thus see how the weight adaptation scheme modifies the weights w_{j,ν}^{(i)} and how the network reaches a stable state. In Figure 13, we take a look at the model time t (in ms) required until a stable state is reached. In this context, we speak of a stable state if less than 1% of the assignments have changed during the previous 10 iterations. We have used 49 restarts, each with random initialization, in order to get an idea of how many simulation steps are needed. For the railway station, a stable state is reached after 44 ± 14 ms, and after 28 ± 14 ms in the case of the balloon image. Note furthermore that no weight adaptation takes place in the first 2 ms of model time. We conclude that the proposed
Figure 13: Histogram for the milliseconds t (model time) until a stable state is reached for the (A) railway station and (B) balloon image.
architecture leads to stable results rather quickly despite the use of lateral smoothness constraints.

5 Conclusion

We have presented a network of spiking neurons for the purpose of image segmentation. It is motivated by a statistical model for histogram clustering. A learning rule with fast synaptic adaptation is derived from the model that allows the inference of robust segmentations. Results on artificial as well as real-world images demonstrate the applicability of the proposed procedure. The implementation of the segmentation model with spiking neurons provides the advantage that computationally expensive Gibbs models can be efficiently simulated by the network dynamics rather than approximated by mean-field or loopy belief propagation techniques. Alternatives to the network simulation often lead to the use of Gibbs sampling schemes, which are qualitatively equivalent to our suggested network model but demand a larger amount of computation for obtaining comparable results. Future work should address the computation of histograms within the network (which is done externally in the presented simulations) and an RBF-like network architecture. The latter would ideally rely on the timing of spikes for the purpose of segmentation. This modification of the network essentially amounts to a change of the coding scheme (e.g., to time to first spike), since the present architecture neglects the exact timing of spikes. For Euclidean data, a successful temporal scheme has been proposed in Ruf (1998) for the purpose of clustering. In the long run, a general theory for mapping statistical models to networks of pulsing neurons would be desirable, since the use of statistical models enables robust inference and networks of pulsed neurons can be
cheaply implemented in hardware. We consider the proposed architecture a promising step in this direction.

Acknowledgments

We thank the anonymous reviewers for their helpful comments and suggestions.

References

Campbell, S. R., Wang, D. L., & Jayaprakash, C. (1999). Synchrony and desynchrony in integrate-and-fire oscillators. Neural Computation, 11, 1595–1619.
Cesmeli, E., & Wang, D. (2001). Texture segmentation using gaussian-Markov random fields and neural oscillator networks. IEEE Transactions on Neural Networks, 12(2), 394–404.
Geman, D., Geman, S., Graffigne, C., & Dong, P. (1990). Boundary detection by constrained optimization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(7), 609–628.
Gerstner, W. (1998). Spiking neurons. In W. Maass & C. M. Bishop (Eds.), Pulsed neural networks. Cambridge, MA: MIT Press.
Gerstner, W. (2002). Integrate-and-fire neurons and networks. In M. A. Arbib (Ed.), The handbook of brain theory and neural networks (2nd ed.). Cambridge, MA: MIT Press.
Hofmann, T., & Buhmann, J. M. (1997). Pairwise data clustering by deterministic annealing. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(1), 1–14.
Hofmann, T., Puzicha, J., & Buhmann, J. (1998). Unsupervised texture segmentation in a deterministic annealing framework. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(8), 803–818.
Intrator, N. (2002). Competitive learning. In M. A. Arbib (Ed.), The handbook of brain theory and neural networks (2nd ed., pp. 238–241). Cambridge, MA: MIT Press.
Koch, C. (1999). Biophysics of computation: Information processing in single neurons. New York: Oxford University Press.
Maass, W. (1998). Computing with spiking neurons. In W. Maass & C. M. Bishop (Eds.), Pulsed neural networks. Cambridge, MA: MIT Press.
Opper, M., & Saad, D. (2001). Advanced mean field methods. Cambridge, MA: MIT Press.
Pereira, F. C. N., Tishby, N., & Lee, L. (1993). Distributional clustering of English words. In Proceedings of the 31st Meeting of the Association for Computational Linguistics (pp. 183–190). Columbus, OH.
Puzicha, J., Hofmann, T., & Buhmann, J. (1999). Histogram clustering for unsupervised segmentation and image retrieval. Pattern Recognition Letters, 20, 899–909.
Ruf, B. (1998). Computing and learning with spiking neurons: Theory and simulations. Unpublished doctoral dissertation, TU Graz.
Tu, Z., Zhu, S. C., & Shum, H. (2002). Image segmentation by data driven Markov chain Monte Carlo. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(5), 657–673.
von der Malsburg, C., & Buhmann, J. (1992). Sensory segmentation with coupled neural oscillators. Biological Cybernetics, 67, 233–242.
Wang, D., & Terman, D. (1997). Image segmentation based on oscillatory correlation. Neural Computation, 9, 805–836.
Wersing, H., Steil, J., & Ritter, H. (2001). A competitive-layer model for feature binding and sensory segmentation. Neural Computation, 13, 357–387.
Received January 7, 2004; accepted October 1, 2004.
LETTER
Communicated by Bard Ermentrout
Neural Modeling of an Internal Clock

Tadashi Yamazaki
[email protected]

Shigeru Tanaka
[email protected]

Laboratory for Visual Neurocomputing, RIKEN Brain Science Institute, Wako, Saitama 351-0198, Japan
We studied a simple random recurrent inhibitory network. Despite its simplicity, its dynamics was rich: because of the random recurrent connections among neurons, the activity patterns of neurons evolved over time without recurring. The sequence of activity patterns was triggered by an external signal, and its generation was stable against noise. Moreover, the same sequence was reproducible by applying a strong transient signal; that is, the sequence generation could be reset. A passage of time from the trigger of an external signal could therefore be represented by the sequence of activity patterns, suggesting that this model can work as an internal clock. The model could generate different sequences of activity patterns for different external signals; thus, spatiotemporal information can also be represented. Moreover, it was possible to speed up and slow down the sequence generation.
1 Introduction

An internal clock, which explicitly represents temporal information among cognitive and behavioral events, is widely believed to exist in the brain. In particular, many lines of evidence support the idea that the passage of time from the onset of an input signal is represented. Several hypotheses about how a time passage is represented have been proposed (see Ivry, 1996, for a review); one possibility is to assign one population of active neurons to one time interval from the stimulus onset. If active neuronal populations are generated sequentially in the order of interval lengths and each population corresponds exclusively to one time interval, then we can regard the sequence of these populations as representing the time passage from the stimulus onset, with the stimulus onset triggering the internal clock. Moreover, if another sequence of populations of active neurons is generated when another input signal is presented, this mechanism can represent not merely temporal but also spatiotemporal information. Yet what type of neuronal circuit enables such a population coding of time?

Neural Computation 17, 1032–1058 (2005)
© 2005 Massachusetts Institute of Technology
To address this question, we studied a simple recurrent inhibitory network in which neurons were connected with random synaptic weights. We conducted computer simulations and demonstrated that this model generates a sequence of activity patterns of neurons that changes dynamically from the stimulus onset, and that these activity patterns do not recur. Thus, this model can represent a time passage from the stimulus onset in the rate-coding scheme. We confirmed the nonrecurrence of activity patterns by calculating the correlation between patterns generated at two given time steps. We then examined the parameter dependency of the dynamics both numerically and analytically. We found that the total number of active neurons remained constant, suggesting that the maximum length of a representable time passage may be exponential in the number of neurons. The total activity of neurons was also estimated and was constant. Moreover, we found a parameter that controls the accuracy of the time passage representation.

We further demonstrated some abilities of the model. The model can generate different activity patterns of neurons when different input stimuli are presented. This implies the ability to generate different time passages for different input patterns, which suggests that spatiotemporal information is represented. The stability of sequence generation against noise was examined. An important issue regarding an internal clock is how to reset it; that is, when an identical stimulus is presented, the same population of active neurons must always be generated. This was possible by adding a strong transient signal to the input signal. Finally, the clock speed could be modified by changing the neuronal time constant or the synaptic strength of recurrent connections.

2 Model Description

The model consists of N neurons, each with two types of input: excitatory afferent inputs and recurrent inhibitory inputs from other neurons. Let z_i(t) be the activity of neuron i at time t.
This is calculated as z_i(t) = [u_i(t)]^+, where [x]^+ = x if x > 0 and 0 otherwise, and u_i(t) is the internal state of neuron i at time t, which is calculated as

$$u_i(t) = I - \sum_j w_{ij} \sum_{s=1}^{t} \exp\bigl(-(t-s)/\tau\bigr)\, z_j(s-1). \qquad (2.1)$$
I and w_ij denote the afferent input signal to neuron i and the synaptic weight of the recurrent inhibition from neuron j to neuron i, respectively. We assume the temporal integration of the activities of neurons over a long time, which is represented by the summation with respect to s, where τ determines the
integration range. Since z can take a real value, the model is represented in the rate-coding scheme.

Our goal is to show that the activity patterns of neurons generated using equation 2.1 can encode a passage of time. Specifically, we hypothesized that the activity patterns do not recur, that is, that the activity pattern at one time step is dissimilar to the pattern at a different time step when the interval between the two steps is large. Therefore, we use the following correlation function as the similarity index:

$$C(t_1, t_2) = \frac{\sum_{i=1}^{N} z_i(t_1)\, z_i(t_2)}{\sqrt{\sum_{i=1}^{N} z_i^2(t_1) \sum_{i=1}^{N} z_i^2(t_2)}}. \qquad (2.2)$$
The numerator represents the inner product of the vectors of activity patterns at times t_1 and t_2, and the denominator normalizes the vector lengths. Since z_i(t) can take only positive values, the index takes a value between 0 and 1. It takes 1 when the activities of neurons at t_1 and t_2 are identical. We defined C(t_1, t_2) = 0 if $\sum_{i=1}^{N} z_i^2(t_1)$ or $\sum_{i=1}^{N} z_i^2(t_2) = 0$, which occurs when the inhibition is strong and no neurons are active.

Furthermore, we considered the similarity between two activity patterns under different parameter settings. Let $z_i^{(1)}(t)$ and $z_i^{(2)}(t)$ be the activities of neuron i at time t under the two parameter settings. In this case, we used the following cross-correlation function as the similarity index:

$$C(t_1, t_2) = \frac{\sum_{i=1}^{N} z_i^{(1)}(t_1)\, z_i^{(2)}(t_2)}{\sqrt{\sum_{i=1}^{N} z_i^{(1)2}(t_1) \sum_{i=1}^{N} z_i^{(2)2}(t_2)}}. \qquad (2.3)$$
This similarity index also takes a value between 0 and 1. Its value is 1 when the two activity patterns are identical. We also defined C(t_1, t_2) = 0 if $\sum_{i=1}^{N} z_i^{(1)2}(t_1)$ or $\sum_{i=1}^{N} z_i^{(2)2}(t_2) = 0$.

We conducted computer simulations over T steps and evaluated similarity indices between t_1 and t_2. We used the following basic parameter setting: N = 1000, T = 1000, I = 1, τ = 100, and the w_ij's were drawn from the binomial distribution Pr(w_ij = 0) = Pr(w_ij = 2κ/N) = 0.5, with κ = 1. We also tested a uniform distribution and a normal distribution with mean κ/N and variance (κ/N)², and found no significant difference in the results.

3 Results

3.1 Representation of a Time Passage. The left panel in Figure 1 shows the plot of active states of the first 100 neurons out of the 1000 neurons composing the recurrent network during T steps under the basic parameter setting. Once a neuron started to become active, its activity continued for several
Figure 1: Raster plot of active states (z_i(t) > 0) of the first 100 neurons (left), the similarity index C(t_1, t_2) (middle), and plots of C(200, t), C(500, t), and C(800, t) (right).
hundreds of steps, and then the neuron became inactive. Some neurons were reactivated after the inactive period. Thus, active and inactive periods appeared alternately. We examined how the number of active neurons changed with time and found that it rapidly converged to a constant; only 196.0 ± 7.5 neurons were active on average in the last 500 steps. This indicates that a constant number of neurons is active at each time step, that is, a time passage is represented by a sparse code. This also implies that the number of possible active states of N neurons is approximately $\binom{N}{N/5} \approx \frac{5\exp(0.5N)}{2\sqrt{2\pi N}}$, where we used Stirling's formula ($n! \approx \sqrt{2\pi n}\, n^n e^{-n}$). Hence, the model may be able to represent a sufficiently long time passage. We calculated the total activity of neurons and found that this quantity also rapidly converged to a constant, 10.8 ± 0.07 on average over the last 500 steps.

The middle panel in Figure 1 shows the similarity index calculated using equation 2.2. Since equation 2.2 takes the two arguments t_1 and t_2, we obtained a T × T matrix whose rows and columns are indexed by t_1 and t_2, respectively. Similarity indices are plotted in a gray scale in which black indicates 0 and white 1. A broad white band appeared diagonally. Since the similarity index at identical steps (t_2 = t_1) is 1, the diagonal elements appear white. The similarity index decreased monotonically as the interval between t_1 and t_2 became longer, but it remained high when the interval was short. This can be confirmed in the right panel of Figure 1, which shows the plots of C(200, t), C(500, t), and C(800, t), namely, the 200th, 500th, and 800th rows of the similarity-index matrix. This result indicates that the activity pattern of neurons changed gradually with time and did not recur.
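The combinatorial estimate quoted above can be checked numerically. The following sketch (ours, not part of the original article) compares the exact binomial coefficient $\binom{N}{N/5}$ with $5\exp(0.5N)/(2\sqrt{2\pi N})$; the two agree to leading exponential order because the entropy of the 1/5 sparseness level is approximately 0.5 nats.

```python
# Check the sparse-coding estimate: C(N, N/5) ~ 5*exp(0.5*N)/(2*sqrt(2*pi*N)).
# The exponent 0.5 rounds the entropy H(1/5) ~ 0.5004 nats, so the ratio of
# the two sides stays of order 1 (it drifts slowly with N, but not in the
# exponent that matters for the counting argument).
import math

def stirling_estimate(n):
    return 5.0 * math.exp(0.5 * n) / (2.0 * math.sqrt(2.0 * math.pi * n))

for n in (100, 500, 1000):
    exact = math.comb(n, n // 5)          # exact binomial coefficient
    print(n, stirling_estimate(n) / exact)
```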
We tested the case of 100,000 steps (i.e., T = 100,000) and confirmed that the similarity index displayed a white diagonal band and dark off-diagonal domains, indicating that the representation of a time passage was stable at least up to 100,000 steps. Thus, our model can represent a time passage by a dynamically changing population of active neurons in the rate-coding scheme.
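For concreteness, the simulation just described can be sketched as follows. This is our illustrative code, not the authors'; the double sum of equation 2.1 is maintained incrementally as an exponentially decayed trace, and the default network size is reduced to keep the run cheap.

```python
# Minimal simulation sketch of equations 2.1-2.2 (illustrative, small N).
import numpy as np

def simulate(N=200, T=300, I=1.0, tau=100.0, kappa=1.0, seed=0):
    rng = np.random.default_rng(seed)
    # Binomial weights: w_ij is 0 or 2*kappa/N with equal probability.
    w = np.where(rng.random((N, N)) < 0.5, 2.0 * kappa / N, 0.0)
    z = np.zeros((T, N))
    trace = np.zeros(N)            # decayed sum of past activities (eq. 2.1)
    decay = np.exp(-1.0 / tau)
    for t in range(T):
        u = I - w @ trace          # internal state u_i(t)
        z[t] = np.maximum(u, 0.0)  # half-rectification z_i(t) = [u_i(t)]^+
        trace = decay * trace + z[t]
    return z

def similarity(z, t1, t2):
    # Correlation index of equation 2.2; defined as 0 for silent patterns.
    n1, n2 = float(z[t1] @ z[t1]), float(z[t2] @ z[t2])
    if n1 == 0.0 or n2 == 0.0:
        return 0.0
    return float((z[t1] @ z[t2]) / np.sqrt(n1 * n2))
```

With the paper's parameters (N = 1000, T = 1000, τ = 100, κ = 1), similarity(z, t1, t2) should reproduce the qualitative behavior of Figure 1: values near 1 for nearby time steps and decaying values for long intervals.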
Figure 2: Similarity indices at various parameter settings: (a) N = 100, (b) N = 500, (c) N = 1000, (d) N = 2000; (e) τ = 10, (f) τ = 50, (g) τ = 100, (h) τ = 200; (i) κ = 0.5, (j) κ = 1.0, (k) κ = 2.0, (l) κ = 3.0.
3.2 Parameter Dependency. We studied how the dynamics changed when the value of a parameter was modified. First, we varied N and calculated similarity indices, with the other parameters fixed at the basic setting. For N = 100 (see Figure 2a), the white diagonal band became narrower and jagged, and white off-diagonal domains appeared, indicating the recurrence of an activity pattern of neurons. As N increased, the white off-diagonal domains disappeared, although the diagonal band became broad and vague (see Figures 2b–2d). This suggests that the number of neurons should be large for a unique representation of a time passage by activity patterns of neurons. On the other hand, for N = 2000 (see Figure 2d), off-diagonal domains tended to be totally white.

Next, we varied τ. For τ = 10, the similarity index became totally white, so that the diagonal band did not appear (see Figure 2e). This was because
for τ = 10, no neurons exhibited the repeated transition between active and inactive states, so the activity pattern did not change. For τ = 50, the diagonal band appeared but expanded as t_1 or t_2 increased (see Figure 2f). For larger τ, the diagonal band appeared clearly, and the off-diagonal domains became darker (see Figures 2g and 2h). This suggests that the range of temporal integration should be long for a stable representation of a time passage.

Figures 2i to 2l plot similarity indices for various κ values. The diagonal band became narrower and the off-diagonal domains became darker as κ increased, suggesting that a large κ enables the model to represent a time passage stably. As κ increased, black domains appeared on the top and the left sides of the similarity index. The emergence of these domains indicates that all neurons became inactive for the first tens of time steps because of the strong recurrent inhibition due to a large κ. Hence, κ should be within an appropriate range.

Varying I did not change the similarity index. Let z_i^c(t) be the activity of neuron i at time t under I = c. For any i and t, we obtained the relationship z_i^c(t) = c·z_i^1(t), where {z_i^1(t)} are the activities of neurons under the basic setting. This indicates that neuronal activity is scaled by the input strength; consequently, the similarity index does not depend on I. This scaling property depends on the choice of the nonlinearity: it holds because the half-rectification z_i(t) = [u_i(t)]^+ does not saturate the (positive) value of u_i(t).

In summary, N and τ should be large, while κ should be within an appropriate range, to generate a sequence of activity patterns of neurons uniquely representing a time passage.

3.3 Analysis of Parameter Dependency. We also analyzed the parameter dependency of the dynamics of our model. We analytically estimated the total activity of neurons, which is given by

$$\lim_{t\to\infty} \sum_i z_i(t) = \frac{NI}{1+\kappa\tau}. \qquad (3.1)$$
The derivation of this formula is found in the appendix. The left panel in Figure 3 shows the plots of $\sum_i z_i(T)$ obtained by computer simulations and from equation 3.1 against τ, and the middle panel shows the plots of $\sum_i z_i(T)$ and equation 3.1 against κ. In the computer simulations, we confirmed that the value of $\sum_i z_i(t)$ had already reached a steady state and become constant at t = T. As can be seen, there is excellent agreement between the analytical and simulation results.
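The agreement can be illustrated with a small, hedged numerical check (our sketch, using smaller N and τ than the paper and a loose tolerance, since the mean-field estimate ignores rectification effects):

```python
# Compare the simulated steady-state total activity with N*I/(1 + kappa*tau).
import numpy as np

rng = np.random.default_rng(1)
N, T, I, tau, kappa = 400, 2000, 1.0, 20.0, 1.0
w = np.where(rng.random((N, N)) < 0.5, 2.0 * kappa / N, 0.0)
z = np.zeros(N)
trace = np.zeros(N)                 # decayed activity trace (equation 2.1)
decay = np.exp(-1.0 / tau)
totals = []
for t in range(T):
    z = np.maximum(I - w @ trace, 0.0)
    trace = decay * trace + z
    if t >= T - 200:                # average over the final steps
        totals.append(z.sum())
simulated = float(np.mean(totals))
predicted = N * I / (1.0 + kappa * tau)   # equation 3.1
print(simulated, predicted)
```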
Figure 3: Plots of $\sum_i z_i(T)$ obtained by computer simulations against τ (left), plots of $\sum_i z_i(T)$ against κ (middle), and the duration of inactive periods in simulations against κ (right). The solid curves in the left and middle panels indicate the analytical estimate of the total activity given by equation 3.1, and the solid curve in the right panel indicates the analytically estimated duration of inactive periods given by equation 3.2.
Figures 2k and 2l show that all neurons became inactive in the first tens to hundreds of steps when κ was large. We theoretically estimated the duration of this initial inactive period. Let t_inactive be the time step at which neurons recovered from the initial inhibition and started to become active, that is, the duration of the initial inactive period. When κ < 1, t_inactive = 0. On the other hand, when κ ≥ 1, we obtained

$$t_{\mathrm{inactive}} = \tau \log \kappa + 1. \qquad (3.2)$$

The derivation is found in the appendix. The right panel in Figure 3 plots equation 3.2 and the duration of inactive periods in simulations with different κ's. The simulation result agrees with the theoretical one in general; however, there is a slight difference when κ is large. The difference could be due to an approximation in the derivation of equation 3.2: we omitted fluctuation terms, which might shorten the inactive period when κ is large.

We then studied how the width of the diagonal band is determined. We defined width(t), the width of the diagonal band at time t, by

$$\mathrm{width}(t) = \arg\max_{t'} \bigl(t' > t,\; C(t, t') > \theta\bigr), \qquad (3.3)$$

where θ is the threshold of similarity and was set at 0.9. We simplified equation 2.1 and derived the following equation (the derivation is found in the appendix):

$$u_i(t) \approx \frac{I\left(1 - \frac{\kappa\tau}{\sqrt{N}}\right)}{(1+\kappa\tau)\left(1 - \frac{1}{\sqrt{N}}\right)} + \frac{\kappa}{N} \sum_j \delta w_{ij} \sum_{s=1}^{t} \exp\bigl(-(t-s)/\tau\bigr)\, z_j(s-1), \qquad (3.4)$$
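Given a precomputed T × T similarity matrix, the band width of equation 3.3 is easy to evaluate. The following is our sketch, which reads equation 3.3 as selecting the farthest later step t′ whose similarity to t still exceeds θ and reports the offset t′ − t as the band width:

```python
# width(t) from a T x T similarity matrix C (equation 3.3, theta = 0.9).
import numpy as np

def band_width(C, t, theta=0.9):
    """Largest offset d >= 1 with C[t, t + d] > theta; 0 if none exists."""
    above = np.flatnonzero(C[t, t + 1:] > theta)
    return int(above[-1] + 1) if above.size else 0
```

For example, on a synthetic matrix C[i, j] = exp(−|i − j| / 10), only the immediate neighbor exceeds 0.9, so band_width returns 1 for interior rows.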
Figure 4: Plots of width(t) at $\kappa = \sqrt{N}/\tau$ for various N values (left) and the same plots at $\kappa = 2\sqrt{N}/\tau$ (right); N = 1000, 2000, 3000, 4000, 5000.

Figure 5: Plots of width(t) for various κ values (κ = 0.40–1.00) (left) and width(T) against κ (right).
where Pr(δw_ij = 1) = Pr(δw_ij = −1) = 0.5. This equation shows that if κτ/√N is less than 1, then u_i(t) tends, for all i, to a positive constant due to the first term in equation 3.4, so the fluctuation term (the second term in equation 3.4) may not contribute. Hence, width(t) increases as t increases. The left panel of Figure 4 plots width(t) for different N's, τ = 100, and κ = √N/τ. As can be seen, width(t) increases exponentially as t increases. Note that the change in width(t) is invariant regardless of N, suggesting that κτ/√N controls the dynamics. On the other hand, as shown in the right panel of Figure 4, when κ was twice as large (i.e., κ = 2√N/τ), width(t) became constant. Moreover, the change in width(t) is again invariant regardless of N. These results show that there exists a critical κ value, denoted κ*, such that if κ < κ*, then width(t) increases rapidly, whereas if κ > κ*, then width(t) remains constant.

We varied κ and plotted width(t) for different κ's in the left panel of Figure 5. As can be seen, κ* can be estimated at approximately 0.5 for N = 1000 and τ = 100. If κ < 0.5, then width(t) increases rapidly, whereas if κ > 0.5, then width(t) remains constant; at κ = 0.5, the increase in width(t) becomes linear. We then estimated κ* more accurately. We conducted computer simulations over 5T steps and plotted width(T) against κ in the right panel of Figure 5. At
κ < 0.496, width(T) became 4T. Since the total number of simulation steps was 5T, the upper bound of width(T) was 4T, so the diagonal band reached its maximum. At κ ≈ 0.496, width(T) decreased sharply, and as κ increased further, width(T) became constant. Hence, we could observe a phase transition at κ* = 0.496.

The control parameter κτ/√N also implies that width(t) increases when N is large, τ is small, or κ is small. This is consistent with the simulation results shown in Figure 2d (for N), Figures 2e and 2f (for τ), and Figure 2i (for κ). Therefore, although the plots of similarity indices show substantial differences in their shapes, this control parameter provides a unified view of how the shapes differ under different parameter settings.

3.4 Analysis of the Sparseness. These dependency results may imply that much of the variance among similarity indices is accounted for by the level of sparseness of the active neuron population. We examined how much of the variance the sparseness could account for. Let r(t) be the ratio of the number of active neurons to N at time t:

$$r(t) = \frac{\text{number of active neurons at time } t}{N}.$$
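The ratio r(t) can be read off directly from a T × N activity array (a trivial helper of ours, counting any strictly positive activity as active):

```python
# r(t): fraction of active neurons (z_i(t) > 0) at each time step.
import numpy as np

def active_ratio(z):
    """z is a T x N activity array; returns the length-T vector r(t)."""
    return (z > 0).mean(axis=1)
```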
The sparseness is the reciprocal of r(t); that is, smaller r(t) values indicate greater sparseness. We conducted simulations to clarify the relationship between width(t) and r(t). The parameters N, τ, and κ were chosen from N ∈ {500, 1000, 2000}, τ ∈ {50, 100, 150, 200}, and κ ∈ {0.5, 1.0, 2.0, 3.0}, and all combinations were tested; thus, we conducted 3 × 4 × 4 = 48 simulations. We did not test the case of N = 100, because the fluctuation of the diagonal band observed in Figure 2a is clearly a size effect; hence, we considered only the larger values N = 500, 1000, and 2000. We also omitted the case of τ = 10 and tested τ = 150 instead.

We then plotted width(t) against r(t) at t = T in Figure 6. The left panel of Figure 6 shows the plot of width(T) against r(T) for 0 ≤ r(T) ≤ 1. As can be seen, width(T) values for r(T) < 0.4 are small but scattered, while width(T) values for r(T) > 0.4 reach the maximum value of 1000. Moreover, this plot implies that there may be a positive correlation between width(T) and r(T). To verify this, we fitted a linear function to the set of data points for r(T) < 0.4 by least squares and drew the regression line. The line has a positive slope, indicating a positive correlation.

Next, we examined how the scattering of data points arose. For each τ ∈ {50, 100, 150, 200}, we selected the corresponding data points, fitted a linear function to them, and drew the line in the right panel of Figure 6. We used only the data points for r(T) < 0.4. This figure indicates that the scattering of width(T) values depends on τ, that is, width(T) increases with τ. Therefore, the sparseness can be a good measure
Figure 6: Plots of width(T) against r(T) for 0 ≤ r(T) ≤ 1, with the linear regression line calculated using data points for r(T) ≤ 0.4 (left), and for 0 ≤ r(T) ≤ 0.4, with regression lines determined using the data points corresponding to each τ value (right).
that accounts for the variation among similarity indices, even though τ also plays a critical role in determining the exact value of width(T).

3.5 Representation of Different Time Passages for Different Input Signals. The model can represent different time passages for different input signals by generating different sequences of activity patterns of neurons. To demonstrate this, we extended the input I to I_i so that individual neurons could receive arbitrary signals independently, and we conducted two simulations. In each simulation, the I_i's were set to 0 or 1 at random with probability 0.5. We then calculated the similarity index between the activity patterns of the two simulations. The left and middle panels in Figure 7 show the similarity indices obtained from each simulation, and the right panel shows the similarity index between the two sequences of activity patterns generated under the different settings of {I_i}. As can be observed, the similarity index for each simulation shows a clear diagonal band, indicating that a sequence of activity patterns is generated without recurrence. The similarity index between the two sequences, however, is totally black. This indicates that different time passages are generated by different input signals.

3.6 Stability Against Noise in Input Signals. We examined to what extent a sequence of activity patterns of neurons generated by our model was robust against temporal noise. We extended the input signals as follows:

$$I_i(t) = I \cdot (1 + \eta \cdot U[-1, 1]), \qquad (3.5)$$
where U[−1, 1] denotes a uniform random number in [−1, 1] considered as noise and η represents the noise intensity. Note that different noise was incorporated at each time step.
Figure 7: Similarity indices at different settings of {I_i} (left and middle) and the similarity index between the two sequences (right).
Figure 8: Similarity indices at various η values. From left to right, the η values are 0.01, 0.05, and 0.10, respectively.
The simulation procedure was the same as in the previous simulation. We tested the cases η = 0.01, 0.05, and 0.10. For each η value, we generated two sequences of input signal patterns {I_i^(1)(t)} and {I_i^(2)(t)} and conducted simulations. We then calculated the similarity indices between the simulations, which are shown in Figure 8. Even with the injection of 5% noise, the diagonal band still remained; it disappeared at 10% noise injection.

3.7 Stability Against Noise in Connection Matrix. We also examined the robustness of sequence generation against noise in the connection matrix w_ij. We extended the synaptic weights as follows:

$$w_{ij}(t) = \begin{cases} (1 + \eta \cdot U[-1, 1]) \cdot w_{ij} & \text{if } w_{ij} = \dfrac{2\kappa}{N}, \\[4pt] (\eta \cdot U[0, 1]) \cdot \dfrac{1}{N} & \text{otherwise}, \end{cases} \qquad (3.6)$$
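The two branches of equation 3.6 can be sketched as follows (our illustrative helper, not the authors' code; `existing` marks connections whose nominal weight is 2κ/N):

```python
# Per-step weight noise of equation 3.6: jitter existing connections
# multiplicatively; give absent connections a small random positive weight.
import numpy as np

def noisy_weights(w, kappa, eta, rng):
    N = w.shape[0]
    existing = np.isclose(w, 2.0 * kappa / N)   # connections with w_ij = 2k/N
    jitter = (1.0 + eta * rng.uniform(-1.0, 1.0, size=w.shape)) * w
    filler = eta * rng.uniform(0.0, 1.0, size=w.shape) / N
    return np.where(existing, jitter, filler)
```

Calling this once per time step with fresh random numbers reproduces the "different noise at each step" protocol described below.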
where U[−1, 1] and U[0, 1], respectively, denote uniform random numbers in [−1, 1] and [0, 1] considered as noise, and η represents the noise intensity. Note that different noise was incorporated at each time step. The simulation procedure was the same as in the previous section. Briefly, we varied
Figure 9: Similarity indices at various η values. From left to right, the η values are 0.3, 0.5, and 0.7, respectively.
Figure 10: Similarity indices for {z_i(t)} (left) and for {z_i(t + T)} (middle), and the similarity index between {z_i(t)} and {z_i(t + T)} (right).
η as 0.3, 0.5, and 0.7. For each η value, we conducted two simulations with different random seeds and then calculated the similarity index between the two generated sequences. The calculated similarity indices are shown in Figure 9. Even with the injection of 50% noise, the diagonal band still remained; it disappeared at 70% noise injection.

3.8 Resetting of an Internal Clock. Can we reset the clock in the model? We conducted simulations over 2T steps, applying an input signal in which strong transient signals I = 3.0 at t = 0 and t = T were added to the sustained signal I. The left and middle panels in Figure 10 are the similarity indices for the sequences {z_i(t)} and {z_i(t + T)} for 0 ≤ t ≤ T − 1, respectively. Immediately after the transient injection of a strong signal, all neurons became active once; they then became inactive for the subsequent tens of steps because they received strong inhibitory feedback from other neurons. Hence, the similarity indices show black domains on the top and the left side. After release from inhibition, neurons started to become active, and a sequence of populations was generated, which is indicated by the white diagonal band. The right panel in Figure 10 is the similarity index between {z_i(t)} and {z_i(t + T)}, which also shows a white diagonal band.
Figure 11: Plots of $\sum_i z_i(t)$ when the transient signal was added (I(0) = I + ΔI) and not added (I(0) = I).
This index shows that the sequence of activity patterns generated at t ≥ T was almost identical to that generated at t ≥ 0. This suggests that a transient component of the input signal can reset the internal clock.

Does the transient signal affect the total activity $\sum_i z_i(t)$? If the activity markedly decreased after receiving the signal and did not return to its normal level, the model would not be adequate as an internal clock. In Figure 11 we plotted the total activity of neurons $\sum_i z_i(t)$ with and without the transient signal. As seen, during the inactive period the total activity was 0, and the activity returned to normal immediately after the end of the inactive period, indicating that the transient signal does not affect the level of the total activity.

3.9 Speeding Up and Slowing Down an Internal Clock. We examined the similarity between sequences of activity patterns of neurons generated using different values of τ or κ. For each case, we conducted two computer simulations: in one, the value of either τ or κ was altered, and in the other, the basic setting was used. We then calculated the similarity index between the sequence {z_i^(1)(t)} in the altered case and the sequence {z_i^(2)(t)} in the basic case.

The left panel in Figure 12 shows the similarity index for the case in which τ was set at 90. As can be seen, the white band representing high similarity slanted slightly above the diagonal line. The degree of the slant depended on the difference in τ: if we set τ at 80, the slant increased, as shown in the middle panel of Figure 12. However, the white band became shorter, which suggests that a long time passage could not be represented. The right panel in Figure 12 shows the similarity index for the second case, in which κ was set at 1.5. The white band again slanted away from the diagonal line and became shorter. The black
Figure 12: Similarity indices for τ decreased to 90 (left) and 80 (middle), and for κ increased to 1.5 (right).
region on the top edge represents the period in which all neurons became inactive because of the strong inhibition caused by the large κ. These results indicate that the clock speed can be increased by changing the parameter τ or κ.
4 Discussion

4.1 The Model. We studied a random recurrent inhibitory network as an internal clock. Individual neurons exhibited random repeated transitions between active and inactive states following the stimulus onset, owing to the randomness incorporated in the recurrent connections, and the active and inactive periods were sustained for several tens to hundreds of steps. Hence, the population of active neurons changed gradually with time and did not recur. This property was confirmed by calculating the similarity index.

Both the number of active neurons and the total activity of neurons remained constant. This is an important property for a stable time representation, because if the number of active neurons varied at each time step, there might exist a time step at which no neurons are active. Moreover, if a population of active neurons at one time step were a subset of the population at another time step, it would be impossible to distinguish the two time steps in the learning of a timing discussed in section 4.2. The constancy also indicates that a time passage is represented by a sparse code, suggesting that the model may be able to generate a sequence of neuronal activity patterns whose length is exponential in the number of neurons. Noise stability was also examined: stability for up to 5% uniform noise in the input signals and 50% uniform noise in the connection matrix was demonstrated. Our internal clock could be reset by a strong transient input added to the sustained signal.

Accuracy is an important issue for the representation of time. Accuracy in our framework may be estimated by the length of the run of successive time steps represented by one single activity pattern of neurons, because when neurons reveal an activity pattern, the time steps represented by this activity pattern
1046
T. Yamazaki and S. Tanaka
cannot be distinguished from each other. If successive time steps are represented by a single activity pattern, the similarity index between two time steps within that range is high. Thus, the range is equivalent to the width of the white diagonal band in the similarity index, and width(t), defined in equation 3.2, represents the accuracy of time representation in our framework. As discussed in section 3.3, the change in width(t) depends on the parameter κτ/√N. If this parameter is small, width(t) increases exponentially, and hence accuracy deteriorates as t increases, whereas if it is large, width(t) remains constant and accuracy is preserved. Therefore, depending on this parameter, the model can represent an accurate internal clock as well as an internal clock whose accuracy decreases with time.

Psychophysical experiments have shown a relationship between time interval and accuracy known as Weber's law (Gibbon, Malapani, Dale, & Gallistel, 1997), which states that as the time interval t increases, the standard deviation of reported intervals over trials increases linearly, while the average of reported intervals is equal to t. That is, the accuracy of the representation of time t decreases linearly as t increases. The accuracy in our model decreases linearly only at the critical point κ = κ*. Hence, strictly speaking, this model does not satisfy Weber's law. Nevertheless, the left panel in Figure 5 demonstrates that at κ = 0.5, width(t) appears to increase linearly with respect to t, implying that Weber's law is satisfied in the vicinity of the critical point for finite time steps even in this model.

Furthermore, we tested whether the model could change the clock speed. It has been shown that the basal ganglia are involved in time estimation, specifically for longer time events (>3 sec) in motor planning (Ivry, 1996; Lalonde & Hannequin, 1999).
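The dynamics summarized in this section can be sketched numerically. The following is a minimal sketch, not the authors' code: it assumes the discrete dynamics of equation 2.1 with rectified-linear units and random weights w_ij drawn as 0 or 2/N with equal probability, and it takes the similarity index to be the normalized overlap of the active populations; all parameter values are arbitrary illustrations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed illustrative parameters -- N, T, tau, kappa, and I are NOT taken
# from the paper; they only demonstrate the mechanism.
N, T = 400, 300
tau, kappa, I = 10.0, 1.0, 1.0

# Random recurrent inhibitory weights: each w_ij is 0 or 2/N with equal
# probability, as assumed in the mean-field analysis of appendix A.1.
W = rng.choice([0.0, 2.0 / N], size=(N, N))
decay = np.exp(-1.0 / tau)

z = np.zeros((T, N))       # rectified activities z_i(t)
trace = np.zeros(N)        # running value of sum_s exp(-(t-1-s)/tau) z_j(s)
for t in range(T):
    u = I - kappa * (W @ trace)   # internal states u_i(t), per equation 2.1
    z[t] = np.maximum(u, 0.0)     # rectification: z_i(t) = [u_i(t)]^+
    trace = decay * trace + z[t]  # update the exponentially decaying trace

q = z.mean(axis=1)                # population mean activity q(t)
active = z > 0                    # active populations over time

def similarity(t1, t2):
    """Normalized overlap of the active populations at times t1 and t2."""
    a, b = active[t1], active[t2]
    return (a & b).sum() / np.sqrt(a.sum() * b.sum())
```

Updating the exponential trace recursively makes each step a single matrix-vector product; the drift of the active population can then be inspected through similarity(t1, t2).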
Some studies have shown that in the basal ganglia, the facilitation of dopamine secretion from the substantia nigra pars compacta speeds up the internal clock, while the suppression or blockade of dopamine secretion slows it down (Meck, 1996). Since modulatory neurotransmitters change neuronal excitability (Missale, Nash, Robinson, Jaber, & Caron, 1998; Kandel, Schwartz, & Jessel, 2000), we considered varying τ or κ. We then obtained similarity indices in which the white band deviated from the diagonal line. These results suggest that the change in neuronal activity patterns is speeded up or slowed down, and hence so is the speed of the internal clock.

4.2 Learning of Timing in the Cerebellum. Pavlovian eyelid conditioning is a clear and experimentally tractable example of the representation of the passage of time (Mauk & Donegan, 1997). This experiment involves repeated presentations of a conditional stimulus (CS) paired with an unconditional stimulus (US) delayed for a certain time after the CS onset. Typically, a neutral stimulus such as a tone serves as the CS and an air puff directed to one eye as the US. With this training, the subject not only learns to close its eye (conditioned response, CR) in response to the CS but also learns to elicit the CR at the same time as the interstimulus interval (ISI)
[Figure 13 diagram: granule, Golgi, and Purkinje cells; mossy fibers, parallel fibers, and a climbing fiber; pontine nucleus, cerebellar nucleus, and inferior olive; excitatory and inhibitory synapses; CS, US, CR, and LTD sites.]
Figure 13: Schematic of known cell types and synaptic connections in the cerebellum. Shown in bold letters and solid lines are the components corresponding to our model.
between the CS and US onsets during the conditioning, indicating that the time passage from the CS onset to the US onset is learned. Evidence from electrophysiological and behavioral studies demonstrates that Pavlovian eyelid conditioning is mediated by the cerebellum working as an internal clock (Ivry, 1996). Figure 13 shows a schematic of the known cell types and synaptic connections in the cerebellum (Eccles, Ito, & Szentágothai, 1967; Ito, 1984). The neural signal of the CS comes from the pontine nucleus through mossy fibers to the cerebellar nucleus, which attempts to elicit spike activity. The CS signal is also sent to Purkinje cells, relayed by granule cells, and the Purkinje cells in turn inhibit the cerebellar nucleus. Thus, the cerebellar nucleus receives both excitatory inputs directly through mossy fibers and inhibitory inputs from Purkinje cells. Golgi cells are inhibitory neurons; they receive excitatory inputs from granule cells through parallel fibers and inhibit granule cells. The connections between granule and Golgi cells are recurrent and bipartite; there is no direct connection from a granule cell to another granule cell or from a Golgi cell to another Golgi cell. The neural signal of the US comes from the inferior olive through climbing fibers to Purkinje cells. The ISIs between the CS and US onsets are learned by long-term depression (LTD) of parallel fiber terminals at Purkinje cells (Ito, 1989). Before LTD is induced, the cerebellar nucleus receives the CS signal through mossy fibers and attempts to elicit spike activity. The nucleus, however, is inhibited by Purkinje cells that are excited by granule cells; hence, the nucleus cannot elicit spikes. When the US signal arrives at a Purkinje cell through a climbing fiber, with conjunctive stimulation through parallel fibers, the
Figure 14: Plots of 0.1 × Net input(t) (dashed line) and C(t1, t) (solid line) for t1 = 200, 500, 800 (left to right). Each panel shows values in [0, 1] against t from 0 to 1000.
LTD occurs at the parallel fiber terminals at the Purkinje cell. Thereby, the net input from granule cells to the Purkinje cell decreases. After the LTD is established, the nucleus is released from inhibition by the Purkinje cell and can elicit the CR.

Our model described in equation 2.1 can be derived from the dynamical equations of granule and Golgi cells; the derivation is given in the appendix. Hence, the model may be able to play the role of the cerebellar internal clock. Let us examine whether our model can be used for the learning of the ISI between the onsets of the CS and US on the basis of the above scenario. Let $w_i^{\mathrm{PF}}$ be the synaptic weight between granule cell $i$ and the target Purkinje cell, initially set to 1 for all $i$. We assumed that the US signal arrived at the Purkinje cell at $t = 200$ and set $w_i^{\mathrm{PF}} = 0$ for $i \in \{i \mid z_i(200) > 0\}$ (neurons that were active at $t = 200$) due to the LTD of parallel fiber terminals. The net input to the Purkinje cell at time $t$ was calculated according to

$$\mathrm{Net\ input}(t) = \sum_i w_i^{\mathrm{PF}} z_i(t).$$
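The readout and LTD rule above can be illustrated with a toy activity sequence. This sketch does not reproduce the network of equation 2.1; it substitutes a hypothetical sparse binary population that drifts by one neuron per time step, solely to show how silencing the parallel-fiber synapses of the neurons active at the US time makes Net input(t) dip to zero there.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-in for the granule-cell activities z_i(t): a sparse binary
# population that swaps one neuron per step (hypothetical, for illustration).
N, T, n_active, t_us = 200, 400, 20, 200

active = set(rng.choice(N, size=n_active, replace=False).tolist())
z = np.zeros((T, N))
for t in range(T):
    z[t, list(active)] = 1.0
    leave = int(rng.choice(sorted(active)))
    enter = int(rng.choice(sorted(set(range(N)) - active)))
    active.discard(leave)
    active.add(enter)

# Parallel-fiber weights w_i^PF, 1 for all i before learning.
w_pf = np.ones(N)
# LTD: silence the synapses of the neurons active when the US arrives at t_us.
w_pf[z[t_us] > 0] = 0.0

net_input = z @ w_pf   # Net input(t) = sum_i w_i^PF z_i(t)
```

Because the active population drifts away from its configuration at the US time, Net input(t) is zero exactly at t_us and recovers on either side, mirroring the dip described for Figure 14.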
We also tested the cases of t = 500 and 800. The left panel in Figure 14 shows the plot of Net input(t) and C(200, t). The value of Net input(t) was rescaled by a factor of 0.1 so as to be comparable to [0, 1]. At t = 1, this value reached 0.1N because inhibition was not effective at this step (out of range in this plot). It then decreased gradually as t increased and became 0 at t = 200, after which the value slowly increased and became constant. Similar results were obtained in the cases of t = 500 and 800 (the middle and right panels in Figure 14). These results indicate that our internal clock works for the learning of the ISI. Note that the shape of Net input(t) is the inverse of the similarity index at time t, suggesting that the success of learning coincides with the appearance of a clear white band in the similarity index.

The model can address several aspects of Pavlovian eyelid conditioning. Mauk and Ruiz (1992) demonstrated that a differential conditioning procedure in which discriminable CSs are individually paired with a US at different ISIs can promote the concurrent acquisition of differently timed
CRs in an identical animal. The discriminable CSs used by Mauk and Ruiz (1992) may correspond to the different patterns of input signals in our model, and the model could represent different time passages for the different patterns, as shown in Figure 7. Aitkin and Boyd (1978) reported that neurons in the lateral pontine nuclei, which send mossy fiber inputs to the cerebellum and are a source of the CS signal, show strong transient responses at the onset of a pure tone presentation followed by sustained activity during the presentation. They found that three-fourths of the mossy fibers showed transient activity. We simulated the 3:1 ratio of transient to sustained activity strength of the afferent input signal and successfully reproduced the identical sequence of activity patterns of neurons.

The cerebellum is modulated by dopaminergic inputs. Specifically, dopaminergic fibers from the ventral tegmental area to the cerebellar cortex terminate in the granular layer (in which granule and Golgi cells are located) of the crus I and crus II ansiform lobules (Ikai, Takada, Shinonaga, & Mizuno, 1992). It has also been reported that these areas are essential for the normal learning of Pavlovian eyelid conditioning (Hardiman & Yao, 1992). Although there has been no report of whether the cerebellar internal clock is speeded up or slowed down by these neurotransmitters, they might modulate the speed of the cerebellar internal clock.

4.3 Relationships with Other Models. While our model utilized random recurrent inhibitory connections, corresponding to the recurrent connections between granule and Golgi cells, for the generation of a cerebellar internal clock, several models based on different mechanisms have been proposed to date. Moore, Desmond, and Berthier (1989) proposed a tapped-delay-line model. They assumed an anatomical structure of mossy fibers such that they fire one after another with a certain delay after the onset of mossy fiber signals.
Owing to this explicit delay mechanism, they proposed that a firing event of each granule cell represents a time passage after the signal onset. However, no such anatomical structure among mossy fibers has been found. Gluck, Reifsnider, and Thompson (1990) assumed oscillating granule cells with different frequencies. Their model is analogous to a Fourier series expansion, in that any timing can be represented by a population of neurons. However, they had to assume complex-valued synaptic weights to calculate the Fourier series expansion; hence, their model seems unrealistic. Chapeau-Blondeau and Chauvet (1991) proposed a model in which parallel fibers play the role of an explicit delay mechanism, so that the transmission of granule cell activities is delayed. However, as demonstrated by Vos, Maex, Volny-Luraghi, and De Schutter (1999), Golgi cells receiving the same parallel fiber input fire synchronously, suggesting that parallel fibers do not generate delay. Bullock, Fiala, and Grossberg (1994) proposed a spectral timing model using an intracellular mechanism of delay generation. However, to generate such delay, they needed to assume that the number of Golgi cells is equal to that of granule cells, so
1050
T. Yamazaki and S. Tanaka
that there is a one-to-one correspondence between granule and Golgi cells. Biologically, the granule/Golgi cell ratio is estimated to be about 1000 on the basis of a granule/Purkinje cell ratio of 274 (Harvey & Napper, 1991) and a Purkinje/Golgi ratio of 3.22 (Palkovits, Magyar, & Szentágothai, 1971). Later, Fiala, Grossberg, and Bullock (1996) proposed another mechanism of delay generation in Purkinje cells based on the slow process of intracellular transduction by the metabotropic glutamate receptor (mGluR), by which their model can generate delays of up to 4 sec. Such slow processes have indeed been reported (Batchelor, Madge, & Garthwaite, 1994; Dzubay & Otis, 2002), and glutamate uptake controls the length of the delay (Reichelt & Knöpfel, 2002). However, the delays reported so far are at most 700 msec, which is insufficient to account for Pavlovian eyelid conditioning, because conditioning is successful for ISIs of up to 2 to 3 sec (Mauk & Donegan, 1997). Also, the 4 sec delay generated by their model depends on a parameter that has no biological counterpart; rather, they chose parameters so as to obtain a 4 sec delay.

Buonomano and Mauk (1994) were the first to focus on the recurrent loop between granule and Golgi cells as the device of time coding. To demonstrate how the recurrent circuit works, they modeled the circuit based on many biological properties, such as the neural circuitry, cell ratios, the synaptic convergence-divergence ratio, realistic neuron models with leaky membranes, and realistic cell parameters. The model generated a temporal code based on the population of active neurons. Later, the model was extended to the complete cerebellar circuit and could reproduce the experimental results of Pavlovian eyelid conditioning (Medina, Garcia, Nores, Taylor, & Mauk, 2000; Medina & Mauk, 2000).
However, despite the completeness of the realistic modeling, the mechanism generating such a temporal code remained unclear owing to the complexity of the realistic model. In contrast to their realistic modeling, our approach focused on the simplicity of the model. This simplicity made possible a degree of mathematical and numerical analysis of how and why this model could generate a temporal code. We could clarify the mechanism of time coding: randomness in the recurrent connections plays the key role. Moreover, we could investigate the dynamics of the model in depth and demonstrate several properties, such as parameter dependency, noise stability, resettability, and the ability to produce different patterns for different input signals.

Our simplified modeling approach had the advantage of enabling mathematical and numerical analyses, but potentially at the cost of realism. We have started realistic modeling of the cerebellar circuit based on conductance-based leaky integrate-and-fire neuron models with large decay constants of excitatory postsynaptic potentials (EPSPs) at synapses from granule to Golgi cells and inhibitory postsynaptic potentials (IPSPs) at synapses from Golgi to granule cells. Such large time constants are evident: the NMDA component of the EPSP can be fitted by two exponential functions in
which the time constants are 31 msec (33%) and 170 msec (67%) (Dieudonné, 1998), and the GABAA component of the IPSP by three exponential functions in which the time constants are 8.0 msec (22%), 33.5 msec (58%), and 102.9 msec (20%) (Brickley, Cull-Candy, & Farrant, 1999). To date, we have confirmed that the dynamics of the realistic model is quite similar to that of the simple model. This is not surprising, because Ermentrout (1994) reported the equivalence of conductance-based models to rate models when the conductance change caused by incoming spikes is sufficiently slow.

Rabinovich et al. (2001) proposed a theoretical framework of temporal coding called winnerless competition, in which the dynamics of neurons is described by the Lotka-Volterra equation. This model differs from ours in some respects, although in both the temporal changes of active neurons are caused by asymmetric recurrent inhibitory connections. In their approach, the response profile of one neuron, which is temporally complex, is assumed to carry information about a stimulus. In our approach, the temporal profile of one neuronal response does not carry this information; rather, the temporal change of active neuronal populations correlates with the passage of time. Hence, whereas in their model the number of active neurons can vary with time because the response profiles carry the information, in our model the number should not vary, for stable representation of a time passage, as discussed in section 4.1.

In their seminal papers, Marr (1969) and Albus (1971) paid attention to the recurrent circuit between granule and Golgi cells as the device for the sparse representation of spatial information, called codon representation (Marr, 1969) or expansion recoding (Albus, 1971). They did not, however, study the capacity for time coding in the cerebellum. We extended their idea of sparse representation to spatiotemporal information processing.
Li and Hopfield (1989) studied the dynamics of the network composed of mitral and granule cells in the olfactory bulb. Mitral cells are excitatory cells that receive odor signals and excite granule cells, which in turn inhibit mitral cells. In this way, the mitral-granule cell network is a recurrent inhibitory circuit like the cerebellar granule-Golgi cell network; hence, their dynamical model is similar to ours. However, they focused not on time coding but on the oscillatory responses of mitral cells. Bugmann (1998) and Okamoto and Fukai (2001) proposed models representing a single time interval and satisfying Weber's law. These models have symmetric recurrent connections and hence a steady state; a single time interval is represented by the length of the relaxation process required to attain the steady state. However, these models do not work as an internal clock, because they cannot represent multiple time intervals.

Our internal clock model is simple and derivable from the well-known structure of the cerebellum. Its simplicity helps us to clarify the mechanism and the function of the internal clock, while its grounding in cerebellar anatomy supports its biological plausibility.
Appendix

A.1 Derivation of Equation 3.1. From equation 2.1, it follows that

$$u_i(t) \approx \frac{I}{\tau} + \left(1 - \frac{1}{\tau}\right) u_i(t-1) - \kappa \sum_j w_{ij} z_j(t-1),$$

and by considering the continuum limit, we obtain

$$\tau \frac{du_i(t)}{dt} = -u_i(t) + I - \kappa\tau \sum_j w_{ij} z_j(t).$$

Let $u_i$ and $z_i$ be $u_i(t)$ and $z_i(t)$ at the steady state, respectively. Since $\frac{du_i(t)}{dt} = 0$ at the steady state,

$$u_i = I - \kappa\tau \sum_j w_{ij} z_j,$$

and hence

$$z_i = F\Big(I - \kappa\tau \sum_j w_{ij} z_j\Big),$$

where $F(x) = \max(x, 0)$. Since $\{w_{ij}\}$ has the same number of 0's and 2/N's on average, if this equation can be linearized at the steady state, the mean field solution is

$$\frac{1}{N}\sum_i z_i = \frac{I}{1 + \kappa\tau},$$

and equation 3.1 is derived.

A.2 Derivation of Equation 3.2. We calculate $q(t) \stackrel{\text{def}}{=} \frac{1}{N}\sum_i z_i(t)$ at the steady state when $N \to \infty$, which is

$$q(t) = I - \kappa \sum_{s=0}^{t-1} \exp\Big(-\frac{t-1-s}{\tau}\Big)\, q(s).$$

From the definition, it follows that $q(0) = 0$ and $q(1) = I$. Then,

$$q(2) = I - \kappa \exp\Big(-\frac{1}{\tau}\Big) I = I\Big(1 - \kappa \exp\Big(-\frac{1}{\tau}\Big)\Big).$$
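The mean-field solution above is easy to verify numerically. The following is a sketch with arbitrary illustrative values of I, κ, and τ (none taken from the paper); note that the plain iteration z ← F(I − κτz) oscillates when κτ > 1, so a damped update is used instead.

```python
# Damped fixed-point iteration for the mean-field equation z = F(I - kappa*tau*z),
# with F(x) = max(x, 0), whose solution is z = I/(1 + kappa*tau) (equation 3.1).
# Parameter values are arbitrary illustrations, not taken from the paper.
I, kappa, tau = 1.0, 1.2, 10.0

z, alpha = 0.5, 0.1   # alpha damps the update so that the iteration converges
for _ in range(500):
    z += alpha * (max(I - kappa * tau * z, 0.0) - z)

mean_field = I / (1.0 + kappa * tau)   # closed-form solution, equation 3.1
```

The damping factor is needed because the undamped map has slope −κτ at the fixed point, so its iterates would diverge from the solution whenever κτ exceeds 1.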
If $q(2) > 0$, then the inactive period stops. On the other hand, if $q(2) \le 0$, then we calculate $q(3)$ as

$$q(3) = I\Big(1 - \kappa \exp\Big(-\frac{2}{\tau}\Big)\Big).$$

Inductively, $t_{\text{inactive}}$ can be estimated as

$$t_{\text{inactive}} = \begin{cases} \arg\min_t \Big(1 - \kappa \exp\big(-\frac{t-1}{\tau}\big) > 0\Big) & \kappa \ge 1, \\ 0 & \text{otherwise}. \end{cases}$$

When $\kappa \ge 1$, it follows that

$$t_{\text{inactive}} = \arg\min_t \Big(1 - \kappa \exp\Big(-\frac{t-1}{\tau}\Big) > 0\Big) = \arg\min_t \big(t > \tau \log \kappa + 1\big) = \tau \log \kappa + 1,$$

and equation 3.2 is derived.

A.3 Derivation of Equation 3.4. We recalculate $q(t) \stackrel{\text{def}}{=} \frac{1}{N}\sum_i z_i(t)$ when $N$ is finite but sufficiently large. A similar calculation carried out in the derivation of equation 3.2 yields

$$q(t) \approx I - \kappa \sum_{s=0}^{t-1} \exp\Big(-\frac{t-1-s}{\tau}\Big) q(s) + \kappa \frac{1}{N}\sum_i \sum_j \delta w_{ij} \sum_{s=0}^{t-1} \exp\Big(-\frac{t-1-s}{\tau}\Big) z_j(s). \tag{A.1}$$

Since $\frac{1}{N}\sum_i \delta w_{ij}$ is a gaussian variable with zero mean and variance $\frac{1}{N}\big(\frac{1}{N}\big)^2$, an upper bound of this term may be $\frac{1}{N\sqrt{N}}$. Then, instead of equation A.1, we consider the following equation:

$$q^*(t) = I - \kappa \sum_{s=0}^{t-1} \exp\Big(-\frac{t-1-s}{\tau}\Big) q^*(s) + \frac{\kappa}{\sqrt{N}} \sum_{s=0}^{t-1} \exp\Big(-\frac{t-1-s}{\tau}\Big) q^*(s) = I - \kappa\Big(1 - \frac{1}{\sqrt{N}}\Big) \sum_{s=0}^{t-1} \exp\Big(-\frac{t-1-s}{\tau}\Big) q^*(s).$$
Thus, $q^*(t)$ may be regarded as an upper bound of $q(t)$. A similar calculation carried out in the derivation of equation 3.1 yields

$$q^*(t) = \frac{I}{1 + \kappa\tau\big(1 - \frac{1}{\sqrt{N}}\big)}.$$

Notice that $q^*(t) \to q(t)$ when $N \to \infty$. We assume that $|q^*(t) - q(t)|$ is sufficiently small and hence $q^*(t) \approx q(t)$. Then, by incorporating this into equation 2.1, we have

$$u_i(t) = I - \bar{w}\kappa \sum_j \sum_{s=0}^{t-1} \exp\Big(-\frac{t-1-s}{\tau}\Big) z_j(s) + \kappa \sum_j \delta w_{ij} \sum_{s=0}^{t-1} \exp\Big(-\frac{t-1-s}{\tau}\Big) z_j(s)$$
$$\approx I - \frac{\tau\kappa I}{1 + \kappa\tau\big(1 - \frac{1}{\sqrt{N}}\big)} + \kappa \sum_j \delta w_{ij} \sum_{s=0}^{t-1} \exp\Big(-\frac{t-1-s}{\tau}\Big) z_j(s)$$
$$= \frac{I\big(1 - \frac{\kappa\tau}{\sqrt{N}}\big)}{1 + \kappa\tau\big(1 - \frac{1}{\sqrt{N}}\big)} + \kappa \sum_j \delta w_{ij} \sum_{s=0}^{t-1} \exp\Big(-\frac{t-1-s}{\tau}\Big) z_j(s).$$
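Before turning to the circuit-level derivation, the inactive-period estimate from A.2 is easy to check numerically. The following sketch (not from the paper) finds the first time step t ≥ 2 with 1 − κ exp(−(t−1)/τ) > 0, which is the first integer exceeding τ log κ + 1.

```python
import math

# Numeric check of equation 3.2: for kappa >= 1, the inactive period ends at
# the first time step t with 1 - kappa*exp(-(t-1)/tau) > 0, i.e. the first
# integer t exceeding tau*log(kappa) + 1 (I = 1 without loss of generality).
def t_inactive(kappa, tau):
    if kappa < 1.0:
        return 0
    t = 2
    while 1.0 - kappa * math.exp(-(t - 1) / tau) <= 0.0:
        t += 1
    return t

# The step found by the recursion matches the closed-form estimate.
for kappa, tau in [(2.0, 10.0), (5.0, 20.0), (1.5, 50.0)]:
    predicted = tau * math.log(kappa) + 1.0
    assert t_inactive(kappa, tau) == math.floor(predicted) + 1
```

For κ < 1 the rectified activity never vanishes, so the function returns 0, matching the second case of the estimate.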
A.4 Derivation of Equation 2.1 from the Cerebellar Circuit. We consider $N$ granule cells and $K$ Golgi cells. Each granule cell receives the mossy fiber signal and sends its output signal to Golgi cells through excitatory synapses. In turn, each Golgi cell sends its output signal to granule cells through inhibitory synapses. Let $z_i^{\text{Gr}}(t)$ and $z_k^{\text{Go}}(t)$ be the activities of granule cell $i$ and Golgi cell $k$ at time $t$, respectively. They are given as

$$z_i^{\text{Gr}}(t) = \big[u_i^{\text{Gr}}(t)\big]^+, \qquad z_k^{\text{Go}}(t) = \big[u_k^{\text{Go}}(t)\big]^+,$$

where $u_i^{\text{Gr}}(t)$ and $u_k^{\text{Go}}(t)$ are the internal states of granule cell $i$ and Golgi cell $k$ at time $t$, respectively, and $[x]^+ = x$ if $x > 0$ and 0 otherwise. $u_i^{\text{Gr}}(t)$ and $u_k^{\text{Go}}(t)$ are calculated using the following equations:

$$\tau^{\text{Gr}} \dot{u}_i^{\text{Gr}} = -u_i^{\text{Gr}} + I - \sum_k w_{i,k}^{\text{Gr} \leftarrow \text{Go}} z_k^{\text{Go}}(t),$$
$$\tau^{\text{Go}} \dot{u}_k^{\text{Go}} = -u_k^{\text{Go}} + \sum_j w_{k,j}^{\text{Go} \leftarrow \text{Gr}} z_j^{\text{Gr}}(t),$$
where $\tau^{\text{Gr}}$ and $\tau^{\text{Go}}$ are the decay time constants, and $I$ represents the afferent input signal conveyed through mossy fibers. $w_{i,k}^{\text{Gr} \leftarrow \text{Go}}$ denotes the synaptic weight from Golgi cell $k$ to granule cell $i$, and $w_{k,j}^{\text{Go} \leftarrow \text{Gr}}$ the synaptic weight from granule cell $j$ to Golgi cell $k$. They are solved as

$$u_i^{\text{Gr}}(t) = I\Big(1 - \exp\Big(-\frac{t}{\tau^{\text{Gr}}}\Big)\Big) - \frac{1}{\tau^{\text{Gr}}} \sum_k w_{i,k}^{\text{Gr} \leftarrow \text{Go}} \int_0^t \exp\Big(-\frac{t-r}{\tau^{\text{Gr}}}\Big) z_k^{\text{Go}}(r)\,dr, \tag{A.2}$$

$$u_k^{\text{Go}}(t) = \frac{1}{\tau^{\text{Go}}} \sum_j w_{k,j}^{\text{Go} \leftarrow \text{Gr}} \int_0^t \exp\Big(-\frac{t-s}{\tau^{\text{Go}}}\Big) z_j^{\text{Gr}}(s)\,ds. \tag{A.3}$$

Since Golgi cells receive only excitatory inputs, $u_k^{\text{Go}}(t)$ is always positive, that is, $z_k^{\text{Go}}(t) = [u_k^{\text{Go}}(t)]^+ = u_k^{\text{Go}}(t)$. Using this relationship and incorporating equation A.3 into equation A.2, we obtain

$$u_i^{\text{Gr}}(t) = I\Big(1 - \exp\Big(-\frac{t}{\tau^{\text{Gr}}}\Big)\Big) - \frac{1}{\tau^{\text{Gr}}\tau^{\text{Go}}} \sum_j \sum_k w_{i,k}^{\text{Gr} \leftarrow \text{Go}} w_{k,j}^{\text{Go} \leftarrow \text{Gr}} \int_0^t \int_0^r \exp\Big(-\frac{t-r}{\tau^{\text{Gr}}}\Big) \exp\Big(-\frac{r-s}{\tau^{\text{Go}}}\Big) z_j^{\text{Gr}}(s)\,ds\,dr$$
$$= I\Big(1 - \exp\Big(-\frac{t}{\tau^{\text{Gr}}}\Big)\Big) - \frac{1}{\tau^{\text{Gr}}} \sum_j w_{ij} \int_0^t \int_0^r \exp\Big(-\frac{t-r}{\tau^{\text{Gr}}}\Big) \exp\Big(-\frac{r-s}{\tau^{\text{Go}}}\Big) z_j^{\text{Gr}}(s)\,ds\,dr, \tag{A.4}$$

where

$$w_{ij} \stackrel{\text{def}}{=} \frac{1}{\tau^{\text{Go}}} \sum_k w_{i,k}^{\text{Gr} \leftarrow \text{Go}} w_{k,j}^{\text{Go} \leftarrow \text{Gr}}.$$

We assume the temporal integration of the activities of neurons over a long time, which is represented by the summation with respect to $s$, where $\tau^{\text{Go}}$ determines the integration range. This assumption is based on biological
findings. Some in situ experiments showed that upon the injection of depolarizing current pulses of the same amplitude, Golgi cells fire much less frequently than granule cells: 0.07 Hz pA⁻¹ for Golgi cells (Dieudonné, 1998) and 2.94 Hz pA⁻¹ for granule cells (D'Angelo, Filippi, Rossi, & Taglietti, 1995). Since the current pulses are injected continuously, this decreased firing frequency of Golgi cells may indicate that input signals are integrated over the time interval between two successive spikes. The large difference in firing rate per input current suggests $\tau^{\text{Gr}} \ll \tau^{\text{Go}}$. Specifically, we assume $\tau^{\text{Gr}} \ll 1 \ll \tau^{\text{Go}}$. Then we regard $\frac{1}{\tau^{\text{Gr}}} \exp\big(-\frac{t-r}{\tau^{\text{Gr}}}\big)$ as the delta function $\delta(t-r)$ and $\exp\big(-\frac{t}{\tau^{\text{Gr}}}\big) \approx 0$. It follows that

$$u_i^{\text{Gr}}(t) = I - \sum_j w_{ij} \int_0^t \exp\Big(-\frac{t-s}{\tau^{\text{Go}}}\Big) z_j^{\text{Gr}}(s)\,ds.$$

Finally, by replacing $u_i^{\text{Gr}}(t)$, $z_i^{\text{Gr}}(t)$, and $\tau^{\text{Go}}$ simply with $u_i(t)$, $z_i(t)$, and $\tau$, and discretizing the time step, we obtain

$$u_i(t+1) = I - \sum_j w_{ij} \sum_{s=0}^{t} \exp\Big(-\frac{t-s}{\tau}\Big) z_j(s),$$

and equation 2.1 is derived.

References

Aitkin, L. M., & Boyd, J. (1978). Acoustic input to the lateral pontine nuclei. Hearing Research, 1, 67–77.
Albus, J. S. (1971). A theory of cerebellar function. Mathematical Biosciences, 10, 25–61.
Batchelor, A. M., Madge, D. J., & Garthwaite, J. (1994). Synaptic activation of metabotropic glutamate receptors in the parallel fibre–Purkinje cell pathway in rat cerebellar slices. Neuroscience, 63(4), 911–915.
Brickley, S. G., Cull-Candy, S. G., & Farrant, M. (1999). Single-channel properties of synaptic and extrasynaptic GABAA receptors suggest differential targeting of receptor subtypes. Journal of Neuroscience, 19, 2960–2973.
Bugmann, G. (1998). Towards a neural model of timing. BioSystems, 48, 11–19.
Bullock, D., Fiala, J. C., & Grossberg, S. (1994). A neural model of timed response learning in the cerebellum. Neural Networks, 7(6/7), 1101–1114.
Buonomano, D. V., & Mauk, M. D. (1994). Neural network model of the cerebellum: Temporal discrimination and the timing of motor responses. Neural Computation, 6, 38–55.
Chapeau-Blondeau, F., & Chauvet, G. (1991). A neural network model of the cerebellar cortex performing dynamic associations. Biological Cybernetics, 65, 267–279.
D'Angelo, E., Filippi, G. D., Rossi, P., & Taglietti, V. (1995). Synaptic excitation of individual rat cerebellar granule cells in situ: Evidence for the role of NMDA receptors. Journal of Physiology, 484(2), 397–413.
Dieudonné, S. (1998). Submillisecond kinetics and low efficacy of parallel fibre–Golgi cell synaptic currents in the rat cerebellum. Journal of Physiology, 510(3), 845–866.
Dzubay, J. A., & Otis, T. S. (2002). Climbing fiber activation of metabotropic glutamate receptors on cerebellar Purkinje neurons. Neuron, 36, 1159–1167.
Eccles, J. C., Ito, M., & Szentágothai, J. (1967). The cerebellum as a neuronal machine. New York: Springer-Verlag.
Ermentrout, B. (1994). Reduction of conductance-based models with slow synapses to neural nets. Neural Computation, 6, 679–695.
Fiala, J. C., Grossberg, S., & Bullock, D. (1996). Metabotropic glutamate receptor activation in cerebellar Purkinje cells as substrate for adaptive timing of the classically conditioned eye-blink response. Journal of Neuroscience, 16(11), 3760–3774.
Gibbon, J., Malapani, C., Dale, C. L., & Gallistel, C. (1997). Toward a neurobiology of temporal cognition: Advances and challenges. Current Opinion in Neurobiology, 7, 170–184.
Gluck, M. A., Reifsnider, E. S., & Thompson, R. F. (1990). Adaptive signal processing and the cerebellum: Models of classical conditioning and VOR adaptation. In M. A. Gluck & D. E. Rumelhart (Eds.), Neuroscience and connectionist theory (pp. 131–186). Hillsdale, NJ: Erlbaum.
Hardiman, M. J., & Yao, C. H. (1992). The effect of kainic acid lesions of the cerebellar cortex on the conditioned nictitating membrane response in the rabbit. European Journal of Neuroscience, 4, 966–980.
Harvey, R. J., & Napper, R. M. A. (1991). Quantitative studies of the mammalian cerebellum. Progress in Neurobiology, 36, 437–463.
Ikai, Y., Takada, M., Shinonaga, Y., & Mizuno, N. (1992). Dopaminergic and nondopaminergic neurons in the ventral tegmental area of the rat project, respectively, to the cerebellar cortex and deep cerebellar nuclei. Neuroscience, 51(3), 719–728.
Ito, M. (1984). The cerebellum and neuronal control. New York: Raven Press.
Ito, M. (1989). Long-term depression. Annual Review of Neuroscience, 12, 85–102.
Ivry, R. B. (1996). The representation of temporal information in perception and motor control. Current Opinion in Neurobiology, 6, 851–857.
Kandel, E. R., Schwartz, J. H., & Jessel, T. M. (2000). Principles of neural science (4th ed.). New York: McGraw-Hill.
Lalonde, R., & Hannequin, D. (1999). The neurobiological basis of time estimation and temporal order. Reviews in the Neurosciences, 10, 151–173.
Li, Z., & Hopfield, J. J. (1989). Modeling the olfactory bulb and its neural oscillatory processing. Biological Cybernetics, 61, 379–392.
Marr, D. (1969). A theory of cerebellar cortex. Journal of Physiology, 202, 437–470.
Mauk, M. D., & Donegan, N. H. (1997). A model of Pavlovian eyelid conditioning based on the synaptic organization of the cerebellum. Learning and Memory, 3, 130–158.
Mauk, M. D., & Ruiz, B. P. (1992). Learning-dependent timing of Pavlovian eyelid responses: Differential conditioning using multiple interstimulus intervals. Behavioral Neuroscience, 106(4), 666–681.
Meck, W. H. (1996). Neuropharmacology of timing and time perception. Cognitive Brain Research, 3, 227–242.
Medina, J. F., Garcia, K. S., Nores, W. L., Taylor, N. M., & Mauk, M. D. (2000). Timing mechanisms in the cerebellum: Testing predictions of a large-scale computer simulation. Journal of Neuroscience, 20(14), 5516–5525.
Medina, J. F., & Mauk, M. D. (2000). Computer simulation of cerebellar information processing. Nature Neuroscience, 3, 1205–1211.
Missale, C., Nash, S. R., Robinson, S. W., Jaber, M., & Caron, M. G. (1998). Dopamine receptors: From structure to function. Physiological Reviews, 78, 189–225.
Moore, J. W., Desmond, J. E., & Berthier, N. E. (1989). Adaptively timed conditioned responses and the cerebellum: A neural network approach. Biological Cybernetics, 62, 17–28.
Okamoto, H., & Fukai, T. (2001). Neural mechanism for a cognitive timer. Physical Review Letters, 86(17), 3919–3922.
Palkovits, M., Magyar, P., & Szentágothai, J. (1971). Quantitative histological analysis of the cerebellar cortex in the cat. II. Cell numbers and densities in the granular layer. Brain Research, 32, 13–32.
Rabinovich, M., Volkovskii, A., Lecanda, P., Huerta, R., Abarbanel, H. D. I., & Laurent, G. (2001). Dynamical encoding by networks of competing neuron groups: Winnerless competition. Physical Review Letters, 87(6), 068102.
Reichelt, W., & Knöpfel, T. (2002). Glutamate uptake controls expression of a slow postsynaptic current mediated by mGluRs in cerebellar Purkinje cells. Journal of Neurophysiology, 87, 1974–1980.
Vos, B. P., Maex, R., Volny-Luraghi, A., & De Schutter, E. (1999). Parallel fibers synchronize spontaneous activity in cerebellar Golgi cells. Journal of Neuroscience, 19(RC6), 1–5.
Received March 17, 2004; accepted October 1, 2004.
LETTER
Communicated by Klaus Obermayer
Mirror Symmetric Topographic Maps Can Arise from Activity-Dependent Synaptic Changes

Reiner Schulz
[email protected]

James A. Reggia
[email protected]

Departments of Computer Science and Neurology, UMIACS, University of Maryland, College Park, MD 20742, U.S.A.
Multiple adjacent, roughly mirror-image topographic maps are commonly observed in the sensory neocortex of many species. The cortical regions occupied by these maps are generally believed to be determined initially by genetically controlled chemical markers during development, with thalamocortical afferent activity subsequently exerting a progressively increasing influence over time. Here we use a computational model to show that adjacent topographic maps with mirror-image symmetry can arise from activity-dependent synaptic changes whenever the distribution radius of afferents sufficiently exceeds that of horizontal intracortical interactions. Which map edges become adjacent is strongly influenced by the probability distribution of input stimuli during map formation. Our results suggest that activity-dependent synaptic changes may play a role in influencing how adjacent maps become oriented following the initial establishment of cortical areas via genetically determined chemical markers. Further, the model unexpectedly predicts the occasional occurrence of adjacent maps with a different rotational symmetry. We speculate that such atypically oriented maps, in the context of otherwise normally interconnected cortical regions, might contribute to abnormal cortical information processing in some neurodevelopmental disorders.

1 Introduction

Multiple adjacent, roughly mirror-image topographic maps are commonly observed experimentally in the sensory neocortex of many species.¹ Familiar examples of adjacent mirror-image cortical maps include multiple representations of the body surface in primary somatosensory cortex of monkeys, as illustrated in Figure 1A (Merzenich, Kaas, Sur, & Lin, 1978; Sur, Nelson, & Kaas, 1982), and several mirror-image tonotopic maps in primary auditory cortex (Heschl's gyrus) in humans (Engelien et al., 2002; Formisano, Kim, & Salle, 2003). If one considers not only primary but also secondary sensory cortex (which also receives thalamocortical projections), numerous other mirror-image maps have been found in somatosensory (Beck, Pospichal, & Kaas, 1996; Krubitzer & Calford, 1992; Krubitzer, Clarey, Tweedale, Elston, & Calford, 1995; Nelson, Sur, Felleman, & Kaas, 1980), visual (Drager, 1975; Newsome, Maunsell, & van Essen, 1986; Sereno et al., 1995; Tiao & Blakemore, 1976), and auditory (Imig, Real, & Brugge, 1986; Pantev et al., 1995; Talavage, Ledden, & Benson, 2000) cortex in a variety of species. In addition, mirror-image movement representations have been found in the motor cortex of the macaque monkey (Gentilucci et al., 1989).

At present, the initial parcellation of cortex into multiple regions or areas is generally believed to be due to genetically determined chemical markers and independent of thalamocortical afferent activity (Sur & Leamey, 2001). However, it remains less clear why partially redundant cortical maps occur, why they are so often oriented with reflection symmetry, and what role thalamocortical activity plays in their formation. Multiple adjacent maps are often hypothesized to arise during development due to genetically mediated chemical gradients (Grove & Tomomi, 2003; Levitt, 2000; Zhou & Black, 2000). They are sometimes conjectured to have evolved due to genetic mutations (Allman & Kaas, 1971; Krubitzer, 1995), and it has been suggested that they may provide advantages due to separation of spatial and temporal processing, parallel processing of different sensory attributes, minimization of connection distances, and other factors (Kaas, 1988; Cowey, 1981; Jones, 1990). There has been less speculation as to why such maps often exhibit reflection symmetry, and the relative contributions during map formation of activity-dependent versus activity-independent mechanisms remain the source of some debate, even for individual maps (Grove & Tomomi, 2003; Cohen-Cory, 2002). Past computational models of self-organizing neocortical topographic maps (Kohonen, 2001; Ritter, Martinetz, & Schulten, 1992; Sutton, Reggia, Armentrout, & D'Autrechy, 1994; Sirosh & Miikkulainen, 1994) have generally been limited to single maps and thus do not shed substantial light on the issue of forming multiple mirror-image maps.

¹ We are concerned solely with topographic map formation, in which a roughly point-to-point "image" of a receptive surface forms on the cortical surface, and not with "feature maps," such as those involving orientation sensitivity.

Neural Computation 17, 1059–1083 (2005)
© 2005 Massachusetts Institute of Technology

Figure 1: (A) A cortical map in the somatosensory region SI of squirrel monkey, composed of adjoining areas 3b and 1, based on multiunit microelectrode readings. The arrow pairs indicate that the map consists of two complete, nearly topographic representations of the body surface that are roughly mirror images of each other, where the axis of reflection corresponds to the border between areas 3b and 1. From Sur et al. (1982), with permission. (B) Architecture of the model cortical region used in this and previous studies. An input pattern x encoding the stimulation of a point on the sensory surface is modulated by afferent synaptic strengths W to produce an activation pattern y over a region of cortical elements representing the neocortical surface. During learning of W, a map of the input patterns, and hence, assuming a suitable encoding is used, of the underlying sensory surface, forms in the cortical region.
Here we use a computational model to show that, in principle, multiple adjacent and mirror symmetric topographic maps can arise from activity-dependent synaptic changes alone, assuming that the distribution radius of already established cortical afferents sufficiently exceeds that of horizontal intracortical interactions during map development (Brown, Keynes, & Lumsden, 2001). This result raises the possibility that activity-dependent synaptic changes, which are generally assumed to be a central force in individual topographic map self-organization during development (Kohonen, 2001; Ritter et al., 1992; Sutton et al., 1994), may play a more important role in forming the orientations of adjacent cortical maps than is currently recognized. At the least, our results indicate the existence of an influence on the relative orientations of adjacent maps that may have affected the evolution of their genetically guided afferent connectivity. 2 Methods Our computer simulations adopt the widely used Kohonen model of map formation (Kohonen, 2001; Ritter et al., 1992) in a straightforward fashion.
Our goal is not to provide a detailed, veridical model of neocortical circuitry and its dynamics, but simply to show that principles widely used in past self-organizing map models are sufficient (with simple modifications) to create mirror-image map formation. While the standard Kohonen model is a substantial simplification of biological reality, it can produce single maps that, at least qualitatively, are strikingly similar to neocortical maps (Bauer, 1995; Kohonen, 1989; Martinetz, Ritter, & Schulten, 1989; Obermayer, Blasdel, & Schulten, 1992; Obermayer, Ritter, & Schulten, 1990; Obermayer, Schulten, & Blasdel, 1992; Palakal, Murthy, Chittajallu, & Wong, 1995; Ritter & Schulten, 1986), suggesting that it captures some fundamental principles of cortical self-organization. We extend the single-winner Kohonen model in a natural way to allow for the occurrence of multiple "winner nodes" across the emerging map's surface. Such an extension is consistent with the multifocal activity peaks often found in neocortical regions (Donoghue, Leibovic, & Sanes, 1992; Georgopoulos, Kettner, & Schwartz, 1988; Pei, Vidyasagar, Volgushev, & Creutzfeldt, 1994). Multiwinner self-organizing maps (SOMs) have been used in past, more biologically realistic (and more complex and computationally expensive) cortical models (von der Malsburg, 1973; Sutton et al., 1994; Sirosh & Miikkulainen, 1994). However, these previous multiwinner SOMs typically do not compute their output using Kohonen's algorithm (or a generalization thereof, as we do here). Instead, a system of differential equations governs the SOM's activation dynamics in these cases, and the system's iterative simulation usually leads to multiple, spatially separated "Mexican hat" patterns of activity. So, past multiwinner SOMs differ from Kohonen SOMs algorithmically and also with respect to their application domain (Schulz & Reggia, 2004).
Here, we address the issue of how generalizing Kohonen's algorithm to multiple winners influences map formation, an issue that to our knowledge has not been investigated before. 2.1 Model Architecture and Dynamics. The basic architecture of the multiwinner SOM, illustrated in Figure 1B, is identical to that of a standard SOM. While many variations are possible, we elected for the cortical nodes to be arranged in a regular, rectangular lattice of R rows by C columns. The distance between two cortical nodes i and i' at positions (r, c) and (r', c') in the lattice is measured using the box-distance metric, that is, d_lattice(i, i') = max(|r - r'|, |c - c'|). Each cortical node i receives an afferent connection from each of the P nodes in the input layer. Every afferent connection carries a nonnegative, real-valued weight w_ij on the connection from the jth input node to the ith cortical node, and w_i ∈ R_+^P represents the afferent weight vector to the ith cortical node. The level of activation of an input or cortical node ranges between 0 (inactive) and 1 (fully active). The activation levels of all P input nodes make up the input pattern, a vector x ∈ [0, 1]^P of unit length.
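The box-distance metric on the lattice is simply the Chebyshev distance; a minimal sketch (the function name is ours):

```python
def d_lattice(pos_a, pos_b):
    """Box distance between lattice positions (r, c) and (r', c'):
    the maximum of the row and column offsets (Chebyshev distance)."""
    (r1, c1), (r2, c2) = pos_a, pos_b
    return max(abs(r1 - r2), abs(c1 - c2))
```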
Similarly, the activation levels of all cortical nodes form the output pattern, a vector y ∈ [0, 1]^{RC}. In general, an input pattern x encodes the stimulation of a point on a two-dimensional sensory surface. To avoid biases due to unequal length input vectors, the planar sensory surface inputs were normalized in length by their projection onto the surface of the unit sphere. Specifically, given a point p = [p_x, p_y] on the unit square, its image on the unit sphere is the point q = [q_x = p_x/a, q_y = p_y/a, q_z = b/a], where a = (p_x^2 + p_y^2 + b^2)^{1/2} and b = √2 − (p_x^2 + p_y^2)^{1/2}. The images on the unit sphere of the 441 points at the intersections in a regular grid of 21 rows by 21 columns covering the unit square were used for training. We used coordinate encoding of input patterns in most of our experiments because, especially during the search for suitable training parameters, computational efficiency was critical. Real sensory stimuli during biological map formation are, of course, more complex, but such simple local stimuli have often been used in past models of topographic map formation and have repeatedly been found to be adequate to support formation of single topographic maps. However, to demonstrate that coordinate encoding does not bias the model in favor of our hypothesis about orientations of adjacent maps, we repeated some of the experiments using bell-shaped sensory activation patterns (described later). Given an input pattern x, the output pattern is determined by the same computationally efficient process employed by the standard SOM (Kohonen, 2001), except that it is generalized in a natural way to allow the simultaneous existence of multiple "winners." First, the net input h to each cortical node i is computed as h_i = w_i^T x, where T indicates the transpose of the column vector w_i.
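The unit-sphere projection and the net-input computation can be sketched as follows (NumPy-based; the function names are ours). By construction, every projected input has unit length:

```python
import numpy as np

def project_to_sphere(px, py):
    """Map a point (px, py) in the unit square onto the unit sphere:
    q = [px/a, py/a, b/a] with b = sqrt(2) - sqrt(px^2 + py^2)
    and a = sqrt(px^2 + py^2 + b^2), so that ||q|| = 1."""
    b = np.sqrt(2.0) - np.sqrt(px ** 2 + py ** 2)
    a = np.sqrt(px ** 2 + py ** 2 + b ** 2)
    return np.array([px / a, py / a, b / a])

def net_inputs(W, x):
    """Net input h_i = w_i^T x for every cortical node; W stores one
    afferent weight vector per row."""
    return W @ x
```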
We approximate the computationally expensive, iterative competitive activation dynamics (Mexican hat pattern) that is often implemented via the numerical solution of differential equations and iteratively transforms h into y (Cho & Reggia, 1994; Pearson, Finkel, & Edelman, 1987; Reggia, D'Autrechy, Sutton, & Weinrich, 1992; Sutton et al., 1994; von der Malsburg, 1973) by a one-step selection of winners. However, unlike in the standard SOM, multiple winners can occur: each cortical node i that receives a net input greater than that to each of the N neighboring cortical nodes closest to i (ties resolved arbitrarily) is taken to be a winner. For all cortical nodes, including those near or on the edges of the SOM's lattice, N is taken to be the number of other cortical nodes within a fixed radius of competition r_comp from the cortical node at location (R/2, C/2) in the center of the lattice. Note how having the same N for cortical nodes along the lattice's edges is different from letting each cortical node compete with all other cortical nodes within a fixed radius from its position, which would introduce a bias favoring cortical nodes located near the lattice boundary. Since parameter r_comp is usually chosen to be small relative to the size of the lattice, typically multiple winner cortical nodes occur throughout the lattice in response to each input pattern. Each winner is made the central "peak" of an "island" of activation. The distribution of activation on a single island
is such that the winner at the center of the island (cortical node i) is maximally active (y_i = 1), and the activation level of each cortical node j that competed with i decreases exponentially with increasing distance between j and i. Specifically, if the set V of winners is

V = { i | ∀ j ≠ i : j competes with i ⇒ h_j(t) < h_i(t) },    (2.1)

then the activation of cortical node j is

y_j = γ^{d(i, j)}  with i ∈ V and ∀ k ∈ V, d(k, j) ≥ d(i, j),    (2.2)
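Equations 2.1 and 2.2 can be sketched on an R × C lattice as follows. This is our own simplified reading: edge nodes here compete within a clipped window rather than with a fixed N competitors, and exact ties produce no winner rather than being broken arbitrarily:

```python
import numpy as np

def find_winners(h, R, C, r_comp):
    """Eq. 2.1 (simplified): node (r, c) wins if its net input is the
    unique maximum within box-distance r_comp of its position."""
    H = h.reshape(R, C)
    winners = []
    for r in range(R):
        for c in range(C):
            rows = slice(max(0, r - r_comp), min(R, r + r_comp + 1))
            cols = slice(max(0, c - r_comp), min(C, c + r_comp + 1))
            block = H[rows, cols]
            # Unique maximum of its competition neighborhood -> winner.
            if H[r, c] == block.max() and (block == block.max()).sum() == 1:
                winners.append((r, c))
    return winners

def island_activations(winners, R, C, gamma):
    """Eq. 2.2: y_j = gamma ** d(i, j), where i is the winner whose
    peak is closest (in box distance) to node j."""
    Y = np.zeros((R, C))
    for r in range(R):
        for c in range(C):
            d = min(max(abs(r - wr), abs(c - wc)) for wr, wc in winners)
            Y[r, c] = gamma ** d
    return Y
```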
where γ ∈ [0, 1] determines the shape of each island of activation (lower γ means faster drop-off from the peak). If two or more islands of activation partially overlap, the activation level of a cortical node j in the region of overlap is determined by the island whose peak is closest to j. Unless stated otherwise, the parameter values used in the experiments reported here are C = 11 and r_comp = 7. Before training, each weight is independently initialized with a random value from the interval [0, 1], and each weight vector is then normalized to unit length. During training, the SOM learns by adjusting the weights on the incoming connections in response to each input of a vector from the training set, presented in a random order that is different for each epoch. The number of training epochs is 2000 unless noted otherwise. For each cortical node i, the learning rule is

w̃_i = w_i + µ y_i x,    (2.3)

w_i = w̃_i / ||w̃_i||_2.    (2.4)
Equation 2.3 implements typical Hebbian learning where µ ∈ (0, 1] is the learning rate. Normalization in equation 2.4 restricts w_i to move across the surface of the unit hypersphere and may decrease a connection's efficacy due to competition with the other connections terminating at cortical node i. We refer to this learning as competitive Hebbian learning, as it incorporates competition between cortical nodes for activation, and competition between weights due to vector normalization, as well as Hebbian learning. Typically during the training of a standard single-winner SOM, the values of certain parameters in the learning rule depend on how far training has progressed (Kohonen, 2001). For example, training is often divided into two phases: a rough ordering phase corresponding to large values for γ and µ in our model and a convergent phase corresponding to small values for γ and µ. Analogously in our model, parameters γ and µ monotonically decrease in a nonlinear fashion from some initial value to a smaller final value. For example, γ(t) = γ_fin + (γ_init − γ_fin)/(1 + e^{(t − γ_infl)/γ_σ}), where t is the
fraction of completed training epochs, γ_init = 0.9 (γ_fin = 0.0) determines γ's initial (final) value, γ_infl = 0.33 is the point of inflection, and γ_σ = 0.1 determines the rate of decline. A similar function is used for µ, where µ_init = 0.5, µ_fin = 0.0, µ_infl = 0.5, and µ_σ = 0.1. 2.2 Measures of Map Formation. Measures of the goodness of map formation, such as the topographic product (Bauer & Pawelzik, 1992) or the topographic function (Villmann, Der, Herrmann, & Martinetz, 1997), include in their calculations the extent to which adjacent locations in the input sensory space are represented by nonadjacent cortical nodes. Thus, these measures give inaccurate scores when multiple maps are involved because two cortical elements in different cortical maps representing nearby sensory surface locations can be widely separated from each other. For this reason, we devised and used a measure M ≥ 0 of map formation computed as the mean of the smallest 2% of all pairwise dot products between the weight vectors of adjacent cortical nodes. Roughly, M is inversely proportional to the average distance between receptive field centers of adjacent cortical elements in those parts of the cortex where these distances are greatest. Larger values of M indicate better map formation. Due to the normalization of the weight vectors to unit length, M = 1 is maximal. Solely for the purpose of visualizing map formation, each input location is associated with a label that is either the character @ or the blank character. When these labels are viewed on the sensory surface at their associated positions as in Figure 2A, they can be seen to form an inward clockwise spiral, with the size of the @ labels decreasing toward the center of the spiral (the size of an @ label has no meaning other than to indicate visually its location).
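The learning rule (equations 2.3 and 2.4) and the sigmoidal parameter schedule can be sketched as follows (vectorized over all cortical nodes; the function names are ours):

```python
import numpy as np

def hebbian_step(W, x, y, mu):
    """Eq. 2.3: Hebbian increment w~_i = w_i + mu * y_i * x, followed
    by eq. 2.4: renormalization of each afferent weight vector to
    unit length. W holds one weight vector per row."""
    W_tilde = W + mu * y[:, None] * x[None, :]
    return W_tilde / np.linalg.norm(W_tilde, axis=1, keepdims=True)

def schedule(t, v_init, v_fin, v_infl, v_sigma):
    """Sigmoidal decay used for both gamma and mu:
    v(t) = v_fin + (v_init - v_fin) / (1 + exp((t - v_infl) / v_sigma)),
    with t the fraction of completed training epochs."""
    return v_fin + (v_init - v_fin) / (1.0 + np.exp((t - v_infl) / v_sigma))
```

With γ_init = 0.9, γ_infl = 0.33, and γ_σ = 0.1, schedule(t, 0.9, 0.0, 0.33, 0.1) starts near 0.9 and decays smoothly toward 0 as training progresses.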
The blank spaces in between the arms of the spiral aid the efficient visual judgment of whether a roughly topographic map of the sensory surface has formed, and if so, what its orientation is. The spiral labeling pattern is asymmetric so that the relative orientations of any two adjacent maps can be determined unambiguously (without, for example, mistakenly identifying them as a single large map). Each cortical element i is labeled by the input pattern x_j to which that element is most sensitive, that is, j = argmax_k (w_i^T x_k). A topology-preserving or well-formed map of the sensory surface thus shows up on the cortical surface as a projected image of the input surface's spiral pattern. The image may be rotated or slightly distorted or may show a reversal of the spiral's direction from clockwise to counterclockwise, since these transformations do not violate the topology of the sensory surface. 3 Results Given the radius of competition r_comp of 7, multiwinner SOMs of 15 by 15 or fewer cortical nodes (15 = 2 r_comp + 1) are equivalent to the standard single-winner SOM since each cortical node competes with all other cortical
nodes for activation and learning, and hence there is always only a single winner. Figure 2B shows a typical example of the initial disorganized state of an 11 by 11 cortical region’s representation of the sensory surface prior to training that is due to the random initialization of the afferent weights. Figure 2C shows, for the same region, the ordered map representation that was formed by training the network. As expected, when 11 by 11 and 11 by 15 cortical regions were trained, each self-organized into a single topologypreserving map of the sensory surface that covered the entire cortical region (see Figure 2C). Fourteen separate experiments were conducted, each corresponding to a specific cortical region size (R = 11, 15, 20, . . . , 75; C = 11), and for each size cortical region, 20 independent runs were done (independent in the sense that in each run, the network was initialized with different random weights and a different random order was used for the presentation of the input patterns during training). In all simulations prior to learning, the points of the sensory surface were found to be randomly represented over the cortical surface. As can be seen from the left half of Table 1, for a sufficiently small R ≤ 20, our model was essentially equivalent to a standard single-winner SOM, and consequently, only a single map of the sensory surface formed, covering the entire lattice of cortical nodes (as in Figure 2C). In contrast, with R ≥ 25, multiple well-formed maps invariably appeared. Examples of
Figure 2: (A) The input patterns encode representative points (the points of intersection in a grid) on a planar 2D sensory surface. For illustrative purposes, parts of this surface are labeled with an inward clockwise spiral of @ characters of progressively decreasing size. By showing how this spiral is projected onto the cortical surface, one can reliably judge by visual inspection whether a rough topology-preserving map of the sensory surface has formed, and, if so, its orientation. (B) The disorganized pretraining representation of the sensory surface by an 11 by 11 cortical surface (one square cell per cortical element), showing that the label positions (which indicate the locations of the cortical elements’ receptive field centers on the sensory surface) are not arranged in any order due to the initially random weights. A cell’s brightness indicates how similar that cortical node’s weight vector is to the weight vectors of its immediate neighbors. Disorganization here is thus also indicated by the very dark shading and an associated M = .35. (C) As expected with a small cortical region, the posttraining representation is like that seen with a standard single-winner SOM: a single topology-preserving map of the sensory surface that covers the entire region. This is indicated by @ signs showing how the spiral on the sensory surface has become topographically projected onto the cortex, the relatively light shading of all cortical elements (which also shows that the topographic organization indeed extends to the unlabeled regions in between the spiral arms), and a much larger M = .98.
[Figure 2, panels A, B, and C, appears here; the panels are rendered as character graphics in the source and are summarized by the caption above.]
Table 1: Number and Pairwise Symmetries of Learned Maps (20 Runs Each).

         Number of Maps            Pairwise Symmetries*
  R    Mean  Minimum  Maximum      m      g      r
 11    1.00     1        1         -      -      -
 15    1.00     1        1         -      -      -
 20    1.00     1        1         -      -      -
 25    2.00     2        2       1.00    .00    .00
 30    2.00     2        2        .90    .10    .00
 35    2.12     2        3        .63    .37    .00
 40    2.95     2        3        .92    .03    .05
 45    3.00     3        3        .78    .10    .13
 50    3.40     3        4        .73    .19    .08
 55    3.83     3        4        .81    .15    .04
 60    4.00     4        4        .93    .02    .05
 65    4.50     2+       6        .77    .20    .03
 70    5.07     2+       7        .82    .05    .14
 75    5.18     3+       6        .77    .12    .11

*Notes: m = mirror reflection; g = glide reflection; r = 180° rotation; + = unorganized areas also present. Symmetry fractions are defined only where multiple maps formed (R ≥ 25).
multiple map formation are illustrated in Figure 3. In general, the number of maps formed increased in proportion to R: approximately one additional map was formed for each additional increment of R by 15 rows. This suggests that in general, the number of additional rows required to accommodate an additional map roughly equals the "diameter" of competition, that is, the number of rows 2 r_comp + 1 (for R ≥ 2 r_comp + 1) of other cortical nodes with which each cortical node has to compete for activation and learning (e.g., 15 for r_comp = 7). All of the maps in any one instance where multiple maps occurred were generally of the same size. Any two adjacent maps were usually immediately adjacent; that is, there were no lattice parts in between them that were not part of the two adjacent maps, regardless of R. In all simulations producing multiple maps, only three types of symmetries were observed between any two adjacent maps, regardless of the total number of maps present. In the overwhelming majority (82%) of cases, the two adjacent maps were mirror images of each other (see Figure 3A). The second type of symmetry observed, found in 11% of the cases, was again essentially a mirror reflection, but now the axis of reflection was tilted so that the boundary between the two maps was no longer of minimal length (see Figure 3B). In addition, the maps were translated in opposite directions along their common tilted boundary so that the resultant transformation is better characterized as a glide reflection. Thus, in 93% of the cases, adjacent maps exhibited mirror symmetry or distorted mirror symmetry. In the remaining 7% of adjacent map pairs, each individual map was characterized as a rotation relative to the other of 180 degrees around a symmetry
[Figure 3, panels A, B, and C, appears here; the panels are rendered as character graphics in the source and are summarized by the caption below.]
Figure 3: Representative instances where two maps formed in cortical regions during learning (top). A corresponding schematic representation is given (bottom) for each of the three observed types of symmetry between adjacent maps. The spatial organization of the maps is indicated by how a single spiral painted on the sensory surface (see Figure 2A) is replicated and oriented in the map. (A) Mirror symmetry. (B) Glide reflection symmetry (distorted mirror symmetry). (C) Rotational symmetry. In the schematic representations, the thin lines in A and B and the point in C indicate the symmetry axis and the center of rotation, respectively.
point at the center in between the two maps (see Figure 3C). Rotational symmetry was easily identifiable since it was the only type of symmetry that caused the projected spiral patterns in two adjacent maps to follow the same (clockwise or anticlockwise) direction (see Figure 3). The right-most
three columns of Table 1 show the fractions of mirror (m), glide (g), and rotation (r) symmetries between adjacent maps for different lattice sizes R, averaged over 20 independent runs. Map visualizations like those in Figure 3 also revealed that the three symmetry types exhibited distinct patterns of similarity among the cortical elements along an intermap boundary. Cortical nodes along the boundaries between mirror-image maps typically were similar to their neighbors in the lattice; that is, their afferent weight vectors and, thus, their receptive field centers were close to one another. In Figure 3A, this is manifested by the lightly shaded cells along the intermap boundary. In contrast, for glide reflections and rotationally symmetric adjacent maps, dissimilar cortical elements (dark shaded cells) were typically prominent along the intermap boundary. Such dissimilar cortical elements were present all along the boundary between glide reflection symmetric maps (dark shaded boundary elements of rows 13–25 in Figure 3B), while their presence was limited to the outer reaches of the intermap boundary in the rotationally symmetric case (very dark shaded elements on the extreme left and right of rows 18–20 in Figure 3C). There was a weak tendency for the largest fractions of mirror symmetric maps (>80%) to occur when R was a multiple of 15 or slightly smaller (R = 25, 30, 40, 55, 60, 70 in Table 1). Under these conditions, the height of a single map was roughly 15. However, we found several cortical regions on which exclusively mirror symmetric map pairs formed and the number of maps exceeded the expected value because the individual maps were smaller. The single cortical region in Figure 4 provides an example of this, where six somewhat compressed maps formed on the 75 by 11 cortical region. In a small minority of cases, the network did not completely self-organize, and parts of the cortical region remained disorganized after learning.
For example, the entry 2+ for R = 65 in Table 1 indicates that in 1 of the 20 simulations with networks of this size, only two representations of the sensory surface were found, with the rest of the cortical region being disorganized (all other 19 simulations in this case exhibited at least four maps and no disorganized regions). 3.1 Measuring Map Formation and Types of Symmetries. For the 220 simulations with cortical regions sufficiently large for multiple maps to appear (R ≥ 25), the mean initial value of M prior to any learning was 0.31 (SD 0.02, minimum 0.23, maximum 0.38). Following learning, this increased to 0.97 (SD 0.02, minimum 0.87, maximum 0.98). Each cortical region that was in principle large enough for multiple well-formed maps to appear (all 220 runs in Table 1 for which R ≥ 25) was assigned to one of three categories after training. First, if any disorganized region was present, the cortical region was placed in the “?” category (20 runs, or 9%), even if well-formed maps were also present. The remaining fully organized cortical regions were divided into those in category m, where exclusively mirror
[Figure 4 appears here; the maps are rendered as character graphics in the source and are summarized by the caption below.]
Figure 4: A single cortical region on which six maps of the sensory surface appeared; every two adjacent maps are mirror images of each other. For illustrative purposes, the region has been split in the middle, with the top half shown on the left (schematic representation on the right).
symmetric map pairs had formed (138, 62%), and those in category g|r (62, 29%), where at least one glide reflection or rotationally symmetric map pair occurred. Figure 5 shows the distribution of the M values for these three categories of cortical regions. The mean M value was 0.980 (SD 0.002) for category m simulations and a significantly different 0.965 (SD 0.018) for category g|r simulations (p < 10^-3, t-test). On average, the M values were significantly greater for category m (g|r) than for category g|r ("?") (Jonckheere trend test; p < .0001), and the spread of the values was smaller for category m (g|r)
[Figure 5 appears here: a histogram of M values (x-axis: M, from 0.87 to 0.99; y-axis: number of pairs of adjacent maps, 0–80) for the three categories m, g|r, and ?; see the caption below.]
Figure 5: A histogram of the M values for each of the three symmetry categories. The values have been grouped into 16 consecutive intervals of width 0.01. To be comparable across the different-sized categories, each histogram shows, for each interval, the relative within-category frequency with which M fell within the limits of the interval. The histograms suggest that the means µ and standard deviations σ of the actual distributions of M values are ordered so that µ_m > µ_g|r > µ_? and σ_m < σ_g|r < σ_?.
than for category g|r (“?”). Since M primarily measures the organization along map boundaries when multiple maps are present, these results indicate that the same synaptic modifications responsible for individual map formation also tend to maximally preserve similarity of adjacent cortical element receptive fields along map boundaries by producing adjacent maps that are mirror symmetric. In contrast, other symmetry relationships (glide, rotational) are local maxima of M in which the map formation process becomes trapped during learning. Since category g|r lattices also exhibited mirror symmetric map pairs, the differences observed between categories m and g|r most likely would have been even more pronounced if each pair of adjacent individual maps had been manually categorized individually and if M had been measured separately for each pair (this was impractical to do).
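The measure M from section 2.2 (the mean of the smallest 2% of dot products between weight vectors of lattice-adjacent nodes) can be sketched as follows; treating "adjacent" as box-distance 1 is our reading, as the text does not spell it out:

```python
def map_measure(W, R, C, fraction=0.02):
    """M: mean of the smallest `fraction` of all pairwise dot products
    between the (unit-length) weight vectors of adjacent cortical
    nodes. W holds one weight vector per row, indexed row-major over
    the R x C lattice; adjacency = box-distance 1 (our assumption)."""
    dots = []
    for r in range(R):
        for c in range(C):
            i = r * C + c
            # Enumerate each adjacent pair exactly once.
            for dr, dc in ((0, 1), (1, -1), (1, 0), (1, 1)):
                r2, c2 = r + dr, c + dc
                if 0 <= r2 < R and 0 <= c2 < C:
                    dots.append(float(W[i] @ W[r2 * C + c2]))
    dots.sort()
    k = max(1, round(fraction * len(dots)))
    return sum(dots[:k]) / k
```

For a perfectly organized map with identical unit weight vectors, every dot product is 1, so M attains its maximum of 1.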
Symmetric Maps
1073
3.2 Nonuniform Density of Sensory Stimuli. In the simulations described above, where each sensory surface location was stimulated exactly once per epoch, when mirror-image maps formed, there was nothing to bias which of their four edges became adjacent. To see if the selection of which edge pair becomes adjacent during formation of mirror-image maps could be biased, we altered the previously uniform probability distribution of stimuli over the sensory surface during learning. This was done by partitioning the regular 21 by 21 grid of the sensory surface points into three consecutive 21 by 7 subgrids (I, II, and III). Each point in subgrid I was stimulated only once during a single epoch of training, while each point in subgrid II (III) was stimulated twice (three times) during the same period. During training, we used a fixed lattice size of 25 by 11 cortical nodes and coordinate-encoding input patterns, a combination that earlier had produced a pair of mirror-image maps in 19 of 20 runs when uniformly probable input stimuli were used. With uniform sensory stimuli previously, the absolute frequencies of the four relative orientations among these 19 mirror-image map pairs (6, 3, 6, and 4) were consistent with equiprobable adjacent edge selection (χ² = 1.42, p > 0.05). In contrast, of the same 20 independent runs of training that were performed using the nonuniformly distributed sensory stimuli, 19 again produced mirror-image map pairs. However, now 17 of the adjacent map edges corresponded to the single sensory surface edge where stimuli were most probable (absolute frequencies: 2, 0, 0, and 17). This was inconsistent with equiprobable edge selection (χ² = 42.26, p < 0.05). At this preferred orientation, the most frequently stimulated region of the sensory surface usually becomes represented in both maps along the intermap boundary, with the least frequently stimulated region most removed from the intermap boundary.
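The χ² statistics can be checked from the reported orientation frequencies, with an expected count of 19/4 per orientation under equiprobable edge selection. A quick sketch (our computation reproduces the first statistic; the second comes out near, though not exactly at, the reported 42.26, so the paper's exact procedure may differ slightly):

```python
def chi_square_uniform(observed):
    """Pearson chi-square statistic against equiprobable categories."""
    n = sum(observed)
    expected = n / len(observed)
    return sum((o - expected) ** 2 / expected for o in observed)

stat_uniform = chi_square_uniform([6, 3, 6, 4])   # uniform stimuli, ~1.42
stat_biased = chi_square_uniform([2, 0, 0, 17])   # nonuniform stimuli, far larger
```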
Under these conditions, we also examined whether magnification effects occurred. In biological systems, sensory surface regions with higher innervation density or that are more frequently stimulated are typically magnified in cortical maps: their cortical representation occupies a disproportionately large area of cortical surface (Dykes & Ruest, 1984). Past computational models of single maps have reproduced this magnification effect, given inputs having a nonuniform distribution of sensory stimuli (Grajski & Merzenich, 1990). With the nonuniform sensory stimuli described above, we found that the images of the two more frequently stimulated sensory surface regions came to occupy a relatively larger area of the cortical region (at the expense of the third least stimulated sensory region) compared to when the three sensory regions were stimulated equally frequently during training. With the uniformly distributed sensory stimuli (the baseline), the map representations of the sensory regions corresponding to subgrids I, II, and III occupied, on average, 79.0, 104.0, and 91.9 cortical elements (standard deviations 13.0, 3.6, and 13.6, respectively). The somewhat smaller size of regions I and III here is due to edge effects. In comparison, with the nonuniformly distributed sensory stimuli, the same averages for regions
1074
R. Schulz and J. Reggia
I, II, and III were 57.6, 112.0, and 105.4 (standard deviations 4.9, 3.3, and 7.7, respectively). This is a significantly different result (U-test; p < 10⁻⁶, p < 10⁻⁵, and p < .005, respectively, for each region).

3.3 Sensitivity to Model Changes. We examined a number of variations in the parameters and other aspects of our model to determine its robustness, that is, whether qualitatively similar results would be obtained after these variations. For a radius of competition rcomp = 7, we observed the largest fraction of mirror symmetric map pairs for a lattice width of C = 11 (0.82m, 0.11g, 0.07r). Experiments with C < 11 resulted in a relatively larger number of glide reflections (e.g., 0.69m, 0.27g, 0.04r for C = 9). For C > 11, the relative number of rotational symmetries increased (e.g., 0.78m, 0.04g, 0.18r for C = 13, and 0.68m, 0.02g, 0.30r for C = 15). This suggests that for a given radius of competition rcomp, a particular width C of the lattice may be optimal for the formation of mirror symmetric map pairs. The parameters γinit and γinfl, which set γ's initial value and delay its descent, were also important. For γinit = .8 (rather than .9, as in the experiments above), the fraction of mirror symmetric map pairs dropped to typically 60%. Rotation and glide reflection symmetry became more frequent, with each occurring in roughly 20% of the cases. Delaying γ's descent by increasing γinfl to 0.5 increased the number of cases in which self-organization failed partially or completely, so that no well-formed maps were discernible in (parts of) the SOM's lattice. The effects of parameter changes pertaining to γ depend especially on how the learning rate µ changes over time during training. We made the above observations on the effects of changes to γinit and γinfl while µinit = 0.5, µinfl = 0.5, and µsigma = 0.1 were held fixed. Two variations of how islands of activity are determined (see equation 2.2) were also tested.
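The closest-winner rule of equation 2.2 and the summed and capped variants tested here can be sketched on a toy one-dimensional lattice; the lattice layout, distance function, winner positions, and γ value below are illustrative assumptions, not the model's actual configuration.

```python
# Toy comparison of three activity rules on a 1D lattice of cortical
# elements with two winner elements; distances are index differences.

GAMMA = 0.7  # illustrative decay rate for activity islands

def d(i, j):
    return abs(i - j)

def activity_closest(winners, j):
    """Original rule: activity set by the closest winner only."""
    return GAMMA ** min(d(i, j) for i in winners)

def activity_summed(winners, j):
    """First variant: contributions from all winners are added."""
    return sum(GAMMA ** d(i, j) for i in winners)

def activity_capped(winners, j):
    """Second variant: summed activity, capped at 1."""
    return min(1.0, activity_summed(winners, j))

winners = [2, 6]
for j in range(9):
    assert activity_summed(winners, j) >= activity_closest(winners, j)
    assert activity_capped(winners, j) <= 1.0  # cap keeps overlap regions below winner level
```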
The first variant determines the activity of a cortical element by taking into account all winners (as opposed to just the closest one) and adding their activity contributions. Equation 2.2 was replaced by

y_j = Σ_{i∈V} γ^{d(i,j)},    (3.1)
which, in general, increases the activity of cortical nodes that are located in an area of the cortical region where several islands of activation overlap. Given equation 3.1, it is possible that a cortical element in an area of overlap becomes more active than a winner. The second variant prevents this from happening by capping each cortical element’s activation if it exceeds 1. Both variants were somewhat less conducive to the formation of mirror symmetric map pairs than the original equation 2.2. In all of the simulations described, we used a computationally efficient coordinate encoding of sensory stimuli, as is often done with Kohonen maps. To verify that this did not bias our results, we repeated the first
Table 2: Number and Pairwise Symmetries of Learned Maps (10 Runs Each, Full Encoding).

        Number of Maps                Pairwise Symmetries^a
S       Mean    Minimum   Maximum     m       g       r
11      1.00    1         1           —       —       —
15      1.00    1         1           —       —       —
20      1.00    1         1           —       —       —
25      2.00    2         2           1.00    .00     .00
30      2.00    2         2           1.00    .00     .00
35      2.22    2         3           1.00    .00     .00
40      3.00    3         3           1.00    .00     .00
45      3.00    3         3           .95     .00     .05
50      3.40    3         4           .92     .08     .00
55      4.00    2+        4           .94     .03     .03
60      4.00    3+        4           .79     .21     .00
65      4.78    3+        5           .94     .00     .06
70      5.20    5         6           .93     .07     .00
75      5.50    5         6           .82     .09     .09

^a m = mirror reflection. g = glide reflection. r = 180° rotation. + = unorganized areas also present. Pairwise symmetry entries are blank where only a single map formed, so no map pairs exist.
10 runs of each experiment, training the networks with a full encoding of the sensory input patterns. With the latter, each sensory activation pattern is a vector with as many components as the number of sample sensory surface points (441 in our model), making the model computationally much more expensive. With this full encoding, stimulation of a point on the sensory surface was represented as a bell-shaped activation pattern centered on the stimulus location. Specifically, if the stimulation of point p = (p_x, p_y, p_z) on the sensory surface is encoded by x^(p), then x_q^(p), the component of x^(p) corresponding to the activation level at point q = (q_x, q_y, q_z), equals ((p_x q_x + p_y q_y + p_z q_z)²)^{1/2}. This implies that the sensory activation patterns evoked by two separate and independent point stimuli are more similar the smaller the distance on the sensory surface between the two points (this is true also if coordinate encoding is used). The overall results, given in Table 2, were 91% mirror symmetric, 6% glide reflection symmetric, and 3% rotationally symmetric map pairs, indicating a significant increase in the fraction of (distorted or undistorted) mirror symmetric map pairs from 93% to 97% (one-sided χ² test; χ² = 6.71, p < .005). The average number of maps appearing on the cortical lattice was not significantly different at the 0.05 significance level, regardless of the specific lattice size (U statistical test). These results indicate that using coordinate-encoding input patterns did not substantially influence the average number of well-formed maps per cortical region and, more important, did not bias our results in favor of mirror symmetric adjacent maps.
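A minimal sketch of this full encoding follows, assuming unit-length coordinate vectors as in the model; the sample points themselves are illustrative. The activation at surface point q evoked by stimulating point p is |p·q|, which peaks (at 1) at the stimulated point and falls off with distance between the unit vectors.

```python
# Full-encoding activation: the component of the pattern x^(p) at surface
# point q is ((p_x q_x + p_y q_y + p_z q_z)^2)^(1/2), i.e., |p . q| for
# unit vectors p and q.
import math

def normalize(v):
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)

def activation(p, q):
    """Component of the full-encoding pattern x^(p) at surface point q."""
    dot = sum(pc * qc for pc, qc in zip(p, q))
    return math.sqrt(dot ** 2)

p = normalize((0.3, 0.4, 1.0))
near = normalize((0.31, 0.41, 1.0))
far = normalize((1.0, 0.1, 0.2))

assert abs(activation(p, p) - 1.0) < 1e-9        # peak at the stimulated point
assert activation(p, near) > activation(p, far)  # closer points -> more similar patterns
```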
4 Discussion

In this work we have explored the extent to which activity-dependent synaptic changes may play a role in the self-organization and relative orientation of adjacent topographic maps. Our computational model, when trained with input patterns that represent the stimulation of points on a sensory surface, formed multiple topologically correct maps of the sensory surface whenever the distribution radius of cortical afferents sufficiently exceeded that of horizontal intracortical interactions, conditions that may be present during development when thalamocortical afferent projections are more widespread than in adults (Brown et al., 2001; Mountcastle, 1998). Further, the adjacent maps produced in this fashion were largely mirror symmetric with respect to their common boundary, a finding that is consistent with observations of topographic maps and their relative orientations in biological cortex. These results persisted in the face of parameter variations and even a different representation of sensory stimuli as long as basic map self-organization was not disrupted.

In all of our experiments, the topology of the sensory surface (the 2D unit square, projected onto the unit sphere to normalize input vectors to unit length) matched the topology of the modeled cortical surface (2D lattice of cortical elements). If the topology of the sensory surface had been as complex as, for example, the surface of the human body, and thus, more complex than the topology of the SOM's 2D lattice, then the latter would have been unable to form perfectly topology-preserving maps of the entire sensory surface. A standard Kohonen SOM tends to minimize the degree to which the map of the sensory surface violates the topology of the sensory surface (Kohonen, 2001). However, it is an open question how a mismatch in topology would influence multiple map formation and their relative orientations in the multiwinner case.
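The role of local rather than global competition can be illustrated with a deliberately simplified one-dimensional sketch: an element wins if no other element within the competition radius responds more strongly. The response values and radius below are illustrative assumptions, not the paper's actual multiwinner mechanism (which operates on a 2D lattice with activity islands).

```python
# Simplified multi-winner selection: a cortical element is a winner if its
# response is the maximum within a local competition radius r_comp.

def local_winners(responses, r_comp):
    winners = []
    for j, r in enumerate(responses):
        lo, hi = max(0, j - r_comp), min(len(responses), j + r_comp + 1)
        if r == max(responses[lo:hi]):
            winners.append(j)
    return winners

responses = [0.1, 0.9, 0.3, 0.2, 0.8, 0.1, 0.4]

# With a small radius, two separate local competitions each yield a winner,
# seeding two maps; a global competition (large radius) yields one winner.
assert local_winners(responses, 2) == [1, 4]
assert local_winners(responses, 10) == [1]
```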
Mirror-image and distorted mirror-image maps in our model arose due to the natural tendency of competitive Hebbian learning to make adjacent cortical elements represent neighboring regions of the input surface while ensuring that the entire sensory surface is represented. A single topographic map of the sensory surface achieves both of these ends, but its formation requires a single global competition. In contrast, which biological cortical element responds most to a particular input is presumably known only locally. In our model, the multiple “winner elements” that arise due to simulating this locality result in the formation of multiple individually topology-preserving maps plus the optimization of the transitions between them. As illustrated in Figure 6A, which is pictured in the input space, this optimal transition corresponds to a fold in the cortical region’s lattice when projected onto the input space, the fold being perpendicular to the map edges and leading to mirror symmetric adjacent maps. The other two types of symmetry that were observed constitute suboptimal transitions from one
Figure 6: Various ways in which the model cortical region became embedded in the input space, a two-dimensional sensory surface that is shown as a light gray planar area. Cortical elements (small black rectangles) are connected by a solid line iff they are adjacent in the cortical region. Each case involves two roughly topographic maps of the sensory surface that have been spatially separated along the vertical axis to illustrate the projection of the cortical region in the input space. For illustrative purposes, the cortical regions shown here are much smaller than those used in the actual experiments. (A) Mirror symmetric maps (compare to Figure 3A). The cortical region becomes folded in the process of self-organization with the fold perpendicular to the longer sides of the cortical region, keeping connected cortical elements along the intermap boundary close in the input space. (B) Glide reflection symmetric maps (compare to Figure 3B). The cortical region contains a diagonal fold, and the cortical elements along the fold line (separated from the two maps along the z-axis) have been forced to move counterclockwise (clockwise) in the input space around the top (bottom) map. Many neighboring cortical elements along the fold line become widely separated in the input space (such nodes are gray and connected by thick gray lines). (C) Rotationally symmetric maps (compare to Figure 3C). The cortical region has been twisted in addition to being folded in the same manner as in A. In this case, adjacent cortical elements along the fold line and close to the edges of the cortical region become very widely separated in the input space.
map to the next. A glide reflection symmetry is the visible expression of a diagonal fold in the lattice combined with a shearing motion along the fold (see Figure 6B), which causes the entire diagonal intermap boundary to be mildly nonoptimal. Rotation symmetry indicates a combination of a perpendicular fold and a twist of the lattice (see Figure 6C) so that only the central region of the intermap boundary retains its optimality; near the edges of the cortical region, the boundary is highly nonoptimal. The key
1078
R. Schulz and J. Reggia
point illustrated by Figure 6 is that in the latter two cases, cortical elements near the map boundary become removed from one another in the input space, rendering these two transitions suboptimal and hence significantly less likely to occur.

We also observed in our simulations that when the sensory surface was subdivided into regions, some being stimulated more often during training, two adjacent maps, in addition to being mirror symmetric, were almost always oriented in such a way that their representations of the most often stimulated region were located next to each other at the intermap boundary. The other regions were represented farther away from the boundary, following the gradient in the frequency of stimulation. The more often stimulated regions were represented by a relatively larger area of modeled cortical surface in each of the individual maps.

Our finding that two adjacent maps produced by activity-dependent synaptic changes will usually be mirror images of one another is complementary to and consistent with the prevalent notion that activity-independent genetic factors initially determine cortical arealization and affect targeting of thalamocortical afferents. It raises the question of how genetic and activity-dependent synaptic changes might interact during development and even during evolution, as it seems improbable that evolutionary processes would hard-wire adjacent cortical maps to be mirror images so often unless there was some advantage to this arrangement (such as consistency with local synaptic plasticity). Our model also makes two specific, testable predictions that may or may not relate to biological cortical maps. First, when adjacent mirror image topographic maps occur in neocortex, their common edge should represent the region of sensory surface that develops and innervates first (i.e., that has the most frequent stimuli initially during map development).
This is consistent with, for example, the otherwise surprising location of fingers and toes in biological neocortex far from the symmetry axis in mirror image hand and foot representations in S1 (see Figure 1A), as these distal digits appear late during development (Gilbert, 1994; Lonai, 1996). Second, adjacent maps may occasionally exhibit a very different rotational symmetry. If such previously unreported rotationally symmetric maps are ever observed experimentally in a small percentage of currently known cortical map regions, they would provide very strong support for our model. Such atypically oriented adjacent maps, in the context of normal connectivity between cortical regions, would be expected to cause abnormal cortical information processing, and we speculate that they might account for some of the cognitive deficits and functional imaging changes observed in neurodevelopmental disorders such as dyslexia or autism (Frank & Pavlakis, 2001; Papanicolaou et al., 2003; Temple et al., 2003). We believe that the rarity of such atypically oriented adjacent maps and the very limited experimental data on human maps may explain why they have not been reported experimentally.
In our model, each “competition” covers roughly as much cortical area as a single topographic map. This raises the question of whether lateral interactions in biological primary sensory cortex can span comparable distances. While most direct lateral interactions in adults are more localized than this, long-range lateral interactions have been reported in cortex. For example, in primary visual cortex (V1), specifically in the context of understanding the mechanisms that underlie the contextual modulation of a neuron's receptive field, a range of interaction exceeding 13 degrees of the visual field has been described (Angelucci et al., 2002; Bringuier, Chavane, Glaeser, & Frégnac, 1999; Grinvald, Lieke, Frostig, & Hildesheim, 1994; Kitano, Niiyama, Kasamatsu, Sutter, & Norcia, 1994; Levitt & Lund, 2002). In adult macaque monkeys, this roughly corresponds to a distance in V1 of 20 mm. While this figure falls short of a lateral interaction range spanning all of V1 in the adult (macaque V1 is roughly an ellipsoid of 25 mm by 50 mm when flattened; Felleman and van Essen, 1991), the relative range may be longer early during development when map formation is occurring (Brown et al., 2001). In addition, the measurements often constitute lower bounds on the maximum range, and they take into account only specific types of lateral interaction (facilitation and surround inhibition). Thus, the range of competition in our model is not necessarily inconsistent with biological reality.

Finally, we note that our one-shot model of map formation is computationally efficient, like Kohonen's widely used model, yet at the same time exhibits more properties that are biologically plausible (e.g., multiwinner representations, multiple maps, mirror-image adjacent maps). This combination of properties may be of interest to future builders of very large-scale neural models where computational cost is a critical issue and therefore the use of iterative multiwinner SOMs becomes problematic.
Some of our model’s computational properties may also prove valuable in traditional application domains of the single-winner Kohonen SOM. The property of redundant map formation could be exploited to increase the fault tolerance and, thus, the robustness of an existing single-winner SOM-based system. Data visualization, one of the main uses for single-winner SOMs, may benefit from the property that the relative orientation of adjacent mirror maps depends on the distribution of inputs, adding a visual clue that can help identify, for example, gradients in the input distribution. Exploration of these ideas is an important possible direction for future research.
Acknowledgments This work was supported by NIH award NS35466. We thank C. Lynne D’Autrechy, Mary Howard, and Scott Weems for constructive comments.
References

Allman, J., & Kaas, J. (1971). A representation of the visual field in the caudal third of the middle temporal gyrus of the owl monkey. Brain Research, 31, 85–105.
Angelucci, A., Levitt, J., Walton, E., Hupe, J., Bullier, J., & Lund, J. (2002). Circuits for local and global signal integration in primary visual cortex. J. Neuroscience, 22(19), 8633–8646.
Bauer, H.-U. (1995). Development of oriented ocular dominance bands as a consequence of areal geometry. Neural Computation, 7(1), 36–50.
Bauer, H.-U., & Pawelzik, K. (1992). Quantifying the neighborhood preservation of self-organizing feature maps. IEEE Transactions on Neural Networks, 3(4), 570–579.
Beck, P., Pospichal, M., & Kaas, J. (1996). Topography, architecture, and connections of somatosensory cortex in opossums: Evidence for five somatosensory areas. J. Comparative Neurology, 1(366), 109–133.
Bringuier, V., Chavane, F., Glaeser, L., & Frégnac, Y. (1999). Horizontal propagation of visual activity in the synaptic integration field of area 17 neurons. Science, 283, 695–699.
Brown, M., Keynes, R., & Lumsden, A. (2001). The developing brain. New York: Oxford University Press.
Cho, S., & Reggia, J. (1994). Map formation in proprioceptive cortex. Int. J. Neural Systems, 5(2), 87–101.
Cohen-Cory, S. (2002). The developing synapse: Construction and modulation of synaptic structures and circuits. Science, 298, 770–776.
Cowey, A. (1981). Why are there so many visual areas? In F. Schmitt (Ed.), The organization of the cerebral cortex (pp. 395–413). Cambridge, MA: MIT Press.
Donoghue, J., Leibovic, S., & Sanes, J. (1992). Organization of the forelimb area in squirrel monkey motor cortex. Experimental Brain Research, 89, 1–19.
Drager, U. (1975). Receptive fields of single cells and topography in mouse visual cortex. J. Comparative Neurology, 160, 269–290.
Dykes, R., & Ruest, A. (1984). What makes a map in somatosensory cortex? In E. Jones & A. Peters (Eds.), Cerebral cortex (Vol. 5, pp. 1–29).
New York: Plenum Press.
Engelien, A., Yang, Y., Engelien, W., Zonana, J., Stern, E., & Silbersweig, D. (2002). Physiological mapping of human auditory cortices with a silent event-related fMRI technique. Neuroimage, 4(16), 944–953.
Felleman, D., & van Essen, D. (1991). Distributed hierarchical processing in the primate cerebral cortex. Cerebral Cortex, 1(1), 1–47.
Formisano, E., Kim, D., & Salle, F. (2003). Mirror-symmetric tonotopic maps in human primary auditory cortex. Neuron, 40, 859–869.
Frank, Y., & Pavlakis, S. (2001). Brain imaging in neurobehavioral disorders. Pediatr. Neurol., 25, 278–287.
Gentilucci, M., Fogassi, L., Luppino, G., Matelli, M., Camarda, R., & Rizzolatti, G. (1989). Somatotopic representation in inferior area 6 of the macaque monkey. Brain, Behavior and Evolution, 2–3(33), 118–121.
Georgopoulos, A., Kettner, R., & Schwartz, A. (1988). Primate motor cortex and free arm movements to visual targets in three-dimensional space. II. Coding of the directions of movement by a neural population. J. Neuroscience, 8, 2928–2937.
Gilbert, S. (1994). Developmental biology. Sunderland, MA: Sinauer.
Grajski, K., & Merzenich, M. (1990). Neural network simulation of somatosensory representational plasticity. In D. Touretzky (Ed.), Advances in neural information processing systems, 2 (pp. 52–59). San Mateo, CA: Morgan Kaufmann.
Grinvald, A., Lieke, E., Frostig, R., & Hildesheim, R. (1994). Cortical point-spread function and long-range lateral interactions revealed by real-time optical imaging of macaque monkey primary visual cortex. J. Neuroscience, 14(5), 2545–2568.
Grove, E., & Tomomi, F. (2003). Generating the cerebral cortical area map. Annual Review Neuroscience, 26, 355–380.
Imig, T., Reale, R., & Brugge, J. (1986). Topography of cortico-cortical connections related to tonotopic and binaural maps. In F. Lepore (Ed.), Two hemispheres—One brain (pp. 103–115). New York: Alan Liss.
Jones, E. (1990). Modulatory events in the development and evolution of primate neocortex. In A. Peters (Ed.), Cerebral cortex (pp. 311–351). New York: Plenum.
Kaas, J. (1988). Why does the brain have so many cortical areas? J. Cognitive Neuroscience, 1, 121–134.
Kitano, M., Niiyama, K., Kasamatsu, T., Sutter, E., & Norcia, A. (1994). Retinotopic and nonretinotopic field potentials in cat visual cortex. Visual Neuroscience, 11, 953–977.
Kohonen, T. (1989). Self-organization and associative memory (3rd ed.). Berlin: Springer.
Kohonen, T. (2001). Self-organizing maps (3rd ed.). Berlin: Springer.
Krubitzer, L. (1995). The organization of neocortex in mammals. Trends in Neuroscience, 18, 408–417.
Krubitzer, L., & Calford, M. (1992). Five topographically organized fields in the somatosensory cortex of the flying fox: Microelectrode maps, myeloarchitecture, and cortical modules. J. Comparative Neurology, 1(317), 1–30.
Krubitzer, L., Clarey, J., Tweedale, R., Elston, G., & Calford, M. (1995). A redefinition of somatosensory areas in the lateral sulcus of macaque monkeys. J. Neuroscience, 5(15), 3821–3839.
Levitt, J., & Lund, J. (2002). The spatial extent over which neurons in macaque striate cortex pool visual signals. Visual Neuroscience, 19(4), 439–452.
Levitt, P. (2000). Molecular determinants of regionalization of the forebrain and cerebral cortex. In M. Gazzaniga (Ed.), The new cognitive neurosciences (pp. 23–43). Cambridge, MA: MIT Press.
Lonai, P. (1996). Mammalian development. London: Harwood.
Martinetz, T., Ritter, H., & Schulten, K. (1989). Kohonen's self-organizing map for modeling the formation of the auditory cortex of a bat. In R. Pfeifer, Z. Schreter, F. Fogelman-Soulié, & L. Steels (Eds.), Connectionism in perspective (pp. 403–412). Amsterdam: North-Holland.
Merzenich, M., Kaas, J., Sur, M., & Lin, C. (1978). Double representation of the body surface within cytoarchitectonic areas 3b and 1 in “S1” in the owl monkey (Aotus trivirgatus). J. Comparative Neurology, 181(1), 41–73.
Mountcastle, V. (1998). The cerebral cortex. Cambridge, MA: Harvard University Press.
Nelson, R., Sur, M., Felleman, D., & Kaas, J. (1980). Representations of the body surface in postcentral parietal cortex of Macaca fascicularis. J. Comparative Neurology, 4(192), 611–643.
Newsome, W., Maunsell, J., & van Essen, D. (1986). Ventral posterior visual area of the macaque: Visual topography and areal boundaries. J. Comparative Neurology, 2(252), 139–153.
Obermayer, K., Blasdel, G., & Schulten, K. (1992). A statistical mechanical analysis of self-organization and pattern formation during the development of visual maps. Physical Review A, 45(10), 7568–7589.
Obermayer, K., Ritter, H., & Schulten, K. (1990). A principle for the formation of the spatial structure of cortical feature maps. Proc. National Academy of Sciences USA, 87, 8345–8349.
Obermayer, K., Schulten, K., & Blasdel, G. (1992). A comparison of a neural network model for the formation of brain maps with experimental data. In J. Moody, S. Hanson, & R. Lippmann (Eds.), Advances in neural information processing systems, 4 (pp. 83–90). San Mateo, CA: Morgan Kaufmann.
Palakal, M., Murthy, U., Chittajallu, S., & Wong, D. (1995). Tonotopic representation of auditory responses using self-organizing maps. Mathematical and Computer Modelling, 22(2), 7–21.
Pantev, C., Bertrand, O., Eulitz, C., Verkindt, C., Hampson, S., Schuierer, G., & Elbert, T. (1995). Specific tonotopic organizations of different areas of the human auditory cortex revealed by simultaneous magnetic and electric recordings. Electroencephalography and Clinical Neurophysiology, 1(94), 26–40.
Papanicolaou, A., Simos, P., Breier, J., Fletcher, J., Foorman, B., Francis, D., Costillo, E., & Davis, R. (2003). Brain mechanisms for reading in children with and without dyslexia. Dev. Neuropsychol., 24, 593–612.
Pearson, J., Finkel, L., & Edelman, G. (1987). Plasticity in the organization of adult cerebral cortical maps: A computer simulation based on neuronal group selection. J. Neuroscience, 7, 4209–4223.
Pei, X., Vidyasagar, T., Volgushev, M., & Creutzfeldt, O. (1994). Receptive field analysis and orientation selectivity of postsynaptic potentials of simple cells in cat visual cortex. J. Neuroscience, 11(14), 7130–7140.
Reggia, J., D'Autrechy, C., Sutton, G., & Weinrich, M. (1992). A competitive distribution theory of neocortical dynamics. Neural Computation, 4, 287–317.
Ritter, H., Martinetz, T., & Schulten, K. (Eds.). (1992). Neural computation and self-organizing maps. Reading, MA: Addison-Wesley.
Ritter, H., & Schulten, K. (1986). On the stationary state of Kohonen's self-organizing sensory mapping. Biological Cybernetics, 54, 99–106.
Schulz, R., & Reggia, J. (2004). Temporally asymmetric learning supports sequence processing in multi-winner self-organizing maps. Neural Computation, 16(3), 535–561.
Sereno, M., Dale, A., Reppas, J., Kwong, K., Belliveau, J., Brady, T., Rosen, B., & Tootell, R. (1995). Borders of multiple visual areas in humans revealed by functional magnetic resonance imaging. Science, 268(5212), 889–893.
Sirosh, J., & Miikkulainen, R. (1994). Cooperative self-organization of afferent and lateral connections in cortical maps. Biological Cybernetics, 71, 65–78.
Sur, M., & Leamey, C. (2001). Development and plasticity of cortical areas and networks. Nature Reviews Neuroscience, 2, 251–262.
Sur, M., Nelson, R., & Kaas, J. (1982). Representations of the body surface in cortical areas 3b and 1 of squirrel monkeys: Comparisons with other primates. J. Comparative Neurology, 2(211), 177–192.
Sutton, G., Reggia, J., Armentrout, S., & D'Autrechy, C. (1994). Cortical map reorganization as a competitive process. Neural Computation, 6, 1–13.
Talavage, T., Ledden, P., & Benson, R. (2000). Frequency-dependent responses exhibited by multiple regions in human auditory cortex. Hearing Research, 150, 225–244.
Temple, E., Deutsch, G., Poldrack, R., Miller, S., Tallal, P., Merzenich, M., & Gabrieli, J. (2003). Neural deficits in children with dyslexia ameliorated by behavioral remediation. Proc. Nat. Acad. Sci., 100, 2860–2865.
Tiao, Y., & Blakemore, C. (1976). Functional organization in the visual cortex of the golden hamster. J. Comparative Neurology, 4(168), 459–481.
Villmann, T., Der, R., Herrmann, M., & Martinetz, T. (1997). Topology preservation in self-organizing feature maps: Exact definition and measurement. IEEE Transactions on Neural Networks, 8(2), 256–266.
von der Malsburg, C. (1973). Self-organization of orientation sensitive cells in the striate cortex. Kybernetik, 14, 85–100.
Zhou, R., & Black, I. (2000). Development of neural maps. In M. Gazzaniga (Ed.), The new cognitive neurosciences (pp. 213–236). Cambridge, MA: MIT Press.
Received June 11, 2004; accepted October 14, 2004.
LETTER
Communicated by Tamar Flash
Stochastic Optimal Control and Estimation Methods Adapted to the Noise Characteristics of the Sensorimotor System Emanuel Todorov
[email protected] Department of Cognitive Science, University of California San Diego, La Jolla CA 92093-0515.
Optimality principles of biological movement are conceptually appealing and straightforward to formulate. Testing them empirically, however, requires the solution to stochastic optimal control and estimation problems for reasonably realistic models of the motor task and the sensorimotor periphery. Recent studies have highlighted the importance of incorporating biologically plausible noise into such models. Here we extend the linear-quadratic-gaussian framework—currently the only framework where such problems can be solved efficiently—to include control-dependent, state-dependent, and internal noise. Under this extended noise model, we derive a coordinate-descent algorithm guaranteed to converge to a feedback control law and a nonadaptive linear estimator optimal with respect to each other. Numerical simulations indicate that convergence is exponential, local minima do not exist, and the restriction to nonadaptive linear estimators has negligible effects in the control problems of interest. The application of the algorithm is illustrated in the context of reaching movements. A Matlab implementation is available at www.cogsci.ucsd.edu/~todorov.

1 Introduction

Many theories in the physical sciences are expressed in terms of optimality principles, which often provide the most compact description of the laws governing a system's behavior. Such principles play an important role in the field of sensorimotor control as well (Todorov, 2004). A quantitative theory of sensorimotor control requires a precise definition of success in the form of a scalar cost function. By combining top-down reasoning with intuitions derived from empirical observations, researchers have proposed a number of hypothetical cost functions for biological movement.
While such hypotheses are not difficult to formulate, comparing their predictions to experimental data is complicated by the fact that the predictions have to be derived in the first place—that is, the hypothetical optimal control and estimation problems have to be solved. The most popular approach has been to optimize, in an open loop, the sequence of control signals (Chow & Jacobson,

Neural Computation 17, 1084–1108 (2005)
© 2005 Massachusetts Institute of Technology
Methods for Optimal Sensorimotor Control
1085
1971; Hatze & Buys, 1977; Anderson & Pandy, 2001) or limb states (Nelson, 1983; Flash & Hogan, 1985; Uno, Kawato, & Suzuki, 1989; Harris & Wolpert, 1998). For stochastic partially observable plants such as the musculoskeletal system, however, open-loop approaches yield suboptimal performance (Todorov & Jordan, 2002b; Todorov, 2004). Optimal performance can be achieved only by a feedback control law, which uses all sensory data available online to compute the most appropriate muscle activations under the circumstances. Optimization in the space of feedback control laws is studied in the related fields of stochastic optimal control, dynamic programming, and reinforcement learning. Despite many advances, the general-purpose methods that are guaranteed to converge in a reasonable amount of time to a reasonable answer remain limited to discrete state and action spaces (Bertsekas & Tsitsiklis, 1997; Sutton & Barto, 1998; Kushner & Dupuis, 2001). Discretization methods are well suited for higher-level control problems, such as the problem faced by a rat that has to choose which way to turn in a two-dimensional maze. But the main focus in sensorimotor control is on a different level of analysis: on how the rat chooses a hundred or so graded muscle activations at each point in time, in a way that causes its body to move toward the reward without falling or hitting walls. Even when the musculoskeletal system is idealized and simplified, the state and action spaces of interest remain continuous and high-dimensional, and the curse of dimensionality prevents the use of discretization methods. Generalizations of these methods to continuous high-dimensional spaces typically involve function approximations whose properties are not yet well understood. Such approximations can produce good enough solutions, which is often acceptable in engineering applications.
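By contrast, classical linear-quadratic problems admit an exact and cheap solution via the backward Riccati recursion, which is what makes the LQG framework computationally attractive. The sketch below runs the recursion for a toy discrete-time double integrator; the dynamics and cost weights are illustrative, and additive noise is omitted because it leaves the optimal LQG feedback gains unchanged (certainty equivalence).

```python
# Finite-horizon discrete-time LQR via the backward Riccati recursion:
#   K_t = (R + B' S_{t+1} B)^-1 B' S_{t+1} A
#   S_t = Q + A' S_{t+1} (A - B K_t)
import numpy as np

A = np.array([[1.0, 1.0], [0.0, 1.0]])  # toy double-integrator dynamics
B = np.array([[0.0], [1.0]])
Q = np.eye(2)                           # state cost
R = np.array([[1.0]])                   # control cost

S = Q.copy()
for _ in range(200):  # backward pass; S converges for this stabilizable plant
    K = np.linalg.solve(R + B.T @ S @ B, B.T @ S @ A)
    S = Q + A.T @ S @ (A - B @ K)

# The resulting feedback law u = -K x stabilizes the plant.
assert max(abs(np.linalg.eigvals(A - B @ K))) < 1.0
assert np.allclose(S, S.T, atol=1e-8)  # cost-to-go matrix stays symmetric
```

Each backward step costs only a few small matrix operations, so horizons of thousands of steps are trivial; the extensions discussed in this article preserve this structure while relaxing the noise model.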
However, the success of a theory of sensorimotor control ultimately depends on its ability to explain data in a principled manner. Unless the theory’s predictions are close to the globally optimal solution of the hypothetical control problem, it is difficult to determine whether the (mis)match to experimental data is due to the general (in)applicability of optimality ideas to biological movement, or the (in)appropriateness of the specific cost function, or the specific approximations—in both the plant model and the controller design—used to derive the predictions. Accelerated progress will require efficient and well-understood methods for optimal feedback control of stochastic, partially observable, continuous, nonstationary, and high-dimensional systems. The only framework that currently provides such methods is linear-quadratic-gaussian (LQG) control, which has been used to model biological systems subject to sensory and motor uncertainty (Loeb, Levine, & He, 1990; Hoff, 1992; Kuo, 1995). While optimal solutions can be obtained efficiently within the LQG setting (via Riccati equations), this computational efficiency comes at the price of reduced biological realism, because (1) musculoskeletal dynamics are generally nonlinear, (2) behaviorally relevant performance criteria are
unlikely to be globally quadratic (Kording & Wolpert, 2004), and (3) noise in the sensorimotor apparatus is not additive but signal-dependent. The third limitation is particularly problematic because it is becoming increasingly clear that many robust and extensively studied phenomena—such as trajectory smoothness, speed-accuracy trade-offs, task-dependent impedance, structured motor variability and synergistic control, and cosine tuning— are linked to the signal-dependent nature of sensorimotor noise (Harris & Wolpert, 1998; Todorov, 2002; Todorov & Jordan, 2002b). It is thus desirable to extend the LQG setting as much as possible and adapt it to the online control and estimation problems that the nervous system faces. Indeed, extensions are possible in each of the three directions listed above: 1. Nonlinear dynamics (and nonquadratic costs) can be approximated in the vicinity of the expected trajectory generated by an existing controller. One can then apply modified LQG methodology to the approximate problem and use it to improve the existing controller iteratively. Differential dynamic programming (Jacobson & Mayne, 1970), as well as iterative LQG methods (Li & Todorov, 2004; Todorov & Li, 2004), are based on this general idea. In their present form, most such methods assume deterministic dynamics, but stochastic extensions are possible (Todorov & Li, 2004). 2. Quadratic costs can be replaced with a parametric family of exponential-of-quadratic costs, for which optimal LQG-like solutions can be obtained efficiently (Whittle, 1990; Bensoussan, 1992). The controllers that are optimal for such costs range from risk averse (i.e., robust), through classic LQG, to risk seeking. This extended family of cost functions has not yet been explored in the context of biological movement. 3. 
Additive gaussian noise in the plant dynamics can be replaced with multiplicative noise, which is still gaussian but has standard deviation proportional to the magnitude of the control signals or state variables. When the state of the plant is fully observable, optimal LQG-like solutions can be computed efficiently, as shown by several authors (Kleinman, 1969; McLane, 1971; Willems & Willems, 1976; Bensoussan, 1992; El Ghaoui, 1995; Beghi & D’Alessandro, 1998; Rami, Chen, & Moore, 2001). Such methodology has also been used to model reaching movements (Hoff, 1992). Most relevant to the study of sensorimotor control, however, is the partially observable case, which remains an open problem. While some work along these lines has been done (Pakshin, 1978; Phillis, 1985), it has not produced reliable algorithms that one can use off the shelf in building biologically relevant models (see section 9). Our goal here is to address that problem, and provide the model-building methodology that is needed.
Table 1: List of Notation.

x_t ∈ R^m                      state vector at time step t
u_t ∈ R^p                      control signal
y_t ∈ R^k                      sensory observation
n                              total number of time steps
A, B, H                        system dynamics and observation matrices
ξ_t, ω_t, ε_t, ϵ_t, η_t        zero-mean noise terms
Ω^ξ, Ω^ω, Ω^ε, Ω^ϵ, Ω^η        covariances of noise terms
C_1, ..., C_c                  scaling matrices for control-dependent system noise
D_1, ..., D_d                  scaling matrices for state-dependent observation noise
Q_t, R                         matrices defining state- and control-dependent costs
x̂_t                            state estimate
e_t                            estimation error
Σ_t                            conditional estimation error covariance
Σ^x̂_t, Σ^e_t, Σ^x̂e_t           unconditional covariances
v_t                            optimal cost-to-go function
S^x_t, S^e_t, s_t              parameters of the optimal cost-to-go function
K_t                            filter gain matrices
L_t                            control gain matrices
In this letter, we define an extended noise model that reflects the properties of the sensorimotor system; derive an efficient algorithm for solving the stochastic optimal control and estimation problems under that noise model; illustrate the application of this extended LQG methodology in the context of reaching movements; and study the properties of the new algorithm through extensive numerical simulations. A special case of the algorithm derived here has already allowed us (Todorov & Jordan, 2002b) to construct models of a wider range of empirical results than previously possible. In section 2 we motivate our extended noise model, which includes control-dependent, state-dependent, and internal estimation noise. In section 3 we formalize the problem and restrict the feedback control laws under consideration to functions of state estimates that are obtained by unbiased nonadaptive linear filters. In section 4 we compute the optimal feedback control law for any nonadaptive linear filter and show that it is linear in the state estimate. In section 5 we derive the optimal nonadaptive linear filter for any linear control law. The two results together provide an iterative coordinate-descent algorithm (equations 4.2 and 5.2), which is guaranteed to converge to a filter and a control law optimal with respect to each other. In section 6 we illustrate the application of our method to the analysis of reaching movements. In section 7 we explore numerically the convergence properties of the algorithm and observe exponential convergence with no local minima. In section 8 we assess the effects of assuming a nonadaptive linear filter and find them to be negligible for the control problems of interest. Table 1 shows the notation used in this letter.
2 Noise Characteristics of the Sensorimotor System

Noise in the motor output is not additive but instead increases with the magnitude of the control signals. This is intuitively obvious: if you rest your arm on the table, it does not bounce around (i.e., the passive plant dynamics have little noise), but when you make a movement (i.e., generate control signals), the outcome is not always as desired. Quantitatively, the relationship between motor noise and control magnitude is surprisingly simple. Such noise has been found to be multiplicative: the standard deviation of muscle force is well fit with a linear function of the mean force, in both static (Sutton & Sykes, 1967; Todorov, 2002) and dynamic (Schmidt, Zelaznick, Hawkins, Frank, & Quinn, 1979) isometric force tasks. The exact reasons for this dependence are not entirely clear, although it can be explained at least in part with Poisson noise on the neural level combined with Henneman’s size principle of motoneuron recruitment (Jones, Hamilton, & Wolpert, 2002). To formalize the empirically established dependence, let u be a vector of control signals (corresponding to the muscle activation levels that the nervous system attempts to set) and ε be a vector of zero-mean random numbers. A general multiplicative noise model takes the form C(u)ε, where C(u) is a matrix whose elements depend linearly on u. To express a linear relationship between a vector u and a matrix C, we make the ith column of C equal to C_i u, where the C_i are constant scaling matrices. Then we have C(u)ε = Σ_i C_i u ε_i, where ε_i is the ith component of the random vector ε. Online movement control relies on feedback from a variety of sensory modalities, with vision and proprioception typically playing the dominant role. Visual noise obviously depends on the retinal position of the objects of interest and increases with distance away from the fovea (i.e., eccentricity).
The accuracy of visual positional estimates is again surprisingly well modeled with multiplicative noise, whose standard deviation is proportional to eccentricity. This is an instantiation of Weber’s law and has been found to be quite robust in a variety of interval discrimination experiments (Burbeck & Yap, 1990; Whitaker & Latham, 1997). We have also confirmed this scaling law in a visuomotor setting, where subjects pointed to memorized targets presented in the visual periphery (Todorov, 1998). Such results motivate the use of a multiplicative observation noise model of the form D(x)ϵ = Σ_i D_i x ϵ_i, where x is the state of the plant and environment, including the current fixation point and the positions and velocities of relevant objects. Incorporating state-dependent noise in analyses of sensorimotor control can allow more accurate modeling of the effects of feedback and various experimental perturbations; it can also effectively induce a cost function over eye movement patterns and allow us to predict the eye movements that would result in optimal hand performance (Todorov, 1998). Note that if other forms of state-dependent sensory noise are found, the model can still be useful as a linear approximation.
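As a sanity check on this noise parameterization, the sketch below (toy matrices and dimensions of my own choosing, not taken from the paper) samples the control-dependent noise C(u)ε = Σ_i C_i u ε_i and confirms empirically that its covariance is Σ_i C_i u u^T C_i^T.

```python
import numpy as np

rng = np.random.default_rng(0)
p, m, c = 2, 3, 2                    # control, state, noise dims (made up)
C = [0.1 * rng.standard_normal((m, p)) for _ in range(c)]
u = rng.standard_normal(p)

# Each draw is sum_i eps_i * (C_i u) with eps_i ~ N(0, 1); stacking the
# vectors C_i u lets us generate all samples in one matrix product.
V = np.stack([Ci @ u for Ci in C])           # c x m, rows are C_i u
eps = rng.standard_normal((200_000, c))
samples = eps @ V                            # each row: one noise draw

cov_true = sum(np.outer(Ci @ u, Ci @ u) for Ci in C)   # sum_i C_i u u' C_i'
cov_emp = samples.T @ samples / len(samples)
print(np.max(np.abs(cov_emp - cov_true)))    # close to 0
```

The same construction with D_i and x in place of C_i and u gives the state-dependent observation noise used below.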
Intelligent control of a partially observable stochastic plant requires a feedback control law, which is typically a function of a state estimate that is computed recursively over time. In engineering applications, the estimation-control loop is implemented in a noiseless digital computer, and so all noise is external. In models of biological movement, we usually make the same assumption, treating all noise as being a property of the musculoskeletal plant or the sensory apparatus. This is in principle unrealistic, because neural representations are likely subject to internal fluctuations that do not arise in the periphery. It is also unrealistic in modeling practice. An ideal observer model predicts that the estimation error covariance of any stationary feature of the environment will asymptote to 0. In particular, such models predict that if we view a stationary object in the visual periphery long enough, we should eventually know exactly where it is and be able to reach for it as accurately as if it were at the center of fixation. This contradicts our intuition as well as experimental data. Both interval discrimination experiments and experiments on reaching to remembered peripheral targets indicate that estimation errors asymptote rather quickly, but not to 0. Instead, the asymptote level depends linearly on eccentricity. The simplest way to model this is to assume another noise process, which we call internal noise, acting directly on whatever state estimate the nervous system chooses to compute.

3 Problem Statement and Assumptions

Consider a linear dynamical system with state x_t ∈ R^m, control u_t ∈ R^p, and feedback y_t ∈ R^k, in discrete time t:

Dynamics        x_{t+1} = A x_t + B u_t + ξ_t + Σ_{i=1}^{c} ε_t^i C_i u_t
Feedback        y_t = H x_t + ω_t + Σ_{i=1}^{d} ϵ_t^i D_i x_t
Cost per step   x_t^T Q_t x_t + u_t^T R u_t        (3.1)
The feedback signal y_t is received after the control signal u_t has been generated. The initial state has known mean x̂_1 and covariance Σ_1. All matrices are known and have compatible dimensions; making them time varying is straightforward. The control cost matrix R is symmetric positive definite (R > 0), and the state cost matrices Q_1, ..., Q_n are symmetric positive semidefinite (Q_t ≥ 0). Each movement lasts n time steps; at t = n, the final cost is x_n^T Q_n x_n, and u_n is undefined. The independent random variables ξ_t ∈ R^m, ω_t ∈ R^k, ε_t ∈ R^c, and ϵ_t ∈ R^d have multidimensional gaussian distributions with mean 0 and covariances Ω^ξ ≥ 0, Ω^ω > 0, Ω^ε = I, and Ω^ϵ = I, respectively. Thus, the control-dependent and state-dependent noise terms in equation 3.1 have covariances Σ_i C_i u_t u_t^T C_i^T and Σ_i D_i x_t x_t^T D_i^T. When the
control-dependent noise is meant to be added to the control signal (which is usually the case), the matrices C_i should have the form B F_i, where the F_i are the actual noise scaling factors. Then the control-dependent part of the plant dynamics becomes B(I + Σ_i ε_t^i F_i) u_t. The problem of optimal control is to find the optimal control law, that is, the sequence of causal control functions u_t(u_1, ..., u_{t−1}, y_1, ..., y_{t−1}) that minimize the expected total cost over the movement. Note that computing the optimal sequence of functions u_1(·), ..., u_{n−1}(·) is a different, and in general much more difficult, problem than computing the optimal sequence of open-loop controls u_1, ..., u_{n−1}. When only additive noise is present (i.e., C_1, ..., C_c = 0 and D_1, ..., D_d = 0), this reduces to the classic LQG problem, which has the well-known optimal solution (Davis & Vinter, 1985)

Linear-Quadratic Regulator
  u_t = −L_t x̂_t
  L_t = (R + B^T S_{t+1} B)^{−1} B^T S_{t+1} A
  S_t = Q_t + A^T S_{t+1} (A − B L_t);   S_n = Q_n

Kalman Filter
  x̂_{t+1} = A x̂_t + B u_t + K_t (y_t − H x̂_t)
  K_t = A Σ_t H^T (H Σ_t H^T + Ω^ω)^{−1}
  Σ_{t+1} = Ω^ξ + (A − K_t H) Σ_t A^T        (3.2)

In that case, the optimal control law depends on the history of control and feedback signals only through the state estimate x̂_t, which is updated recursively by the Kalman filter. The matrices L that define the optimal control law do not depend on the noise covariances or filter coefficients, and the matrices K that define the optimal filter do not depend on the cost and control law. In the case of control-dependent and state-dependent noise, the above independence properties no longer hold. This complicates the problem substantially and forces us to adopt a more restricted formulation in the interest of analytical tractability. We assume that, as in equation 3.2, the entire history of control and feedback signals is summarized by a state estimate x̂_t, which is all the information available to the control system at time t. The feedback control law u_t(·) is allowed to be an arbitrary function of x̂_t, but x̂_t can be updated only by a recursive linear filter of the form

x̂_{t+1} = A x̂_t + B u_t + K_t (y_t − H x̂_t) + η_t.

The internal noise η_t ∈ R^m has mean 0 and covariance Ω^η ≥ 0. The filter gains K_1, ..., K_{n−1} are nonadaptive; they are determined in advance and cannot change as a function of the specific controls and observations within a simulation run. Such a filter is always unbiased: for any K_1, ..., K_{n−1}, we have E[x_t | x̂_t] = x̂_t for all t. Note, however, that under the extended noise model, any nonadaptive linear filter is suboptimal: when x̂_t is computed as defined above, Cov[x_t | x̂_t] is generally larger than Cov[x_t | u_1, ..., u_{t−1}, y_1, ..., y_{t−1}]. The consequences of this will be explored numerically in section 8.
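For concreteness, the classic recursions in equation 3.2 can be sketched as follows; the plant matrices here are an arbitrary stable toy system of my own choosing, not from the paper. Note how the two passes are decoupled: the backward Riccati pass never touches the noise covariances, and the forward filter pass never touches the costs.

```python
import numpy as np

m, p, k, n = 2, 1, 1, 20
A = np.array([[1.0, 0.1], [0.0, 0.9]])
B = np.array([[0.0], [0.1]])
H = np.array([[1.0, 0.0]])
Q = np.eye(m); R = 0.01 * np.eye(p)
Ox = 0.01 * np.eye(m)   # Omega^xi  (process noise covariance)
Ow = 0.01 * np.eye(k)   # Omega^omega (observation noise covariance)

# Controller: backward Riccati pass, L_t = (R + B'S_{t+1}B)^{-1} B'S_{t+1}A.
S = Q.copy(); L = [None] * (n - 1)
for t in reversed(range(n - 1)):
    L[t] = np.linalg.solve(R + B.T @ S @ B, B.T @ S @ A)
    S = Q + A.T @ S @ (A - B @ L[t])

# Filter: forward pass, K_t = A Sig H' (H Sig H' + Ow)^{-1}.
Sig = np.eye(m); K = [None] * (n - 1)
for t in range(n - 1):
    K[t] = A @ Sig @ H.T @ np.linalg.inv(H @ Sig @ H.T + Ow)
    Sig = Ox + (A - K[t] @ H) @ Sig @ A.T

print(L[0], K[0])
```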
4 Optimal Controller

The optimal u_t will be computed using the method of dynamic programming. We will show by induction that if the true state at time t is x_t and the unbiased state estimate available to the control system is x̂_t, then the optimal cost-to-go function (i.e., the cost expected to accumulate under the optimal control law) has the quadratic form

v_t(x_t, x̂_t) = x_t^T S_t^x x_t + (x_t − x̂_t)^T S_t^e (x_t − x̂_t) + s_t = x_t^T S_t^x x_t + e_t^T S_t^e e_t + s_t,

where e_t ≜ x_t − x̂_t is the estimation error. At the final time t = n, the optimal cost-to-go is simply the final cost x_n^T Q_n x_n, and so v_n is in the assumed form with S_n^x = Q_n, S_n^e = 0, s_n = 0. To carry out the induction proof, we have to show that if v_{t+1} is in the above form for some t < n, then v_t is also in that form. Consider a time-varying control law that is optimal at times t + 1, ..., n, and at time t is given by u_t = π(x̂_t). Let v_t^π(x_t, x̂_t) be the corresponding cost-to-go function. Since this control law is optimal after time t, we have v_{t+1}^π = v_{t+1}. Then the cost-to-go function v_t^π satisfies the Bellman equation

v_t^π(x_t, x̂_t) = x_t^T Q_t x_t + π(x̂_t)^T R π(x̂_t) + E[v_{t+1}(x_{t+1}, x̂_{t+1}) | x_t, x̂_t, π].

To compute the above expectation term, we need the update equations for the system variables. Using the definitions of the observation y_t and the estimation error e_t, the stochastic dynamics of the variables of interest become

x_{t+1} = A x_t + B π(x̂_t) + ξ_t + Σ_i ε_t^i C_i π(x̂_t)
e_{t+1} = (A − K_t H) e_t + ξ_t − K_t ω_t − η_t + Σ_i ε_t^i C_i π(x̂_t) − Σ_i ϵ_t^i K_t D_i x_t.        (4.1)

Then the conditional means and covariances of x_{t+1} and e_{t+1} are

E[x_{t+1} | x_t, x̂_t, π] = A x_t + B π(x̂_t)
E[e_{t+1} | x_t, x̂_t, π] = (A − K_t H) e_t
Cov[x_{t+1} | x_t, x̂_t, π] = Ω^ξ + Σ_i C_i π(x̂_t) π(x̂_t)^T C_i^T
Cov[e_{t+1} | x_t, x̂_t, π] = Ω^ξ + Σ_i C_i π(x̂_t) π(x̂_t)^T C_i^T + Ω^η + K_t Ω^ω K_t^T + Σ_i K_t D_i x_t x_t^T D_i^T K_t^T,
and the conditional expectation in the Bellman equation can be computed. The cost-to-go becomes

v_t^π(x_t, x̂_t) = x_t^T (Q_t + A^T S_{t+1}^x A + D_t) x_t + e_t^T (A − K_t H)^T S_{t+1}^e (A − K_t H) e_t + tr(M_t)
                 + π(x̂_t)^T (R + B^T S_{t+1}^x B + C_t) π(x̂_t) + 2 π(x̂_t)^T B^T S_{t+1}^x A x_t,

where we defined the shortcuts

C_t ≜ Σ_i C_i^T (S_{t+1}^x + S_{t+1}^e) C_i,
D_t ≜ Σ_i D_i^T K_t^T S_{t+1}^e K_t D_i, and
M_t ≜ S_{t+1}^x Ω^ξ + S_{t+1}^e (Ω^ξ + Ω^η + K_t Ω^ω K_t^T).

Note that the control law affects the cost-to-go function only through an expression that is quadratic in π(x̂_t), which can be minimized analytically. But there is a problem: the minimum depends on x_t, while π is only allowed to be a function of x̂_t. To obtain the optimal control law at time t, we have to take an expectation over x_t conditional on x̂_t, and find the function π that minimizes the resulting expression. Note that the control-dependent expression is linear in x_t, and so its expectation depends on the conditional mean of x_t but not on any higher moments. Since E[x_t | x̂_t] = x̂_t, we have

E[v_t^π(x_t, x̂_t) | x̂_t] = const + π(x̂_t)^T (R + B^T S_{t+1}^x B + C_t) π(x̂_t) + 2 π(x̂_t)^T B^T S_{t+1}^x A x̂_t,

and thus the optimal control law at time t is

u_t = π(x̂_t) = −L_t x̂_t;   L_t = (R + B^T S_{t+1}^x B + C_t)^{−1} B^T S_{t+1}^x A.

Note that the linear form of the optimal control law fell out of the optimization and was not assumed. Given our assumptions, the matrix being inverted is symmetric positive definite. To complete the induction proof, we have to compute the optimal cost-to-go v_t, which is equal to v_t^π when π is set to the optimal control law −L_t x̂_t. Using the fact that L_t^T (R + B^T S_{t+1}^x B + C_t) L_t = L_t^T B^T S_{t+1}^x A = A^T S_{t+1}^x B L_t, and that x̂^T Z x̂ − 2 x̂^T Z x = (x − x̂)^T Z (x − x̂) − x^T Z x = e^T Z e − x^T Z x for a symmetric matrix Z (in our case equal to L_t^T B^T S_{t+1}^x A), the result is

v_t(x_t, x̂_t) = x_t^T (Q_t + A^T S_{t+1}^x (A − B L_t) + D_t) x_t + tr(M_t) + s_{t+1}
              + e_t^T (A^T S_{t+1}^x B L_t + (A − K_t H)^T S_{t+1}^e (A − K_t H)) e_t.
We now see that the optimal cost-to-go function remains in the assumed quadratic form, which completes the induction proof. The optimal control law is computed recursively backward in time as

Controller
  u_t = −L_t x̂_t
  L_t = (R + B^T S_{t+1}^x B + Σ_i C_i^T (S_{t+1}^x + S_{t+1}^e) C_i)^{−1} B^T S_{t+1}^x A
  S_t^x = Q_t + A^T S_{t+1}^x (A − B L_t) + Σ_i D_i^T K_t^T S_{t+1}^e K_t D_i;   S_n^x = Q_n
  S_t^e = A^T S_{t+1}^x B L_t + (A − K_t H)^T S_{t+1}^e (A − K_t H);   S_n^e = 0
  s_t = tr(S_{t+1}^x Ω^ξ + S_{t+1}^e (Ω^ξ + Ω^η + K_t Ω^ω K_t^T)) + s_{t+1};   s_n = 0.        (4.2)
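A minimal sketch of the backward pass in equation 4.2 (toy matrices and fixed filter gains of my own choosing, not from the paper): when the multiplicative-noise terms are switched off, the gains it returns coincide with the classic Riccati gains of equation 3.2, which is a useful implementation check.

```python
import numpy as np

m, p, k, n = 2, 1, 1, 15
A = np.array([[1.0, 0.1], [0.0, 0.9]]); B = np.array([[0.0], [0.1]])
H = np.array([[1.0, 0.0]])
Q = np.eye(m); Qn = np.eye(m); R = np.array([[0.01]])
Ox = 1e-4 * np.eye(m); Ow = 1e-4 * np.eye(k); Oh = np.zeros((m, m))
Ks = [0.5 * np.array([[1.0], [0.5]])] * (n - 1)   # arbitrary fixed filter gains

def gains(Cs, Ds):
    """Backward pass of equation 4.2 for the fixed filter gains Ks."""
    Sx, Se, s = Qn.copy(), np.zeros((m, m)), 0.0
    Ls = [None] * (n - 1)
    for t in reversed(range(n - 1)):
        K = Ks[t]
        Ct = sum((Ci.T @ (Sx + Se) @ Ci for Ci in Cs), np.zeros((p, p)))
        L = np.linalg.solve(R + B.T @ Sx @ B + Ct, B.T @ Sx @ A)
        Sx, Se, s = (   # tuple RHS uses the time-(t+1) values throughout
            Q + A.T @ Sx @ (A - B @ L)
                + sum((Di.T @ K.T @ Se @ K @ Di for Di in Ds), np.zeros((m, m))),
            A.T @ Sx @ B @ L + (A - K @ H).T @ Se @ (A - K @ H),
            np.trace(Sx @ Ox + Se @ (Ox + Oh + K @ Ow @ K.T)) + s,
        )
        Ls[t] = L
    return Ls

# With C_i = D_i = 0 the gains must match the classic LQG recursion (eq. 3.2).
S = Qn.copy(); L_classic = [None] * (n - 1)
for t in reversed(range(n - 1)):
    L_classic[t] = np.linalg.solve(R + B.T @ S @ B, B.T @ S @ A)
    S = Q + A.T @ S @ (A - B @ L_classic[t])

L_clean = gains(Cs=[], Ds=[])
L_noisy = gains(Cs=[0.5 * B], Ds=[0.2 * H])
print(np.allclose(L_clean[0], L_classic[0]))
```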
The total expected cost is x̂_1^T S_1^x x̂_1 + tr((S_1^x + S_1^e) Σ_1) + s_1. When the control-dependent and state-dependent noise terms are removed (i.e., C_1, ..., C_c = 0, D_1, ..., D_d = 0), the control laws given by equations 4.2 and 3.2 are identical. The internal noise term Ω^η, as well as the additive noise terms Ω^ξ and Ω^ω, do not directly affect the calculation of the feedback gain matrices L. However, all noise terms affect the calculation (see below) of the optimal filter gains K, which in turn affect L. One can attempt to transform equation 3.1 into a fully observable system by setting H = I, Ω^ω = Ω^η = 0, D_1, ..., D_d = 0, in which case K = A, and apply equation 4.2. Recall, however, our assumption that the control signal is generated before the current state is measured. Thus, even if we make the sensory measurement equal to the state, we would still be dealing with a partially observable system. To derive the optimal controller for the fully observable case, we have to assume that x_t is known at the time when u_t is generated. The above derivation is now much simplified: the optimal cost-to-go function v_t is in the form x_t^T S_t x_t + s_t, and the expectation term that needs to be minimized with regard to u_t = π(x_t) becomes

E[v_{t+1}] = (A x_t + B u_t)^T S_{t+1} (A x_t + B u_t) + u_t^T (Σ_i C_i^T S_{t+1} C_i) u_t + tr(S_{t+1} Ω^ξ) + s_{t+1},

and the optimal controller is computed in a backward pass through time as

Fully observable controller
  u_t = −L_t x_t
  L_t = (R + B^T S_{t+1} B + Σ_i C_i^T S_{t+1} C_i)^{−1} B^T S_{t+1} A
  S_t = Q_t + A^T S_{t+1} (A − B L_t);   S_n = Q_n
  s_t = tr(S_{t+1} Ω^ξ) + s_{t+1};   s_n = 0.        (4.3)
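The fully observable recursion in equation 4.3 is easy to sketch on a scalar toy plant (the values are mine, chosen only for illustration). One qualitative effect worth verifying: control-dependent noise enlarges the effective control cost R + Σ_i C_i^T S_{t+1} C_i, so the optimal gains shrink relative to the noise-free case.

```python
import numpy as np

n = 30
A = np.array([[1.02]]); B = np.array([[0.1]])
Q = np.array([[1.0]]); Qn = Q; R = np.array([[0.01]])

def full_obs_gains(sigma_c):
    """Backward pass of equation 4.3 with C_1 = sigma_c * B."""
    Cs = [sigma_c * B]
    S = Qn.copy(); Ls = [None] * (n - 1)
    for t in reversed(range(n - 1)):
        Ct = sum(Ci.T @ S @ Ci for Ci in Cs)    # extra control penalty
        Ls[t] = np.linalg.solve(R + B.T @ S @ B + Ct, B.T @ S @ A)
        S = Q + A.T @ S @ (A - B @ Ls[t])
    return Ls

L_clean = full_obs_gains(0.0)[0].item()
L_noisy = full_obs_gains(1.0)[0].item()
print(L_clean, L_noisy)   # the noisy-control gain is smaller
```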
5 Optimal Estimator

So far, we have computed the optimal control law L for any fixed sequence of filter gains K. What should these gains be fixed to? Ideally, they should correspond to a Kalman filter, which is the optimal linear estimator. However, in the presence of control-dependent and state-dependent noise, the Kalman filter gains become adaptive (i.e., K_t depends on x̂_t and u_t), which would make our control law derivation invalid. Thus, if we want to preserve the optimality of the control law given by equation 4.2 and obtain an iterative algorithm with guaranteed convergence, we need to compute a fixed sequence of filter gains that are optimal for a given control law. Once the iterative algorithm has converged and the control law has been designed, we could use an adaptive filter in place of the fixed-gain filter at run time (see section 8). Thus, our objective here is the following: given a linear feedback control law L_1, ..., L_{n−1} (which is optimal for the previous filter K_1, ..., K_{n−1}), compute a new filter that, in conjunction with the given control law, results in minimal expected cost. In other words, we will evaluate the filter not by the magnitude of its estimation errors, but by the effect that these estimation errors have on the performance of the composite estimation-control system. We will show that the new optimal filter can be designed in a forward pass through time. In particular, we will show that regardless of the new values of K_1, ..., K_{t−1}, the optimal K_t can be found analytically as long as K_{t+1}, ..., K_{n−1} still have the values for which L_{t+1}, ..., L_{n−1} are optimal. Recall that the optimal L_{t+1}, ..., L_{n−1} depend only on K_{t+1}, ..., K_{n−1}, and so the parameters (as well as the form) of the optimal cost-to-go function v_{t+1} cannot be affected by changing K_1, ..., K_t.
Since K_t affects only the computation of x̂_{t+1}, and the effect of x̂_{t+1} on the total expected cost is captured by the function v_{t+1}, we have to minimize v_{t+1} with respect to K_t. But v is a function of x and x̂, while K cannot be adapted to the specific values of x and x̂ within a simulation run (by assumption). Thus, the quantity we have to minimize is the unconditional expectation of v_{t+1}. In doing so, we will use the fact that E[v_{t+1}(x_{t+1}, x̂_{t+1})] = E_{x_t, x̂_t}[ E[v_{t+1}(x_{t+1}, x̂_{t+1}) | x_t, x̂_t, L_t] ]. The conditional expectation was already computed as an intermediate step in the previous section (not shown). The terms in E[v_{t+1}(x_{t+1}, x̂_{t+1}) | x_t, x̂_t, L_t] that depend on K_t are

e_t^T (A − K_t H)^T S_{t+1}^e (A − K_t H) e_t + tr( K_t (Ω^ω + Σ_i D_i x_t x_t^T D_i^T) K_t^T S_{t+1}^e ).

Defining the (uncentered) unconditional covariances Σ_t^e ≜ E[e_t e_t^T] and Σ_t^x ≜ E[x_t x_t^T], the unconditional expectation of the K_t-dependent expression
above becomes

a(K_t) = tr( [ (A − K_t H) Σ_t^e (A − K_t H)^T + K_t P_t K_t^T ] S_{t+1}^e );   P_t ≜ Ω^ω + Σ_i D_i Σ_t^x D_i^T.

The minimum of a(K_t) is found by setting its derivative with regard to K_t to 0. Using the matrix identities ∂/∂X tr(XU) = U^T and ∂/∂X tr(X U X^T V) = V X U + V^T X U^T, and the fact that the matrices S_{t+1}^e, Ω^ω, Σ_t^e, Σ_t^x are symmetric, we obtain

∂a(K_t)/∂K_t = 2 S_{t+1}^e ( K_t (H Σ_t^e H^T + P_t) − A Σ_t^e H^T ).

This expression is equal to 0 whenever K_t = A Σ_t^e H^T (H Σ_t^e H^T + P_t)^{−1}, regardless of the value of S_{t+1}^e. Given our assumptions, the matrix being inverted is symmetric positive definite. Note that the optimal K_t depends on K_1, ..., K_{t−1} (through Σ_t^e and Σ_t^x) but is independent of K_{t+1}, ..., K_{n−1} (since it is independent of S_{t+1}^e). This is the reason that the filter gains are reoptimized in a forward pass. To complete the derivation, we have to substitute the optimal filter gains and compute the unconditional covariances. Recall that the variables x_t, x̂_t, e_t are deterministically related by e_t = x_t − x̂_t, so the covariance of any one of them can be computed given the covariances of the other two, and we have a choice of which pair of covariance matrices to compute. The resulting equations are most compact for the pair x̂_t, e_t. The stochastic dynamics of these variables are

x̂_{t+1} = (A − B L_t) x̂_t + K_t H e_t + K_t ω_t + η_t + Σ_i ϵ_t^i K_t D_i (e_t + x̂_t)
e_{t+1} = (A − K_t H) e_t + ξ_t − K_t ω_t − η_t − Σ_i ε_t^i C_i L_t x̂_t − Σ_i ϵ_t^i K_t D_i (e_t + x̂_t).        (5.1)

Define the unconditional covariances

Σ_t^x̂ ≜ E[x̂_t x̂_t^T];   Σ_t^e ≜ E[e_t e_t^T];   Σ_t^x̂e ≜ E[x̂_t e_t^T],

noting that Σ_t^x̂ is uncentered and Σ_t^ex̂ = (Σ_t^x̂e)^T. Since x̂_1 is a known constant, the initialization at t = 1 is Σ_1^e = Σ_1, Σ_1^x̂ = x̂_1 x̂_1^T, Σ_1^x̂e = 0. With these definitions, we have Σ_t^x = E[(e_t + x̂_t)(e_t + x̂_t)^T] = Σ_t^e + Σ_t^x̂ + Σ_t^x̂e + Σ_t^ex̂. Using equation 5.1, the updates for the unconditional covariances are

Σ_{t+1}^e = (A − K_t H) Σ_t^e (A − K_t H)^T + Ω^ξ + Ω^η + K_t P_t K_t^T + Σ_i C_i L_t Σ_t^x̂ L_t^T C_i^T
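The stationarity condition for K_t can be checked numerically. The sketch below (random toy matrices, not from the paper) evaluates a(K) directly and confirms that K* = A Σ^e H^T (H Σ^e H^T + P)^{−1} is a minimum: since a is quadratic in K with a positive definite Hessian, no perturbation can decrease it.

```python
import numpy as np

rng = np.random.default_rng(2)
m, k = 3, 2
A = rng.standard_normal((m, m))
H = rng.standard_normal((k, m))
M = rng.standard_normal((m, m)); Sig_e = M @ M.T + np.eye(m)   # spd
Np = rng.standard_normal((k, k)); P = Np @ Np.T + np.eye(k)    # spd
W = rng.standard_normal((m, m)); S_e = W @ W.T + np.eye(m)     # spd weight

def a(K):
    """a(K) = tr{[(A - K H) Sig_e (A - K H)' + K P K'] S_e}."""
    return np.trace(((A - K @ H) @ Sig_e @ (A - K @ H).T + K @ P @ K.T) @ S_e)

K_star = A @ Sig_e @ H.T @ np.linalg.inv(H @ Sig_e @ H.T + P)
base = a(K_star)
perturbed = [a(K_star + 1e-3 * rng.standard_normal((m, k))) for _ in range(20)]
print(base <= min(perturbed))   # K_star is never beaten by a perturbation
```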
Σ_{t+1}^x̂ = (A − B L_t) Σ_t^x̂ (A − B L_t)^T + Ω^η + K_t (H Σ_t^e H^T + P_t) K_t^T + (A − B L_t) Σ_t^x̂e H^T K_t^T + K_t H Σ_t^ex̂ (A − B L_t)^T
Σ_{t+1}^x̂e = (A − B L_t) Σ_t^x̂e (A − K_t H)^T + K_t H Σ_t^e (A − K_t H)^T − Ω^η − K_t P_t K_t^T.

Substituting the optimal value of K_t, which allows some simplifications to the above update equations, the optimal nonadaptive linear filter is computed in a forward pass through time as

Estimator
  x̂_{t+1} = (A − B L_t) x̂_t + K_t (y_t − H x̂_t) + η_t
  K_t = A Σ_t^e H^T ( H Σ_t^e H^T + Ω^ω + Σ_i D_i (Σ_t^e + Σ_t^x̂ + Σ_t^x̂e + Σ_t^ex̂) D_i^T )^{−1}
  Σ_{t+1}^e = Ω^ξ + Ω^η + (A − K_t H) Σ_t^e A^T + Σ_i C_i L_t Σ_t^x̂ L_t^T C_i^T;   Σ_1^e = Σ_1
  Σ_{t+1}^x̂ = Ω^η + K_t H Σ_t^e A^T + (A − B L_t) Σ_t^x̂ (A − B L_t)^T + (A − B L_t) Σ_t^x̂e H^T K_t^T + K_t H Σ_t^ex̂ (A − B L_t)^T;   Σ_1^x̂ = x̂_1 x̂_1^T
  Σ_{t+1}^x̂e = (A − B L_t) Σ_t^x̂e (A − K_t H)^T − Ω^η;   Σ_1^x̂e = 0.        (5.2)

It is worth noting the effects of the internal noise η_t. If that term did not exist (i.e., Ω^η = 0), the last update equation would yield Σ_t^x̂e = 0 for all t. Indeed, for an optimal filter, one would expect Σ_t^x̂e = 0 from the orthogonality principle: if the state estimate and estimation error were correlated, one could improve the filter by taking that correlation into account. However, the situation here is different because we have noise acting directly on the state estimate. When such noise pushes x̂_t in one direction, e_t is (by definition) pushed in the opposite direction, creating a negative correlation between x̂_t and e_t. This is the reason for the negative sign in front of the Ω^η term in the last update equation. The complete algorithm is the following:

Algorithm: Initialize K_1, ..., K_{n−1}, and iterate equation 4.2 and equation 5.2 until convergence.

Convergence is guaranteed, because the expected cost is nonnegative by definition, and we are using a coordinate-descent algorithm, which decreases the expected cost in each step. The initial sequence K could be set to 0, in which case the first pass of equation 4.2 will find the optimal open-loop controls, or initialized from equation 3.2, which is equivalent to assuming additive noise in the first pass. We can also derive the optimal adaptive linear filter, with gains K_t that depend on the specific x̂_t and u_t = −L_t x̂_t within each simulation run. This is again accomplished by minimizing E[v_{t+1}] with respect to K_t, but the expectation is computed with x̂_t being a known constant rather than a random variable. We now have Σ_t^x̂ = x̂_t x̂_t^T and Σ_t^x̂e = 0, and so the last two update
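Putting the two passes together, the coordinate-descent loop can be sketched end to end on a toy system (all matrices below are illustrative choices of mine, not the paper's reaching model). Each iteration runs the controller backward pass of equation 4.2 for the current filter gains, evaluates the expected cost x̂_1^T S_1^x x̂_1 + tr((S_1^x + S_1^e) Σ_1) + s_1, and then reoptimizes the filter with the forward pass of equation 5.2; the recorded costs should be non-increasing.

```python
import numpy as np

m, p, k, n = 2, 1, 1, 15
A = np.array([[1.0, 0.1], [0.0, 0.9]]); B = np.array([[0.0], [0.1]])
H = np.array([[1.0, 0.0]])
Q = np.zeros((m, m)); Qn = np.eye(m); R = np.array([[0.01]])
Cs = [0.5 * B]                      # control-dependent system noise
Ds = [0.2 * H]                      # state-dependent observation noise
Ox = 1e-4 * np.eye(m); Ow = 1e-4 * np.eye(k); Oh = 1e-5 * np.eye(m)
xhat1 = np.array([[1.0], [0.0]]); Sig1 = 0.01 * np.eye(m)

def controller(Ks):
    """Backward pass (eq. 4.2): optimal gains L_t for fixed filter gains Ks,
    plus the resulting expected cost."""
    Sx, Se, s = Qn.copy(), np.zeros((m, m)), 0.0
    Ls = [None] * (n - 1)
    for t in reversed(range(n - 1)):
        K = Ks[t]
        Ct = sum(C.T @ (Sx + Se) @ C for C in Cs)
        L = np.linalg.solve(R + B.T @ Sx @ B + Ct, B.T @ Sx @ A)
        Sx, Se, s = (   # RHS evaluated with the time-(t+1) values
            Q + A.T @ Sx @ (A - B @ L)
                + sum(D.T @ K.T @ Se @ K @ D for D in Ds),
            A.T @ Sx @ B @ L + (A - K @ H).T @ Se @ (A - K @ H),
            np.trace(Sx @ Ox + Se @ (Ox + Oh + K @ Ow @ K.T)) + s,
        )
        Ls[t] = L
    cost = (xhat1.T @ Sx @ xhat1).item() + np.trace((Sx + Se) @ Sig1) + s
    return Ls, cost

def estimator(Ls):
    """Forward pass (eq. 5.2): optimal nonadaptive filter gains for fixed Ls."""
    Se, Sxh, Sxe = Sig1.copy(), xhat1 @ xhat1.T, np.zeros((m, m))
    Ks = [None] * (n - 1)
    for t in range(n - 1):
        L = Ls[t]
        Sx_tot = Se + Sxh + Sxe + Sxe.T            # unconditional E[x x^T]
        P = Ow + sum(D @ Sx_tot @ D.T for D in Ds)
        K = A @ Se @ H.T @ np.linalg.inv(H @ Se @ H.T + P)
        Se, Sxh, Sxe = (
            Ox + Oh + (A - K @ H) @ Se @ A.T
                + sum(C @ L @ Sxh @ L.T @ C.T for C in Cs),
            Oh + K @ H @ Se @ A.T + (A - B @ L) @ Sxh @ (A - B @ L).T
                + (A - B @ L) @ Sxe @ H.T @ K.T + K @ H @ Sxe.T @ (A - B @ L).T,
            (A - B @ L) @ Sxe @ (A - K @ H).T - Oh,
        )
        Ks[t] = K
    return Ks

Ks = [np.zeros((m, k))] * (n - 1)    # start from K = 0 (open-loop first pass)
costs = []
for _ in range(10):
    Ls, cost = controller(Ks)
    costs.append(cost)
    Ks = estimator(Ls)
print(costs[0], costs[-1])
```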
equations in equation 5.2 are no longer needed. The optimal adaptive linear filter is

Adaptive estimator
  x̂_{t+1} = (A − B L_t) x̂_t + K_t (y_t − H x̂_t) + η_t
  K_t = A Σ_t H^T ( H Σ_t H^T + Ω^ω + Σ_i D_i (Σ_t + x̂_t x̂_t^T) D_i^T )^{−1}
  Σ_{t+1} = Ω^ξ + Ω^η + (A − K_t H) Σ_t A^T + Σ_i C_i L_t x̂_t x̂_t^T L_t^T C_i^T,        (5.3)

where Σ_t = Cov[x_t | x̂_t] is the conditional estimation error covariance (initialized from Σ_1, which is given). When the control-dependent, state-dependent, and internal noise terms are removed (C_1, ..., C_c = 0, D_1, ..., D_d = 0, Ω^η = 0), equation 5.3 reduces to the Kalman filter in equation 3.2. Note that using equation 5.3 instead of equation 5.2 online reduces the total expected cost, because equation 5.3 achieves lower estimation error than any other linear filter, and the expected cost depends on the conditional estimation error covariance. This can be seen from

E[v_t(x_t, x̂_t) | x̂_t] = x̂_t^T S_t^x x̂_t + s_t + tr( (S_t^x + S_t^e) Cov[x_t | x̂_t] ).

6 Application to Reaching Movements

We now illustrate how the methodology developed above can be used to construct models relevant to motor control. Since this is a methodological rather than a modeling article, a detailed evaluation of the resulting models in the context of the motor control literature will not be given here. The first model is a one-dimensional model of reaching and includes control-dependent noise but no state-dependent or internal noise. The latter two forms of noise are illustrated in the second model, where we estimate the position of a stationary peripheral target without making a movement.

6.1 Models. We model a single-joint movement (such as flexing the elbow) that brings the hand to a specified target. For simplicity, the rotational motion is replaced with translational motion; the hand is modeled as a point mass (m = 1 kg) whose one-dimensional position at time t is p(t). The combined action of all muscles is represented with the force f(t) acting on the hand. The control signal u(t) is transformed into force f(t) by adding control-dependent multiplicative noise and applying a second-order muscle-like low-pass filter (Winter, 1990) of the form τ1 τ2 f̈(t) + (τ1 + τ2) ḟ(t) + f(t) = u(t), with time constants τ1 = τ2 = 0.04 sec.
Note that a second-order filter can be written as a pair of coupled first-order filters (with outputs g and f) as follows: τ1 ġ(t) + g(t) = u(t), τ2 ḟ(t) + f(t) = g(t). The task is to move the hand from the starting position p(0) = 0 m to the target position p* = 0.1 m and stop there at time t_end, with minimal energy
consumption. Movement durations are in the interval t_end ∈ [0.25 sec, 0.35 sec]. Time is discretized at Δ = 0.01 sec. The total cost is defined as

( p(t_end) − p* )^2 + ( w_v ṗ(t_end) )^2 + ( w_f f(t_end) )^2 + (r / (n − 1)) Σ_{k=1}^{n−1} u(k)^2.
The first term enforces positional accuracy; the second and third terms specify that the movement has to stop at time t_end, that is, both the velocity and the force have to vanish; and the last term penalizes energy consumption. It makes sense to set the scaling weights w_v and w_f so that w_v ṗ(t) and w_f f(t) averaged over the movement have magnitudes similar to the hand displacement p* − p(0). For a 0.1 m reaching movement that lasts about 0.3 sec, these weights are w_v = 0.2 and w_f = 0.02. The weight of the energy term was set to r = 0.00001. The discrete-time system state is represented with the five-dimensional vector x_t = [p(t); ṗ(t); f(t); g(t); p*], initialized from a gaussian with mean x̂_1 = [0; 0; 0; 0; p*]. The auxiliary state variable g(t) is needed to implement a second-order filter. The target p* is included in the state so that we can capture the above cost function using a quadratic with no linear terms: defining p = [1; 0; 0; 0; −1], we have p^T x_t = p(t) − p*, and so x_t^T (p p^T) x_t = (p(t) − p*)^2. Note that the same could be accomplished by setting p = [1; 0; 0; 0; −p*] and x_t = [p(t); ṗ(t); f(t); g(t); 1]. The advantage of the formulation used here is that because the target is represented in the state, the same control law can be reused for other targets. The control law, of course, depends on the filter, which depends on the initial expected state, which depends on the target, and so a control law optimal for one target is not necessarily optimal for all other targets. Unpublished simulation results indicate good generalization, but a more detailed investigation of how the optimal control law depends on the target position is needed. The sensory feedback carries information about position, velocity, and force: y_t = [p(t); ṗ(t); f(t)] + ω_t.
The vector ω_t of sensory noise terms has zero-mean gaussian distribution with diagonal covariance Ω^ω = (σ_s diag[0.02 m; 0.2 m/s; 1 N])^2, where the relative magnitudes are set using the same order-of-magnitude reasoning as before, and σ_s = 0.5. The multiplicative noise term added to the discrete-time control signal u_t = u(t) is σ_c ε_t u_t, where σ_c = 0.5. Note that
Methods for Optimal Sensorimotor Control
σ_c is a unitless quantity that defines the noise magnitude relative to the control signal magnitude. The discrete-time dynamics of the above system are

p(t + Δ) = p(t) + Δ ṗ(t)
ṗ(t + Δ) = ṗ(t) + Δ f(t)/m
f(t + Δ) = f(t) (1 - Δ/τ2) + g(t) Δ/τ2
g(t + Δ) = g(t) (1 - Δ/τ1) + u(t) (1 + σ_c ε_t) Δ/τ1,

which is transformed into the form of equation 3.1 by the matrices
A = [ 1   Δ   0          0          0
      0   1   Δ/m        0          0
      0   0   1 - Δ/τ2   Δ/τ2       0
      0   0   0          1 - Δ/τ1   0
      0   0   0          0          1 ]

B = [ 0; 0; 0; Δ/τ1; 0 ]

H = [ 1 0 0 0 0
      0 1 0 0 0
      0 0 1 0 0 ]

C1 = B σ_c;  c = 1;  d = 0;  Σ1 = Ω^ξ = Ω^η = 0.
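As a concrete check, the matrix form can be assembled and compared against the scalar update equations above. This is a sketch: Δ = 0.01 sec is from the text, while the mass m and time constants τ1, τ2 below are illustrative placeholder values, not parameters given in this excerpt.

```python
import numpy as np

# Assumed values: dt (Delta) is from the text; m, tau1, tau2 are placeholders.
dt, m, tau1, tau2 = 0.01, 1.0, 0.04, 0.04

A = np.array([
    [1, dt, 0,           0,           0],
    [0, 1,  dt / m,      0,           0],
    [0, 0,  1 - dt/tau2, dt / tau2,   0],
    [0, 0,  0,           1 - dt/tau1, 0],
    [0, 0,  0,           0,           1],
])
B = np.array([[0.0], [0.0], [0.0], [dt / tau1], [0.0]])
H = np.hstack([np.eye(3), np.zeros((3, 2))])  # observe p, pdot, f

# One matrix step reproduces the scalar update equations (noise-free case).
x = np.array([0.02, 0.1, 0.5, 0.3, 0.1])   # [p, pdot, f, g, p*]
u = 1.5
x_next = A @ x + (B * u).ravel()
p, pd, f, g, pstar = x
assert np.isclose(x_next[0], p + dt * pd)
assert np.isclose(x_next[1], pd + dt * f / m)
assert np.isclose(x_next[2], f * (1 - dt/tau2) + g * dt/tau2)
assert np.isclose(x_next[3], g * (1 - dt/tau1) + u * dt/tau1)
assert np.isclose(x_next[4], pstar)        # target component is constant
```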
The cost matrices are R = r, Q_{1,...,n-1} = 0, and Q_n = pp^T + vv^T + ff^T, where

p = [1; 0; 0; 0; -1];  v = [0; w_v; 0; 0; 0];  f = [0; 0; w_f; 0; 0].
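The claim that the terminal cost is captured by the quadratic x^T Q_n x can be verified numerically; here is a minimal sketch using the weights quoted above (w_v = 0.2, w_f = 0.02):

```python
import numpy as np

# Terminal cost (p(T)-p*)^2 + (wv*pdot)^2 + (wf*force)^2 written as x^T Qn x.
wv, wf = 0.2, 0.02
p = np.array([1, 0, 0, 0, -1.0])
v = np.array([0, wv, 0, 0, 0])
f = np.array([0, 0, wf, 0, 0])
Qn = np.outer(p, p) + np.outer(v, v) + np.outer(f, f)

x = np.array([0.08, 0.3, 2.0, 1.0, 0.1])   # [p, pdot, force, g, p*]
quadratic = x @ Qn @ x
direct = (x[0] - x[4])**2 + (wv * x[1])**2 + (wf * x[2])**2
assert np.isclose(quadratic, direct)        # quadratic form matches the cost
```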
This completes the formulation of the first model. The above algorithm can now be applied to obtain the control law and filter, and the closed-loop system can be simulated. To replace the control-dependent noise with additive noise of similar magnitude (and compare the effects of the two forms of noise), we will set c = 0 and Ω^ξ = (4.6 N)^2 BB^T. The value of 4.6 N is the average magnitude of the control-dependent noise over the range of movement durations (found through 10,000 simulation runs at each movement duration). We also model an estimation process under state-dependent and internal noise, in the absence of movement. In that case, the state is x_t = p*, where the stationary target p* is sampled from a gaussian with mean x_1 ∈ {5 cm, 15 cm, 25 cm} and variance Σ1 = (5 cm)^2. Note that target eccentricity is represented as distance rather than visual angle. The state-dependent noise has scale D1 = 0.5, fixation is assumed to be at 0 cm, the time step is Δ = 10 msec, and we run the estimation process for n = 100 time steps. In one set of simulations, we use internal noise Ω^η = (0.5 cm)^2 without additive noise. In another set of simulations, we study additive noise with the same magnitude Ω^ω = (0.5 cm)^2, without internal noise. There is no actuator to be controlled, so we have A = H = 1 and B = L = 0. Estimation is based on the adaptive filter from equation 5.3.
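The qualitative behavior of this second model can be illustrated with a simplified scalar estimator. The sketch below replaces the paper's state-dependent observation noise with a fixed observation-noise variance (assumed values throughout), keeping only the contrast between internal estimation noise and purely additive noise:

```python
import numpy as np

# Scalar stand-in for the stationary-target estimation model (assumed values;
# state-dependent noise omitted for clarity). A Kalman update of a constant
# state drives the error variance to zero; injecting noise into the estimate
# itself ("internal" noise) enforces a nonzero asymptote.
obs_var = 0.5**2        # observation noise variance (cm^2)
int_var = 0.5**2        # internal estimation noise variance (cm^2)

def error_variance(n_steps, internal):
    P = 5.0**2          # prior variance on the stationary target (cm^2)
    for _ in range(n_steps):
        P = P * obs_var / (P + obs_var)   # Kalman update, constant state
        if internal:
            P += int_var                  # noise added to the estimate
    return P

P_add = error_variance(100, internal=False)
P_int = error_variance(100, internal=True)
assert P_add < 0.01     # additive noise only: error keeps shrinking toward 0
assert P_int > 0.2      # internal noise: error reaches a nonzero asymptote
```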
E. Todorov
6.2 Results. Reaching movements are known to have stereotyped bellshaped speed profiles (Flash & Hogan, 1985). Models of this phenomenon have traditionally been formulated in terms of deterministic open-loop minimization of some cost function. Cost functions that penalize physically meaningful quantities (such as duration or energy consumption) did not agree with empirical data (Nelson, 1983); in order to obtain realistic speed profiles, it appeared necessary to minimize a smoothness-related cost that penalizes the derivative of acceleration (Flash & Hogan, 1985) or torque (Uno et al., 1989). Smoothness-related cost functions have also been used in the context of stochastic optimal feedback control (Hoff, 1992) to obtain bell-shaped speed profiles. It was recently shown, however, that smoothness does not have to be explicitly enforced by the cost function; open-loop minimization of end-point error was found sufficient to produce realistic trajectories, provided that the multiplicative nature of motor noise is taken into account (Harris & Wolpert, 1998). While this is an important step toward a more principled optimization model of trajectory smoothness, it still contains an ad hoc element: the optimization is performed in an open loop, which is suboptimal, especially for movements of longer duration. Our model differs from Harris and Wolpert (1998) in that not only the average sequence of control signals is optimal, but the feedback gains that determine the online sensory-guided adjustments are also optimal. Optimal feedback control of reaching has been studied by Meyer, Abrams, Kornblum, Wright, and Smith (1988) in an intermittent setting, and Hoff (1992) in a continuous setting. However, both of these models assume full state observation. Ours is the first optimal control model of reaching that incorporates sensory noise and combines state estimation and feedback control into an optimal sensorimotor loop. 
The predicted movement kinematics shown in Figure 1A closely resemble observed movement trajectories (Flash & Hogan, 1985). Another well-known property of reaching movements, first observed a century ago by Woodworth and later quantified as Fitts’ law, is the trade-off between speed and accuracy. The fact that faster movements are less accurate implies that the instantaneous noise in the motor system is controldependent, in agreement with direct measurements of isometric force fluctuations (Sutton and Sykes, 1967; Schmidt et al., 1979; Todorov, 2002) that show standard deviation increasing linearly with the mean. Naturally, this noise scaling has formed the basis of both closed-loop (Meyer et al., 1988; Hoff, 1992) and open-loop (Harris & Wolpert, 1998) optimization models of the speed-accuracy trade-off. Figure 1B illustrates the effect in our model: as the (specified) movement duration increases, the standard deviation of the end-point error achieved by the optimal controller decreases. To emphasize the need for incorporating control-dependent noise, we modified the model by making the noise in the plant dynamics additive, with fixed magnitude chosen to match the average multiplicative noise magnitude over the range of movement durations. With that change, the end-point error showed the opposite trend to the one observed experimentally (see Figure 1B).
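The opposite trends under multiplicative versus additive noise can be seen in a back-of-the-envelope open-loop calculation (a toy model, not the paper's closed-loop simulation): reaching distance d in n steps with equal impulses, independent noise per step.

```python
import numpy as np

# Toy speed-accuracy calculation. With control u = d/n per step, multiplicative
# noise contributes variance (sigma_c*u)^2 per step, so the endpoint SD is
# sigma_c*d/sqrt(n): slower movements are MORE accurate. Additive noise of
# fixed per-step variance gives SD ~ sqrt(n): the opposite trend.
sigma_c, add_sd, d = 0.5, 0.05, 0.1   # noise scales and distance (assumed)

def endpoint_sd(n, multiplicative):
    u = d / n
    per_step = (sigma_c * u)**2 if multiplicative else add_sd**2
    return np.sqrt(n * per_step)

durations = [20, 30, 40, 50]          # number of time steps per movement
mult = [endpoint_sd(n, True) for n in durations]
addv = [endpoint_sd(n, False) for n in durations]
assert all(a > b for a, b in zip(mult, mult[1:]))   # slower => more accurate
assert all(a < b for a, b in zip(addv, addv[1:]))   # additive: opposite trend
```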
Figure 1: (A) Normalized position (Pos), velocity (Vel), and acceleration (Acc) of the average trajectory of the optimal controller. (B) A separate optimal controller was constructed for each instructed duration, the resulting closed-loop system was simulated for 10,000 trials, and the positional standard deviation at the end of the trial was plotted. This was done with either multiplicative (solid line) or additive (dashed line) noise in the plant dynamics. (C) The position of a stationary peripheral target was estimated over time, under internal estimation noise (solid line) or additive observation noise (dashed line). This was done in three sets of trials, with target positions sampled from gaussians with means 5 cm (bottom), 15 cm (middle), and 25 cm (top). Each curve is an average over 10,000 simulation runs.
It is interesting to compare the effects of the control penalty r and the multiplicative noise scaling σc . As equation 4.2 shows, both terms penalize large control signals—directly in the case of r and indirectly (via increased uncertainty) in the case of σc . Consequently, both terms lead to a negative bias in end-point position (not shown), but the effect is much more pronounced for r . Another consequence of the fact that larger controls are more costly arises in the control of redundant systems, where the optimal strategy is to follow a minimal intervention principle, that is, to leave task-irrelevant deviations from the average behavior uncorrected (Todorov & Jordan, 2002a, 2002b). Simulations have shown that this more complex effect is dependent on σc and actually decreases when r is increased while σc is kept constant (Todorov & Jordan, 2002b). Figure 1C shows simulation results from our second model, where the position of a stationary peripheral target is estimated by the optimal adaptive filter in equation 5.3, operating under internal estimation noise or additive observation noise of the same magnitude. In each case, we show results for three sets of targets with varying average eccentricity. The standard deviations of the estimation error always reach an asymptote (much faster in the case of internal noise). In the presence of internal noise, this asymptote depends on target eccentricity; for the chosen model parameters, the dependence is in quantitative agreement with our experimental results (Todorov, 1998). Under additive noise, the error always asymptotes to 0.
Figure 2: Relative change in expected cost as a function of iteration number, in (A) psychophysical models and (B) random models. (C) Relative variability (SD/mean) among expected costs obtained from 100 different runs of the algorithm on the same model (average over models in each class).
7 Convergence Properties

We studied the convergence properties of the algorithm in 10 models of psychophysical experiments taken from Todorov and Jordan (2002b) and 200 randomly generated models. The psychophysical models had dynamics and cost functions similar to the above example. They included two models of planar reaching, three models of passing through sequences of targets, one model of isometric force production, three models of tracking and reaching with a mechanically redundant arm, and one model of throwing. The dimensionalities of the state, control, and feedback were between 5 and 20, and the horizon n was about 100. The psychophysical models included control-dependent dynamics noise and additive observation noise, but no internal or state-dependent noise. The details of all these models are interesting from a motor control point of view, but we omit them here since they did not affect the convergence of the algorithm in any systematic way. The random models were divided into two groups of 100 each: passively stable, with all eigenvalues of A being smaller than 1, and passively unstable, with the largest eigenvalue of A being between 1 and 2. The dynamics were restricted so that the last component of x_t was 1, to make the random models more similar to the psychophysical models, which always incorporated a constant in the state description. The state, control, and measurement dimensionalities were sampled uniformly between 5 and 20. The random models included all forms of noise allowed by equation 3.1. For each model, we initialized K_{1,...,n-1} from equation 3.2 and applied our iterative algorithm. In all cases convergence was very rapid (see Figures 2A and 2B), with the relative change in expected cost decreasing exponentially. The jitter observed at the end of the minimization (see Figure 2A) is due to numerical round-off errors (note the log scale) and continues indefinitely.
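The shape of these convergence curves can be reproduced with a generic coordinate-descent toy problem (a made-up quadratic objective standing in for the controller/filter iteration of equations 4.2 and 5.2):

```python
import numpy as np

# Toy coordinate descent: alternately minimize a coupled quadratic cost over
# two blocks and track the relative change in cost, as in Figures 2A-B.
Q = np.array([[2.0, 0.9], [0.9, 1.0]])   # positive definite coupling (made up)

def cost(x, y):
    z = np.array([x, y])
    return z @ Q @ z + 10.0              # optimum cost is 10, not 0

x, y = 5.0, -3.0
costs = [cost(x, y)]
for _ in range(20):
    x = -Q[0, 1] * y / Q[0, 0]           # exact minimization over x, y fixed
    y = -Q[1, 0] * x / Q[1, 1]           # exact minimization over y, x fixed
    costs.append(cost(x, y))

rel = [(a - b) / a for a, b in zip(costs, costs[1:])]
assert all(a >= b for a, b in zip(costs, costs[1:]))   # monotone decrease
assert rel[10] < 1e-4 < rel[1]                         # exponential-type decay
```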
The exponential convergence regime does not always start from the first iteration (see Figure 2A). Similar behavior was observed for the absolute
change in expected cost (not shown). As one would expect, random models with unstable passive dynamics converged more slowly than passively stable models. Convergence was observed in all cases. To test for the existence of local minima, we focused on five psychophysical, five random stable, and five random unstable models. For each model, the algorithm was initialized 100 times with different randomly chosen sequences K_{1,...,n-1} and run for 100 iterations. For each model, we computed the standard deviation of the expected cost obtained at each iteration and divided by the mean expected cost at that iteration. The results, averaged within each model class, are plotted in Figure 2C. The negligibly small values after convergence indicate that the algorithm always finds the same solution. This was true for every model we studied, despite the fact that the random initialization sometimes produced very large initial costs. We also examined the K and L sequences found at the end of each run, and the differences seemed to be due to round-off errors. Thus, we conjecture that the algorithm always converges to the globally optimal solution. So far we have not been able to prove this analytically and cannot offer a satisfying intuitive explanation at this time. Note that the system can be unstable even for the optimal controller. Formally, that does not affect the derivation, because in a discrete-time finite-horizon system, all numbers remain finite. In practice, the components of x_t can exceed the maximum floating-point number whenever the eigenvalues of (A - B L_t) are sufficiently large. In the applications we are interested in (Todorov, 1998; Todorov & Jordan, 2002b), such problems were never encountered.
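The stability caveat amounts to checking the spectral radius of the closed-loop transition matrix A - B L_t; a small sketch with made-up 2D matrices:

```python
import numpy as np

# Closed-loop stability check: the system diverges when A - B*L has an
# eigenvalue outside the unit circle. A, B, and the gains L are made up.
A = np.array([[1.2, 0.1], [0.0, 1.1]])   # passively unstable plant
B = np.array([[0.0], [1.0]])

def spectral_radius(M):
    return np.max(np.abs(np.linalg.eigvals(M)))

L_weak   = np.array([[0.0, 0.05]])       # too little feedback
L_strong = np.array([[3.6, 1.1]])        # enough feedback to stabilize

assert spectral_radius(A - B @ L_weak) > 1.0    # still unstable
assert spectral_radius(A - B @ L_strong) < 1.0  # stabilized
```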
8 Improving Performance via Adaptive Estimation

Although the iterative algorithm given by equations 4.2 and 5.2 is guaranteed to converge, and empirically it appears to converge to the globally optimal solution, performance can still be suboptimal due to the imposed restriction to nonadaptive filters. Here we present simulations aimed at quantifying this suboptimality. Because the potential suboptimality arises from the restriction to nonadaptive filters, it is natural to ask what would happen if that restriction were removed in run time and the optimal adaptive linear filter from equation 5.3 were used instead. Recall that although the control law is optimized under the assumption of a nonadaptive filter, it yields better performance if a different filter, which somehow achieves lower estimation error, is used in run time. Thus, in our first test, we simply replace the nonadaptive filter with equation 5.3 in run time and compute the reduction in expected total cost. The above discussion suggests a possibility for further improvement. The control law is optimal with respect to some sequence of filter gains K_{1,...,n-1}. But the adaptive filter applied in run time uses systematically different gains, because it achieves systematically lower estimation error. We can run our control law in conjunction with the adaptive filter and find the average filter gains K̄_{1,...,n-1} that are used online. Now, one would think that if we reoptimized the control law for the nonadaptive filter K̄_{1,...,n-1}, which better reflects the gains being used online by the adaptive filter, this would further improve performance. This is the second test we apply.

Table 2: Cost Reduction.

Method                   Psychophysical   Random Stable   Random Unstable
Adaptive Estimator       1.9 %            0 %             31.4 %
Reoptimized Controller   1.7 %            0 %             28.3 %

Notes: Numbers indicate percent reduction in expected total cost, relative to the cost of the solution found by our iterative algorithm. The two improvement methods are described in the text. Each method is applied to 10 models in each model class. For each model and method, expected total cost is computed from 10,000 simulation runs. A value of 0% indicates that with a sample size of 10 models, the improvement was not significantly different from 0% (t-test, p = 0.05 threshold).

As Table 2 shows, neither method improves performance substantially for psychophysical models or random stable models. However, both methods result in substantial improvement for random unstable models. This is not surprising. In the passively stable models, the differences between the expected and actual values of the states and controls are relatively small, and so the optimal nonadaptive filter is not that different from the optimal adaptive filter. The unstable models, on the other hand, are very sensitive to small perturbations and thus follow substantially different state-control trajectories in different simulation runs. So the advantage of adaptive filtering is much greater. Since musculoskeletal plants have stable passive dynamics, we conclude that our algorithm is well suited for approximating the optimal sensorimotor system. It is interesting that control law reoptimization in addition to adaptive filtering is actually worse than adaptive filtering alone, contrary to our intuition. This was the case for every model we studied. Although it is not clear where the problem with the reoptimization method lies, this somewhat unexpected result provides further justification for the restriction we introduced. In particular, it suggests that the control law that is optimal under the best nonadaptive filter may be close to optimal under the best adaptive filter.

9 Discussion

We have presented an algorithm for stochastic optimal control and estimation of partially observable linear dynamical systems, subject to quadratic costs and noise processes characteristic of the sensorimotor system (see
equation 3.1). We restricted our attention to controllers that use state estimates obtained by nonadaptive linear filters. The optimal control law for any such filter was shown to be linear, as given by equation 4.2. The optimal nonadaptive linear filter for any linear control law is given by equation 5.2. Iteration of equations 4.2 and 5.2 is guaranteed to converge to a filter and a control law optimal with respect to each other. We found numerically that convergence is exponential, local minima do not exist, and the effects of assuming nonadaptive filtering are negligible for the control problems of interest. The application of the algorithm was illustrated in the context of reaching movements. The optimal adaptive filter, equation 5.3, as well as the optimal controller for the fully observable case, equation 4.3, were also derived. To facilitate the application of our algorithm in the field of motor control and elsewhere, we have made a Matlab implementation available at www.cogsci.ucsd.edu/∼todorov. While our work was motivated by models of biological movement, the results presented here could be of interest to a wider audience. Problems with multiplicative noise have been studied in the optimal control literature, but most of that work has focused on the fully observable case (Kleinman, 1969; McLane, 1971; Willems & Willems, 1976; Bensoussan, 1992; El Ghaoui, 1995; Beghi & D'Alessandro, 1998; Rami et al., 2001). Our equation 4.3 is consistent with these results. The partially observable case that we addressed (and that is most relevant to models of sensorimotor control) is much more complex, because the independence of estimation and control breaks down in the presence of signal-dependent noise. The work most similar to ours is Pakshin (1978) for discrete-time dynamics and Phillis (1985) for continuous-time dynamics. These authors addressed a closely related problem using a different methodology.
Instead of analyzing the closed-loop system directly, the filter and control gains were treated as open-loop controls to a modified deterministic dynamical system, whose cost function matches the expected cost of the original system. With that transformation, it is possible to use Pontryagin's maximum principle, which is applicable only to deterministic open-loop control, and obtain necessary conditions that the optimal filter and control gains must satisfy. Although our results were obtained independently, we have been able to verify that they are consistent with Pakshin (1978) by removing from our model the internal estimation noise (which to our knowledge has not been studied before), combining equations 4.2 and 5.2, and applying certain algebraic transformations. However, our approach has three important advantages. First, we managed to prove that the optimal control law is linear under a nonadaptive filter, while this linearity had to be assumed before. Second, using the optimal cost-to-go function to derive the optimal filter revealed that adaptive filtering improves performance, even though the control law is optimized for a nonadaptive filter. And most important, our approach yields a coordinate-descent algorithm with guaranteed convergence, as well as appealing numerical properties illustrated in sections 7 and 8. Each of the two steps of our
coordinate-descent algorithm is computed efficiently in a single pass through time. In contrast, application of Pontryagin’s maximum principle yields a system of coupled difference (Pakshin, 1978) or differential (Phillis, 1985) equations with boundary conditions at the initial and final time, but no algorithm for solving that system. In other words, earlier approaches obscure the key property we uncovered: that half of the problem can be solved efficiently given a solution to the other half. Finally, there may be an efficient way to obtain a control law that achieves better performance under adaptive filtering. Our attempt to do so through reoptimization (see section 8) failed, but another approach is possible. Using the optimal adaptive filter (see equation 5.3) would make E [vt+1 ] a complex function of xt , ut , and the resulting vt would no longer be in the assumed parametric form (which is why we introduced the restriction to nonadaptive filters). But we could force that complex vt in the desired form by approximating it with a quadratic in xt , ut . This yields additional terms in equation 4.2. We have pursued this idea in our earlier work (Todorov, 1998); an independent but related method has been developed by Moore, Zhou, and Lim (1999). The problem with such approximations is that convergence guarantees no longer seem possible. While Moore et al. did not illustrate their method with numerical examples, in our work we have found that the resulting algorithm is not always stable. These difficulties convinced us to abandon the earlier idea in favor of the methodology presented here. Nevertheless, approximations that take adaptive filtering into account may yield better control laws under certain conditions and deserve further investigation. Note, however, that the resulting control laws will have to be used in conjunction with an adaptive filter, which is much less efficient in terms of online computation. 
Acknowledgments

Thanks to Weiwei Li for comments on the manuscript. This work was supported by NIH grant R01-NS045915.

References

Anderson, F., & Pandy, M. (2001). Dynamic optimization of human walking. J. Biomech. Eng., 123(5), 381–390.
Beghi, A., & D'Alessandro, D. (1998). Discrete-time optimal control with control-dependent noise and generalized Riccati difference equations. Automatica, 34, 1031–1034.
Bensoussan, A. (1992). Stochastic control of partially observable systems. Cambridge: Cambridge University Press.
Bertsekas, D., & Tsitsiklis, J. (1997). Neuro-dynamic programming. Belmont, MA: Athena Scientific.
Burbeck, C., & Yap, Y. (1990). Two mechanisms for localization? Evidence for separation-dependent and separation-independent processing of position information. Vision Research, 30(5), 739–750.
Chow, C., & Jacobson, D. (1971). Studies of human locomotion via optimal programming. Math. Biosciences, 10, 239–306.
Davis, M., & Vinter, R. (1985). Stochastic modelling and control. London: Chapman and Hall.
El Ghaoui, L. (1995). State-feedback control of systems with multiplicative noise via linear matrix inequalities. Systems and Control Letters, 24, 223–228.
Flash, T., & Hogan, N. (1985). The coordination of arm movements: An experimentally confirmed mathematical model. Journal of Neuroscience, 5(7), 1688–1703.
Harris, C., & Wolpert, D. (1998). Signal-dependent noise determines motor planning. Nature, 394, 780–784.
Hatze, H., & Buys, J. (1977). Energy-optimal controls in the mammalian neuromuscular system. Biol. Cybern., 27(1), 9–20.
Hoff, B. (1992). A computational description of the organization of human reaching and prehension. Unpublished doctoral dissertation, University of Southern California.
Jacobson, D., & Mayne, D. (1970). Differential dynamic programming. New York: Elsevier.
Jones, K., Hamilton, A., & Wolpert, D. (2002). Sources of signal-dependent noise during isometric force production. Journal of Neurophysiology, 88, 1533–1544.
Kleinman, D. (1969). Optimal stationary control of linear systems with control-dependent noise. IEEE Transactions on Automatic Control, AC-14(6), 673–677.
Kording, K., & Wolpert, D. (2004). The loss function of sensorimotor learning. Proceedings of the National Academy of Sciences, 101, 9839–9842.
Kuo, A. (1995). An optimal control model for analyzing human postural balance. IEEE Transactions on Biomedical Engineering, 42, 87–101.
Kushner, H., & Dupuis, P. (2001). Numerical methods for stochastic optimal control problems in continuous time (2nd ed.). New York: Springer.
Li, W., & Todorov, E. (2004).
Iterative linear-quadratic regulator design for nonlinear biological movement systems. In First International Conference on Informatics in Control, Automation and Robotics, vol. 1 (pp. 222–229). N.P.: INSTICC Press.
Loeb, G., Levine, W., & He, J. (1990). Understanding sensorimotor feedback through optimal control. Cold Spring Harb. Symp. Quant. Biol., 55, 791–803.
McLane, P. (1971). Optimal stochastic control of linear systems with state- and control-dependent disturbances. IEEE Transactions on Automatic Control, AC-16(6), 793–798.
Meyer, D., Abrams, R., Kornblum, S., Wright, C., & Smith, J. (1988). Optimality in human motor performance: Ideal control of rapid aimed movements. Psychological Review, 95, 340–370.
Moore, J., Zhou, X., & Lim, A. (1999). Discrete time LQG controls with control dependent noise. Systems and Control Letters, 36, 199–206.
Nelson, W. (1983). Physical principles for economies of skilled movements. Biological Cybernetics, 46, 135–147.
Pakshin, P. (1978). State estimation and control synthesis for discrete linear systems with additive and multiplicative noise. Avtomatika i Telemetrika, 4, 75–85.
Phillis, Y. (1985). Controller design of systems with multiplicative noise. IEEE Transactions on Automatic Control, AC-30(10), 1017–1019.
Rami, M., Chen, X., & Moore, J. (2001). Solvability and asymptotic behavior of generalized Riccati equations arising in indefinite stochastic LQ problems. IEEE Transactions on Automatic Control, 46(3), 428–440.
Schmidt, R., Zelaznik, H., Hawkins, B., Frank, J., & Quinn, J. (1979). Motor-output variability: A theory for the accuracy of rapid motor acts. Psychol. Rev., 86(5), 415–451.
Sutton, G., & Sykes, K. (1967). The variation of hand tremor with force in healthy subjects. Journal of Physiology, 191(3), 699–711.
Sutton, R., & Barto, A. (1998). Reinforcement learning: An introduction. Cambridge, MA: MIT Press.
Todorov, E. (1998). Studies of goal-directed movements. Unpublished doctoral dissertation, Massachusetts Institute of Technology.
Todorov, E. (2002). Cosine tuning minimizes motor errors. Neural Computation, 14(6), 1233–1260.
Todorov, E. (2004). Optimality principles in sensorimotor control. Nature Neuroscience, 7(9), 907–915.
Todorov, E., & Jordan, M. (2002a). A minimal intervention principle for coordinated movement. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15 (pp. 27–34). Cambridge, MA: MIT Press.
Todorov, E., & Jordan, M. (2002b). Optimal feedback control as a theory of motor coordination. Nature Neuroscience, 5(11), 1226–1235.
Todorov, E., & Li, W. (2004). A generalized iterative LQG method for locally-optimal feedback control of constrained nonlinear stochastic systems. Manuscript submitted for publication.
Uno, Y., Kawato, M., & Suzuki, R. (1989). Formation and control of optimal trajectory in human multijoint arm movement: Minimum torque-change model. Biological Cybernetics, 61, 89–101.
Whitaker, D., & Latham, K. (1997). Disentangling the role of spatial scale, separation and eccentricity in Weber's law for position. Vision Research, 37(5), 515–524.
Whittle, P.
(1990). Risk-sensitive optimal control. New York: Wiley.
Willems, J. L., & Willems, J. C. (1976). Feedback stabilizability for stochastic systems with state and control dependent noise. Automatica, 1976, 277–283.
Winter, D. (1990). Biomechanics and motor control of human movement. New York: Wiley.
Received June 21, 2002; accepted October 1, 2004.
LETTER
Communicated by Jochen J. Steil
Universal Approximation Capability of Cascade Correlation for Structures

Barbara Hammer
[email protected] Institute of Computer Science, Clausthal University of Technology, 38678 Clausthal-Zellerfeld, Germany
Alessio Micheli
[email protected] Dipartimento di Informatica, Università di Pisa, Pisa, Italy
Alessandro Sperduti
[email protected] Dipartimento di Matematica Pura ed Applicata, Università di Padova, Padova, Italy
Cascade correlation (CC) constitutes a training method for neural networks that determines the weights as well as the neural architecture during training. Various extensions of CC to structured data have been proposed: recurrent cascade correlation (RCC) for sequences, recursive cascade correlation (RecCC) for tree structures with limited fan-out, and contextual recursive cascade correlation (CRecCC) for rooted directed positional acyclic graphs (DPAGs) with limited fan-in and fan-out. We show that these models possess the universal approximation property in the following sense: given a probability measure P on the input set, every measurable function from sequences into a real vector space can be approximated by a sigmoidal RCC up to any desired degree of accuracy up to inputs of arbitrarily small probability. Every measurable function from tree structures with limited fan-out into a real vector space can be approximated by a sigmoidal RecCC with multiplicative neurons up to any desired degree of accuracy up to inputs of arbitrarily small probability. For sigmoidal CRecCC networks with multiplicative neurons, we show the universal approximation capability for functions on an important subset of all DPAGs with limited fan-in and fan-out for which a specific linear representation yields unique codes. We give one sufficient structural condition for the latter property, which can easily be tested: the enumeration of ingoing and outgoing edges should be compatible. This property can be fulfilled for every DPAG with fan-in and fan-out two via reenumeration of children and parents, and for larger fan-in and fan-out via an expansion of the fan-in and fan-out and reenumeration of children and parents. In addition, the result can be generalized to the case of input-output isomorphic transductions of structures. Thus, CRecCC networks constitute the first neural models for which the universal approximation capability of functions involving fairly general acyclic graph structures is proved.

Neural Computation 17, 1109–1159 (2005)
© 2005 Massachusetts Institute of Technology
1 Introduction

Pattern recognition methods such as neural networks or support vector machines constitute powerful machine learning tools in a widespread area of applications. Usually they process data in the form of real vectors of fixed and finite dimensionality. Canonical extensions of these basic models to sequential data that occur in language processing, bioinformatics, or time-series prediction, for example, are provided by recurrent networks or statistical counterparts (Bengio & Frasconi, 1996; Kremer, 2001; Sun, 2001). Recently, an increasing interest in the possibility of dealing with more complex nonstandard data has been observed, and several models have been proposed within this topic: specifically designed kernels such as string and tree kernels or kernels derived from statistical models constitute a canonical interface for kernel methods for structures (Haussler, 1999; Jaakkola, Diekhans, & Haussler, 2000; Leslie, Eskin, & Noble, 2002; Lodhi, Shawe-Taylor, Christianini, & Watkins, 2000; Watkins, 1999). Alternatively, recurrent neural models can be extended to complex dependencies that occur in tree structures, graphs, or spatial data (Baldi, Brunak, Frasconi, Pollastri, & Soda, 1999; Frasconi, Gori, & Sperduti, 1998; Goller & Küchler, 1996; Hammer, 2000; Micheli, Sona, & Sperduti, 2000; Pollastri, Baldi, Vullo, & Frasconi, 2002; Sperduti, Majidi, & Starita, 1996; Sperduti & Starita, 1997; Wakuya & Zurada, 2001). These structure processing models have the advantage that complex data such as structured data, tree structures, and graphs can be directly tackled, and thus a possibly time-consuming and usually not information-preserving encoding of the structures by a finite number of real-valued features can be avoided.
Successful applications of these models have been reported in such areas as image and document processing, logic, natural language processing, chemistry, DNA processing, homology detection, and protein structure prediction (Baldi et al., 1999; Diligenti, Frasconi, & Gori, 2003; Goller, 1997; Jaakkola et al., 2000; Leslie et al., 2002; Lodhi et al., 2000; de Mauro, Diligenti, Gori, & Maggini, 2003; Pollastri et al., 2002; Sturt, Costa, Lombardo, & Frasconi, 2003; Vullo & Frasconi, 2003). Here, we are interested in a variant of so-called recursive networks, which constitute a straightforward generalization of well-known recurrent networks to tree structures and for which excellent results for large learning tasks have been reported (Frasconi, 2002). Training recurrent neural networks faces severe problems, such as the problem of long-term dependencies, and the design of efficient training algorithms is still a challenging problem of ongoing research (Bengio, Simard,
Universal Approximation Capability
& Frasconi, 1994; Hammer & Steil, 2002). Since the number of potentially different structures increases exponentially with the size of the data, the space is often only sparsely covered by a given training set. In addition, the generalization ability might be fundamentally worse compared to simple pattern recognition tools because the capacity of the models (measured, for example, in terms of the VC dimension) depends on the structures (Hammer, 2001). For recursive models for structures, the long-term dependencies problem is often reduced because structural representations via trees or graphs yield more compact representation of data. However, the other problems remain. Thus, the design of efficient training algorithms that yield sparse models with good generalization ability is a key issue for recursive models for structures. Cascade correlation (CC) has been proposed by Fahlmann and Lebiere (1990) as a particularly efficient training algorithm for feedforward networks that simultaneously determines the architecture of the network and the parameters. Hidden neurons are created and trained consecutively, so as to maximize the correlation of the hidden neuron with the current residual error. Thus, hidden neurons can be used for error correction in consecutive steps, which often leads to very sparse solutions with good generalization ability. CC proved to be particularly appropriate for training problems where standard feedforward network training is difficult, such as the two spirals problem (Fahlmann & Lebiere, 1990). The main difference of CC networks with respect to feedforward networks lies in the particularly efficient constructive training method. Since any feedforward network structure can be embedded into the structure of a CC network and vice versa, the principled representational capabilities of CC networks and standard feedforward networks coincide (Hornik, Stinchcombe, & White, 1989). 
Like other standard feedforward networks, CC deals only with vectors of fixed dimensionality. Recurrent cascade correlation (RCC) has been proposed as a generalization of CC to sequences of unlimited length (Fahlmann, 1991). Recently, more powerful models for tree structures and acyclic directed graphs have been proposed and successfully applied as prediction models for chemical structures: recursive cascade correlation and contextual recursive cascade correlation, respectively (Bianucci, Micheli, Sperduti, & Starita, 2000; Micheli, 2003; Micheli, Sona, & Sperduti, 2002, in press; Micheli et al., 2000; Sperduti et al., 1996). The efficient training algorithm of CC can be transferred to these latter models, and very good results have been reported. Here, we are interested in the in-principle representational capabilities of these models for different types of structures. RCC extends simple CC by recurrent connections such that input sequences can be processed step by step. The iterative training scheme of hidden neurons is thereby preserved, and training is very efficient. However, because of this iterative training scheme, hidden neurons have to be independent of all hidden neurons that are introduced later. As a consequence, an RCC network, unlike an RNN, is not fully recurrent. Recurrent
connections of a hidden neuron point toward the neuron itself and toward hidden neurons that are introduced later, but not toward previous hidden neurons. Because of this fact, RNNs in general cannot be embedded in RCC networks: the two models differ with respect to the network architectures. RNNs constitute universal approximators (Funahashi & Nakamura, 1993), and the question arises whether the restricted topology of RCC networks, which makes the particularly efficient iterative training scheme possible, preserves this property. This is not clear because local recurrence instead of full connectivity of neurons might severely limit the representational capabilities. Frasconi and Gori (1996) have shown that the number of functions that can be implemented by a specific locally recurrent network architecture with the Heaviside activation function can be limited regardless of the number of neurons, and thus restrictions apply. The work presented in Giles et al. (1995) explicitly investigates RCC architectures and shows that these networks equipped with the Heaviside function or a monotonically increasing activation function cannot implement all finite automata. Thus, in the long-term limit, RCC networks are strictly weaker than fully connected recurrent networks. However, the universal approximation capability of recurrent networks usually refers to the possibility of approximating every given function on a finite time horizon. This question, which has not yet been answered in the literature, is investigated in this article. We will show that RCC networks are universal approximators with respect to a finite time horizon; they can approximate every function up to inputs of arbitrarily small probability arbitrarily well. Hence, RCC networks and recurrent networks do not differ with respect to this notion of approximation capability. Note that RCC networks, unlike RNNs, are built and trained iteratively, adding one hidden neuron at a time.
Thus, the architecture is determined automatically, and only small parts of the network are adapted during a training cycle. RCC networks have been generalized to recursive cascade correlation (RecCC) networks for tree-structured inputs (Bianucci et al., 2000; Sperduti et al., 1996; Micheli, 2003). Recursive networks constitute the analogous extension of recurrent networks to tree structures (Frasconi et al., 1998; Goller & Küchler, 1996; Sperduti & Starita, 1997). For the latter, the universal approximation capability has been established in Hammer (2000). However, RecCC networks share the particularly efficient iterative training scheme of RCC and CC networks; thus, in RecCC networks, too, recurrence is restricted to recursive connections from hidden neurons to neurons introduced later, but not in the opposite direction. Thus, it is not surprising that the same restrictions as for RCC networks apply, and some functions, which can be represented by fully recursive neural networks, cannot be represented by RecCC networks in the long-term limit because of these restrictions (Sperduti, 1997). However, the practically relevant representational capabilities restricted to a finite horizon, that is, input trees of bounded maximum depth, have not yet been investigated in the literature. We show
in this article that RecCC networks with multiplicative neurons possess the universal approximation property for finite time horizon and thus constitute valuable and fast alternatives to fully connected recursive networks for practical applications. Recurrent and recursive models put a causality assumption on data processing: sequences are processed from one end of the sequences to the other, and trees are processed from the leaves to the root. Thus, the state variables associated with a sequence entry depend on the left context, but not on the right one, and state variables associated with internal nodes in a tree depend on their children, but not the other way around. This causality assumption might not be justified by the data. For time series, the causality assumption implies that events in the future depend on events in the past, but not vice versa, which is reasonable. For spatial data such as sentences or DNA strings, the same causality assumption implies that the value at an interior position of the sequence depends on only the left part of the sequence but not on the right one. This is obviously not true in general. A similar argument can be stated for tree structures and more general graphs such as they occur in logic or chemistry, for example. Therefore, extensions of recursive models for sequences and tree structures have been proposed that also take additional contextual information into account. The model proposed in Baldi et al. (1999) trains two RNNs simultaneously to predict the secondary structure of proteins. The networks are thereby transforming the given sequence in reverse directions, such that the full information is available at each position of the sequence. A similar approach is proposed in Pollastri et al. (2002) for grids, whereby four recursive networks are trained simultaneously in order to integrate all available information for each vertex of a regular two-dimensional lattice. 
Similar problems for sequences have been tackled by a cascade correlation approach including context in Micheli et al. (2000). Another model with similar ideas has been proposed in Wakuya and Zurada (2001). Note that several of these models simply construct disjoint recursive networks according to different causality assumptions and afterward integrate the full information in a feedforward network. However, all reported experiments show that the accuracy of the models can be improved significantly by this integration of context if spatial data or graph structures are dealt with. The restricted recurrence of cascade correlation offers a particularly simple, elegant, and, as we will see, fundamentally different way to include contextual information: the hidden neurons are consecutively trained and, once they are trained, frozen. Thus, we can introduce recursive as well as contextual connections of these neurons to neurons introduced later without getting cyclic (i.e., ill-defined) dependencies; that is, given a vertex in a tree or graph, the hidden neuron activation for this vertex can have direct access to the state of its children and its parents for each previous hidden neuron. Thereby, the effective iterative training scheme of CC is preserved. Obviously, this possibility of context integration essentially depends on the
recurrences being restricted, and it cannot be transferred to fully recursive networks. This generalization of RecCC networks toward context integration has recently been proposed in Micheli et al. (2002, in press). These models are applied to prediction tasks for chemical structures, and they compare favorably to models without context integration. Plots of the principal components of the hidden neurons’ states of the contextual models indicate that the additional contextual information is effectively used by the network. Unlike the approaches noted, contextual information is here directly integrated into the recursive part of the network such that the information can be used at an earlier stage than in Baldi et al. (1999) and Pollastri et al. (2002). Thanks to these characteristics, in the cascade approach, the processing of contextual information is efficiently extended, with respect to the previous approaches, from sequences or grids to more complex structured data such as trees and acyclic graphs. The question arises whether the in-principle extension to contextual processing and the improved accuracy in experiments can be accompanied by more general theoretical results on the universal approximation capability of the model. A first study of the enlarged capacity of the model can be found in Micheli et al. (in press) and Micheli (2003), where it is shown that in a directed acyclic graph, contextual RecCC (CRecCC) takes into account the context of each vertex, including both successors and predecessors of the vertex (as formally proven in Micheli, Sona, & Sperduti, 2003). Moreover, the integration of context extends the class of functions that can be approximated in the sense that more general structures can be distinguished: contextual models can differentiate between vertices with different parents but the same content and children.
Finally, contextual RecCC can implement classes of contextual functions where a desired output for a given vertex may depend on the predecessors of the vertex, up to the whole structure (contextual transductions) (Micheli et al., in press; Micheli, 2003). Of course, contextual RecCC networks can still implement all the functions that are computable by RCC models (Micheli et al., in press; Micheli, 2003). We will show in this article that the proposed notion of context yields universal approximators for rooted directed acyclic graphs with appropriate positioning of children and parents. The argumentation can be transferred to the more general case of input-output (IO) isomorphic transduction of such structures. Hence, context integration enlarges the class of functions that can be approximated to acyclic graph structures. We first formally introduce the various network models and clarify the notation. We then consider the universal approximation capability of cascade correlation networks in detail. We start with the simplest case: the universal approximation property of standard RCC networks. Only two parts of the proof are specific for the case of sequences; the other parts can be transferred to more general cases. We then prove the universal approximation capability of RecCCs, substituting the specific parts of the previous proof. Finally we investigate CRecCCs, which are particularly interesting,
since CRecCCs do not constitute universal approximators for all structures, but only for a subset. We provide one example of graph structures that cannot be differentiated by these models. We then define a linear encoding of acyclic graph structures and discuss one sufficient property that can easily be tested and ensures that this encoding is unique. This property requires that the positioning of children and parents can be done in a compatible way. We show that this can always be achieved by reenumeration and expansion by empty vertices for directed positional acyclic graphs with limited fan-in and fan-out. Structures that fulfill this property are referred to as directed bipositional acyclic graphs (DBAGs). Afterward, we show the universal approximation for contextual RecCC networks on DBAGs. This result can immediately be extended to general graph structures for which the proposed encoding is unique. An informal discussion of this property is given. We conclude with a discussion.

2 Cascade Correlation Models for Structures

Standard feedforward neural networks (FNNs) consist of neurons connected in an acyclic directed graph. Neuron i computes the function

    x_i = σ_i ( Σ_{j→i} w_{ij} x_j + θ_i )

of the outputs x_j of its predecessors, where w_{ij} ∈ R are the weights, θ_i ∈ R is the bias, and σ_i : R → R constitutes the activation function of neuron i. The input neurons (i.e., neurons without predecessors) are directly set to external values, and the output of the network can be found at specified output neurons. All other neurons are called hidden neurons. Often neurons are arranged in layers with full connectivity of the neurons of one layer to the consecutive one. Activation functions σ_i that are of interest in this article include the logistic function,

    sgd(x) = 1 / (1 + exp(−x)),

the Heaviside function,

    H(x) = 1 if x ≥ 0, 0 otherwise,
and the identity id(x) = x. More generally, a squashing activation function f : R → R is a monotonic function such that values l_f < h_f exist with lim_{x→∞} f(x) = h_f and lim_{x→−∞} f(x) = l_f. A network that uses activation functions from a set F of functions is referred to as an F-FNN. It is well known that FNNs with only one hidden layer with a squashing activation function and linear outputs fulfill a universal approximation property (Hornik et al.,
1989). The universal approximation property holds for FNNs if, for a given probability measure P on R^n, any Borel-measurable f : R^n → R^m, and any ε, δ > 0, we can find an FNN that computes a function g such that P(x ∈ R^n | |f(x) − g(x)| > δ) < ε. There exist alternative characterizations of the hidden neurons' activation function for which the universal approximation property holds, for example, being analytic (Scarselli & Tsoi, 1998). Often, more powerful neurons computing products of inputs are considered (Alquezar & Sanfeliu, 1995; Forcada & Carrasco, 1995; Hornik et al., 1989). Let us denote the indices of the predecessors of neuron i by i_1, . . . , i_j. A multiplicative neuron of degree d ≥ 1 computes a function of the form

    x_i = σ_i ( Σ_{k=1}^{d} Σ_{j_1,...,j_k ∈ {i_1,...,i_j}} w_{i j_1 ... j_k} x_{j_1} · . . . · x_{j_k} + θ_i ),
where w_{i j_1 ... j_k} ∈ R refer to the weights and θ_i is the bias, as above. We will explicitly indicate if multiplicative neurons are used. Otherwise, the term neuron refers to standard neurons. Standard neural network training first chooses the network architecture and then fits the parameters on the given training set. In contrast, cascade correlation (CC) determines the architecture and the parameters simultaneously. CC starts with a minimum architecture without hidden neurons and iteratively adds hidden neurons until the approximation accuracy is satisfactory. Each hidden neuron is connected to all inputs and all previous hidden neurons. The weights associated with these connections are trained in such a way as to maximize the correlation between the neuron's output and the residual error of the network constructed so far. Afterward, the weights entering the hidden neuron are frozen. All weights to the outputs are retrained according to the given task. If needed, new hidden neurons are iteratively added in the same way. One CC network with two hidden neurons is depicted in Figure 1. Since only one neuron is trained at each stage, this algorithm is very fast. Moreover, it usually provides excellent classification results with a small number of hidden neurons, which, because of the specific training method (Fahlmann & Lebiere, 1990), serve as error-correcting units. This training scheme is used for all further cascade correlation models introduced in this article. Since we are not interested here in the specific training method but in the in-principle representation and approximation capabilities of the architectures, we will not explain the training procedure in more detail. The only point of interest in this article is that the iterative training method assumes that hidden neurons are functionally dependent on previous hidden neurons, while they are functionally independent of hidden neurons introduced later.
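The constructive scheme just described can be sketched in a few lines of NumPy. This is an illustrative simplification, not Fahlmann and Lebiere's (1990) algorithm: candidate units are drawn from a random pool and the most error-correlated one is kept (standing in for gradient-based candidate training), and bias terms are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def sgd(x):
    """Logistic activation, sgd(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def train_cc(X, y, n_hidden=5, n_candidates=50):
    """Grow a cascade network: each new hidden unit sees the inputs and all
    previously frozen hidden units, is chosen to maximize the correlation of
    its output with the current residual error, and is then frozen."""
    Z = X.copy()                      # columns: inputs + frozen hidden outputs
    frozen = []                       # frozen incoming weight vectors
    for _ in range(n_hidden):
        w_out, *_ = np.linalg.lstsq(Z, y, rcond=None)   # (re)fit output weights
        resid = y - Z @ w_out                           # current residual error
        best_w, best_corr = None, -np.inf
        for _ in range(n_candidates):                   # random candidate pool
            w = rng.normal(size=Z.shape[1])
            h = sgd(Z @ w)
            corr = abs(np.sum((h - h.mean()) * (resid - resid.mean())))
            if corr > best_corr:
                best_w, best_corr = w, corr
        frozen.append(best_w)                           # freeze incoming weights
        Z = np.column_stack([Z, sgd(Z @ best_w)])       # unit sees all predecessors
    w_out, *_ = np.linalg.lstsq(Z, y, rcond=None)       # final output retraining
    return frozen, w_out

X = rng.normal(size=(200, 3))
y = np.sin(X[:, 0]) + X[:, 1] * X[:, 2]
frozen, w_out = train_cc(X, y)
```

Since the output weights are refit by least squares over a feature set that only grows, the training error can only decrease as hidden units are added.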
Figure 1: (Left) An example of a cascade correlation network with cascaded hidden neurons x_i added iteratively to the network. (Right) A recurrent cascade correlation network, where recursive connections, indicated by thick lines marked with r, are used to store contextual information from the first part of a processed sequence. The activation of a neuron from the previous time step is propagated through the recurrent connections. Note that recurrent connections point from x_1 to x_2 but not vice versa.
2.1 Recurrent Models. Obviously, FNNs can deal only with input vectors of finite and fixed size. Recurrent networks (RNNs) constitute one alternative approach, which allows the processing of sequences of unlimited length. Denote by L a set where the labels are taken from, for example, L = R^n for continuous labels or L = {1, . . . , r} for discrete labels. Assume a sequence [l_1, . . . , l_T] of length T over L is given. Denote the set of finite sequences over L by L*. A recurrent network recursively processes a given sequence from left to right. Assume that the output of neuron i has to be computed for a given sequence entry l_t at position t of the sequence. Assume N neurons are present. We assume for simplicity that the immediate predecessors of neuron i are the neurons from 1 to i − 1.¹ The functional dependence of neuron i can then be written as

    x_i(l_t) = τ_i(l_t, x_1(l_t), . . . , x_{i−1}(l_t), x_1(l_{t−1}), . . . , x_N(l_{t−1})),    (2.1)

¹ It can always be achieved via reenumeration that the predecessors are a subset of this set of neurons. Thus, setting some weights to 0, we can represent every FNN in this form.
1118
B. Hammer, A. Micheli, and A. Sperduti
where τ_i is a function computed by a single neuron,² and for the initial case t = 1, we set x_i(l_0), interpreted as the output for the empty sequence, to some specified value, for example, 0. The computed values depend on the values of all neurons in the previous time step and thus, indirectly, on the left part of the sequence. The final output for the sequence is given by the values x_i(l_T) obtained after processing the last entry. Recurrent cascade correlation (RCC) also refers to the activation of neurons in the previous time step. We assume that the neurons are enumerated in the order in which they are created during training: neuron i is added to the RCC network in the ith construction step. Because of the iterative training scheme, hidden neuron i is functionally dependent only on neurons introduced in previous steps. The functional dependence now reads as

    x_i(l_t) = τ_i(l_t, x_1(l_t), . . . , x_{i−1}(l_t), x_1(l_{t−1}), . . . , x_i(l_{t−1})),

where τ_i is the function computed by a single neuron. Again, x_i(l_0) is set to some specified initial value. Note that unlike RNNs, RCC networks are not fully connected. Recurrent connections from neurons j > i to neuron i are dropped, because all weights to neuron i are, once trained, frozen. One example network with two hidden neurons is depicted in Figure 1. It is well known that fully recurrent neural networks constitute universal approximators in the sense that, given appropriate activation functions of the neurons, any measurable function from sequences into a real vector space can be approximated arbitrarily well; that is, the distance of the desired output from the network output is at most δ for inputs of probability at least 1 − ε (Hammer, 2000). We will show that an analogous result also holds for RCC networks with restricted recurrence.

2.2 Recursive Models. Both RNNs and RCC networks have been generalized to tree structures in the literature.
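Before moving to trees, the restricted recurrence of RCC hidden neurons described in section 2.1 can be sketched as a forward pass: neuron i at step t sees the current input l_t, the current outputs of neurons 1, . . . , i − 1, and the previous-step outputs of neurons 1, . . . , i, but never those of neurons added later. The weight layout and names below are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def sgd(x):
    """Logistic activation."""
    return 1.0 / (1.0 + np.exp(-x))

def rcc_forward(seq, weights):
    """seq: list of input vectors l_1, ..., l_T.
    weights[i] = (w_in, w_cur, w_rec, bias) for hidden neuron i (0-indexed):
    w_cur has length i (current outputs of earlier neurons) and w_rec has
    length i + 1 (previous-step outputs of neurons 1..i in the paper's
    1-indexing), mirroring the restricted RCC connectivity."""
    N = len(weights)
    prev = np.zeros(N)                     # x_i(l_0) := 0, the initial state
    for l in seq:
        cur = np.zeros(N)
        for i, (w_in, w_cur, w_rec, b) in enumerate(weights):
            s = w_in @ l + w_cur @ cur[:i] + w_rec @ prev[:i + 1] + b
            cur[i] = sgd(s)                # neuron i never sees neurons j > i
        prev = cur
    return prev                            # outputs after the last entry

rng = np.random.default_rng(1)
n_in, N = 2, 3
weights = [(rng.normal(size=n_in), rng.normal(size=i),
            rng.normal(size=i + 1), 0.0) for i in range(N)]
out = rcc_forward([rng.normal(size=n_in) for _ in range(5)], weights)
```

A fully recurrent network would instead let every neuron read all N previous-step activations; here w_rec has length i + 1, mirroring the frozen, iteratively grown connectivity.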
A tree over L consists of either an empty structure ξ or a vertex v with label l(v) ∈ L and a finite number of subtrees, some of which may be empty. We refer to the ith child of v by ch_i(v). A tree has limited fan-out k if each vertex has at most k children or subtrees. In this case, we can always assume that exactly k children are present for any nonempty vertex, introducing empty vertices if necessary. The set of finite trees with fan-out k over L is denoted by L*_k. For simplicity, we deal only with fan-out 2 in the following. The generalization to larger

² Here we assume only dependencies from the previous time step. Dependencies from time steps t − i, i > 1, could be considered and added to equation 2.1, since the definition would still be well defined. Because more information is then available at one time step, benefits for concrete learning problems could result. Obviously, the universal approximation capability of equation 2.1 immediately transfers to the more general definition. Thus, for simplicity, we restrict ourselves to equation 2.1 in the following, and we restrict in a similar way for the case of input trees or structures.
fan-out will be immediate. Fan-out 1 gives sequences. Since trees form a recursive structure, recursive processing of each vertex in the tree starting from the leaves up to the root is possible. This leads to the notion of recursive networks (RecNNs), as introduced, for example, in Frasconi et al. (1998), Goller and Küchler (1996), and Sperduti and Starita (1997), and recursive cascade correlation (RecCC), as proposed in Bianucci et al. (2000) and Sperduti et al. (1996). As before, for RecNN, we assume direct dependence of neuron i from neurons 1 by interpreting the output of the last m neurons x_{max−m}, . . . , x_{max} as the output of the network.

Lemma 1. Assume f : R^i → R^n and g : R^m → R are computed by FNNs. Assume h : (R^n)* → R^m is computed by an RCC network. Then the composition g ◦ h ◦ f^D : (R^i)* → R can also be computed by an RCC network.

Proof. This statement is easy because we can interpret all neurons of f and g as additional hidden neurons of the RCC network, setting connections that are not used in these subnetworks but are introduced due to the definition of RCC architectures to 0. More precisely, denote the neurons in f by n_1, . . . , n_{i_f}, whereby connections n_i → n_j exist only if i < j. Denote the neurons in g by m_1, . . . , m_{i_g}, again with connections existing only to neurons with larger indices. Denote the neurons of the RCC network by o_1, . . . , o_{i_h}, arranged in the order of their appearance. Then we can choose the weights of an RCC network such that hidden neuron number i takes the role of

    x_i = n_i              if i ≤ i_f,
          o_{i−i_f}        if i_f < i ≤ i_f + i_h,
          m_{i−i_f−i_h}    otherwise.

This is possible because the functional dependence of the RCC hidden neurons ensures direct access to the respective predecessors in the single networks: each hidden neuron has access to the activations of all previous neurons. All recursive connections in the RCC network for n_i and m_i are set to 0.
The same holds for all additional feedforward as well as recursive connections not present in the original parts n_i, o_i, and m_i. Note that recurrence in the first part, corresponding to f, has the effect that f^D is computed on input sequences. Recurrence in g has no effect because the output of the last sequence entry is considered the output of the entire computation. Note that an analogous result holds for RecCC networks and CRecCC networks. It is not necessary to change the proof if h and g ◦ h ◦ f^D are computed by RecCC networks or CRecCC networks. We obtain as an immediate result for the sequence case:

Theorem 1. Assume L = {1, . . . , r} is a finite set. Assume P is some probability measure on finite sequences over L. Assume f : L* → R is a function. Assume ε > 0. Then there exists an RCC network with activation functions in {id, σ} that computes a function g such that P(x ∈ L* | f(x) ≠ g(x)) < ε. σ is thereby some activation function for which the universal approximation property of FNNs is fulfilled.
Proof. L* decomposes into sets of sequences of different length. We denote these sets by (L*)_i for i = 1, . . . , ∞ with an arbitrary enumeration.³ Since (L*)_i is measurable, 1 = P(L*) = Σ_{i=1}^∞ P((L*)_i); thus, we find Σ_{i=n_0}^∞ P((L*)_i) < ε/4 for some n_0. Therefore, we can find some length h (the maximum length in the sets (L*)_i for i < n_0) such that longer sequences have probability smaller than ε. For sequences of length at most h over L, we can find an RCC network with the identity as activation function that maps these inputs to unique values, as shown above. Because FNNs with activation function σ and linear outputs possess the universal approximation property, we can find an FNN that maps these encodings exactly to the outputs given by f (Sontag, 1992). The composition of these two networks can, because of lemma 1, be interpreted as an RCC network with the desired properties. Note that we have restricted ourselves to outputs in R for convenience. It is obvious from the above construction that the same result can be obtained for functions f : L* → R^m if we integrate several output neurons into RCC networks. The same remark holds for all universal approximation results that we present here, thus giving us approximation results for output set R^m instead of R. We can extend the above result to sequences with real labels, introducing a first part that maps real-valued labels to a discrete set.

Theorem 2. Assume L = R^n. Assume P is some probability measure on finite sequences over L. Assume f : L* → R is a Borel-measurable function. Assume ε > 0, δ > 0. Then there exists an RCC network with activation functions in {id, σ, H} that computes a function g such that P(x ∈ L* | |f(x) − g(x)| > δ) < ε. σ is thereby some activation function for which the universal approximation property of FNNs is fulfilled.

Proof. We can fix some length h such that longer sequences have probability at most ε/4.
We can find a constant C > 0 such that sequences with one entry outside [−C, C)^n have probability smaller than ε/4.⁴ For sequences of length at most h and labels in [−C, C)^n, we can approximate f by a continuous function such that the probability of values with distance larger than δ/2

³ We could simply choose (L*)_i as the set of sequences of length i. However, we would like to achieve compatibility with the more general case of tree structures or DPAGs, such that we use an arbitrary enumeration of the subsets with fixed structure.

⁴ This is the same argument as above because the set of sequences decomposes into subsets of sequences with entries in a fixed interval with boundaries given by natural numbers.
is at most ε/4.⁵ Hence, substituting ε by ε/4 and δ by δ/2, it is sufficient to prove the stated theorem for continuous functions f on sequences of limited length at most h with limited labels in [−C, C)^n. Note that such a function f is uniformly continuous, being a function on a compact interval. Hence, we can find ε_0 > 0 with the following property⁶: given any two sequences D and D′ of the same structure (i.e., the same length), assume l_v is an entry of D and l′_v is an entry of D′ at the same position. Assume that for every such pair of entries, the inequality |(l_v)_i − (l′_v)_i| < ε_0 holds for all positions i = 1, . . . , n, where (l_v)_i and (l′_v)_i, respectively, denote component number i of the respective entry. Then |f(D) − f(D′)| < δ. Next we choose points I_0 < I_1 < · · · < I_p in [−C, C] such that I_0 = −C, I_p = C, and |I_j − I_{j−1}| < ε_0 for all j = 1, . . . , p. Note that varying the components of the entries of a sequence within some interval [I_{j−1}, I_j) does not change the output with respect to f by more than δ. Therefore, we can find a function f_1 that differs from f by at most δ for all inputs and maps all sequences of the same structure where the components of the sequence elements are contained in the same intervals [I_{j−1}, I_j) to a constant value. We will now show that this function f_1 can be computed by an RCC network, which approximates f as stated in the theorem. The characteristic function of [I_{j−1}, I_j) can be computed as 1_{[I_{j−1},I_j)}(x) = H(x − I_{j−1}) − H(x − I_j). The function

    d : [−C, C)^n → {1, . . . , (p + 1)^n − 1},
    (x_1, . . . , x_n) ↦ Σ_{i=1}^{n} Σ_{j=1}^{p} j · 1_{[I_{j−1},I_j)}(x_i) · (p + 1)^{i−1},
uniquely encodes the intervals in which the components of the input lie; d thereby uses an encoding to base p + 1 of the p intervals in n dimensions. Since f_1 is piecewise constant, we find a function f_2 such that f_1 = f_2 ◦ d^D, where, as before, d^D denotes the application of d to each element in a sequence. This function f_2 maps a code to the output of some fixed sequence with entries in the encoded intervals as indicated by the input. Because of theorem 1, we can find an RCC network g_1 that computes f_2. (Since we are dealing with only a finite length at this point, the approximation can be done for all input values.) Hence, g_1 ◦ d^D equals f_1. Note that d can be computed by a feedforward network with activation functions in {H, id}.
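The discretization d in the proof can be sketched directly. Note that the factor j below (the interval index used as a base-(p + 1) digit) is reconstructed from the stated codomain {1, . . . , (p + 1)^n − 1} and the remark about base-(p + 1) encoding, so treat the exact form as an informed reading rather than a verbatim formula.

```python
import numpy as np

# Sketch of the discretization d: each component x_i is mapped to the index
# j of the interval [I_{j-1}, I_j) containing it, and the indices are
# combined as digits of a base-(p+1) number, so every grid cell of the
# discretized domain [-C, C)^n receives a unique code.
def encode(x, I):
    """x: point in [-C, C)^n; I: interval borders I_0 < ... < I_p."""
    p = len(I) - 1
    code = 0
    for i, xi in enumerate(x):
        j = int(np.searchsorted(I, xi, side="right"))  # interval index in 1..p
        code += j * (p + 1) ** i                       # digit j at position i
    return code

I = [-1.0, -0.5, 0.0, 0.5, 1.0]   # p = 4 intervals covering [-1, 1)
print(encode([-0.7, 0.3], I))     # -> 16: digits (1, 3) in base 5
```

Points in the same grid cell share a code, and points in different cells get different codes, which is exactly what f_2 needs in order to be well defined on the codes.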
⁵ This is a well-known fact about functions over the real numbers, and it follows from, for example, the approximation results in Hornik et al. (1989).
⁶ For convenience, we use a notation that could refer to sequences as well as trees or DPAGs. An identical proof shows an analogous result for these more general cases.
1128
B. Hammer, A. Micheli, and A. Sperduti
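The construction of d can be made concrete. The following Python sketch (the helper names are ours, not from the text) builds the characteristic functions of the intervals [I_{j−1}, I_j) as differences of Heaviside steps and combines them into the base-(p + 1) code; distinct interval combinations then receive distinct codes.

```python
def heaviside(x):
    # H(x) = 1 for x >= 0 and 0 otherwise
    return 1 if x >= 0 else 0

def interval_indicator(x, lo, hi):
    # characteristic function of [lo, hi) as a difference of Heaviside
    # steps, as in the text: 1_[lo,hi)(x) = H(x - lo) - H(x - hi)
    return heaviside(x - lo) - heaviside(x - hi)

def d(xs, borders):
    # base-(p + 1) code of the intervals in which the components of xs lie;
    # borders = [I_0, ..., I_p] is the discretization of [-C, C]
    p = len(borders) - 1
    code = 0
    for i, x in enumerate(xs):  # i is 0-based, so (p + 1) ** i = (p + 1)^(i-1) of the 1-based formula
        for j in range(1, p + 1):
            code += j * interval_indicator(x, borders[j - 1], borders[j]) * (p + 1) ** i
    return code
```

For example, with borders [−1, 0, 1] (so p = 2) and n = 2, the four possible interval combinations of a pair of components are mapped to four distinct codes.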
Because of lemma 1, g_1 ∘ d_D can be computed by an RCC network with activation functions in {H, id, σ}. Usually a sigmoidal activation function is used for all neurons of an RCC network but the last one, which is linear to account for the a priori unbounded output set. We are therefore interested in the possibility of substituting the above specific activation functions by a sigmoidal function σ. Note that for any function σ : R → R that is continuously differentiable in the vicinity of some point x_0 with nonvanishing derivative σ′(x_0), the following holds:

(σ(x_0 + ε · x) − σ(x_0)) / ε → x · σ′(x_0)   (ε → 0).
This convergence is uniform on compacta. If the above property holds for σ, we refer to σ as being locally C¹. Moreover, for a squashing function σ with limits l_σ and h_σ, we find (σ(x/ε) − l_σ)/(h_σ − l_σ) → H(x) (ε → 0) for all x ≠ 0. The convergence is uniform on (−∞, −δ] ∪ [δ, ∞) for any δ > 0. These two properties are in particular fulfilled for the standard logistic activation function. The following theorem allows us to substitute the above specific activation functions by an activation function as used in practice.

Theorem 3. Assume L = R^n. Assume P is some probability measure on finite sequences over L. Assume f : L* → R is a Borel-measurable function. Assume ε > 0, δ > 0. Assume σ : R → R is a continuous and locally C¹ squashing function. Then there exists an RCC network with activation function σ and linear activation function for the last neuron of the network, which computes a function g such that P(x ∈ L* : |f(x) − g(x)| > δ) < ε.

Proof. Note that a squashing activation function leads to the universal approximation property of FNNs (Hornik et al., 1989), so we can construct as in theorem 2 an RCC network that contains the activation functions id and H in addition to σ. We now show that the activation functions id and H can be substituted in the network by the above approximations for appropriate ε, such that the difference of the outputs is small for inputs of high probability. Note that we argued in the proof of theorem 2 that we can restrict ourselves to finite length h and labels in [−C, C)^n. Moreover, we can choose the interval borders I_j, which have been used to discretize the inputs, in such a way that the probability of sequences that contain a label with a component in (I_j − δ_0, I_j + δ_0) is smaller than ε/2 for some δ_0. Denote L′ = ([−C, C] \ ∪_{j=0}^{p} (I_j − δ_0, I_j + δ_0))^n. Since the involved functions are
Universal Approximation Capability
1129
continuous, the set of activations x_i(l_v) for all elements l_v of sequences of length at most h with labels in L′ is compact; say it is contained in [−C_0, C_0]. We formally substitute each activation function id (except for the output neuron, that is, the last neuron in the network) by the term (which we simply regard as a slightly more complicated activation function, for the moment) (σ(x_0 + ε_0 · x) − σ(x_0))/(ε_0 · σ′(x_0)) with parameter ε_0, which has to be fixed later. Each activation function H in the network is substituted by the term (σ(x/ε_0) − l_σ)/(h_σ − l_σ), where ε_0 has to be chosen later. Denote by x_i the original neurons and by x_i^0 the neurons that involve the above more complicated activation functions; thereby, x_i = x_i^0 if the activation function is σ. We now show that some ε_0 can be found such that the resulting network differs from the original one by no more than δ/2 for all inputs in (L′)* with length at most h. The following observations are easy:

1. Given some neuron x_i with activation function H. Such a neuron is functionally dependent only on the label of a given entry l_v in the above network by construction; hence, we can write x_i(l_v) as a shorthand notation. Given some δ_1 > 0. Then ε_0 can be found such that |x_i(l_v) − x_i^0(l_v)| ≤ δ_1 for all inputs l_v ∈ L′. This follows immediately because of the uniform convergence of (σ(x/ε_0) − l_σ)/(h_σ − l_σ) to H if a neighborhood of 0 is not contained in the inputs.

2. Given some neuron x_i with linear activation function. Assume l_v are the inputs of x_i directly referring to input entries, and assume a = (a_1, …, a_{i_0}) are the additional inputs referring to outputs of other neurons. We write x_i(l_v, a), for convenience. Given some δ_1 > 0. Then we can find ε_0 and δ_2 such that |x_i(l_v, a) − x_i^0(l_v, a′)| ≤ δ_1 for all a, a′ with a_i, a_i′ ∈ [−C_0 − δ_0, C_0 + δ_0], |a_i − a_i′| ≤ δ_2, and labels l_v in L′.
This follows from the fact that x_i is continuous and, hence, uniformly continuous on compacta, and (σ(x_0 + ε · x) − σ(x_0))/(ε · σ′(x_0)) converges to id uniformly on compacta; hence, x_i^0 converges to x_i uniformly on compacta. We therefore find ε_0 and δ_2 such that |x_i(l_v, a) − x_i^0(l_v, a′)| ≤ |x_i(l_v, a) − x_i(l_v, a′)| + |x_i(l_v, a′) − x_i^0(l_v, a′)| ≤ δ_1/2 + δ_1/2 for |a_i − a_i′| ≤ δ_2.

3. Given some neuron x_i = x_i^0 with activation function σ. By construction, x_i is functionally dependent only on outputs of other neurons. Denote, for convenience, the direct functional dependence by x_i(a). Given some δ_1 > 0. Then we can find δ_2 such that |x_i(a) − x_i(a′)| ≤ δ_1 for all a, a′ with a_i, a_i′ ∈ [−C_0 − δ_0, C_0 + δ_0] and |a_i − a_i′| ≤ δ_2. This follows immediately from the fact that σ is continuous, hence uniformly continuous on compacta.
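The two limit properties that drive these substitutions (the rescaled sigmoid approximating the identity, and the sharpened sigmoid approximating the Heaviside function H) can be checked numerically. A minimal Python sketch with the standard logistic function (the helper names are our own):

```python
import math

def sigma(x):
    # standard logistic squashing function, with limits l_sigma = 0, h_sigma = 1
    # (written in a numerically stable form for large |x|)
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def approx_id(x, eps, x0=0.0):
    # (sigma(x0 + eps*x) - sigma(x0)) / (eps * sigma'(x0)) -> x as eps -> 0
    dsigma = sigma(x0) * (1.0 - sigma(x0))  # derivative of the logistic at x0
    return (sigma(x0 + eps * x) - sigma(x0)) / (eps * dsigma)

def approx_H(x, eps):
    # (sigma(x/eps) - l_sigma) / (h_sigma - l_sigma) -> H(x) for x != 0
    return sigma(x / eps)
```

For small ε, `approx_id` reproduces its argument and `approx_H` saturates to 0 or 1, in line with the uniform convergence statements above.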
These properties allow us to substitute each single activation function and control the effect of this substitution. We now formally unfold the RCC for all possible sequences of length at most h, obtaining several feedforward computations composed of functions computed by single neurons. For each such feedforward computation, we start at the right-most neuron and consider the neurons in the order of their appearance in this feedforward computation from right to left (thereby visiting the original neurons more than once, since recurrence accounts for duplicates). The final output should deviate from the original one by at most δ/2 on sequences with entries in L′. Starting with δ_1 = δ/2, we determine for each neuron the value ε_0 and a number δ_2 > 0 such that the following holds: if the inputs that come from other neurons deviate by no more than δ_2 and the activation function is substituted by the above term including ε_0, the output deviates by at most δ_1. For the three types of neurons, this can be achieved due to properties 1 through 3 as explained above. We can then continue with the neurons to the left, setting δ_1 to the minimum of the found δ_2 and δ_0. (The latter bound δ_0 thereby ensures that values lie in the compactum [−C_0 − δ_0, C_0 + δ_0] on which uniform approximation is guaranteed.) So we collect a number ε_0 for each copy of a neuron such that this choice guarantees appropriate approximation of the whole function. We can now choose for the entire network the minimum of all values ε_0 as a uniform value that simultaneously fulfills all constraints.

Finally, it remains to show that the network that involves the above complicated activation functions can be interpreted as an RCC network with linear output and hidden neurons with activation function σ. Note that the above terms are both of the form a · σ(b · x + c) + d, for real values a, b, c, d.
We can simply integrate the term c into the bias of the respective neuron, b into the weights, a into the weights of all neurons that have direct connections from the neuron, and d into the bias of these neurons. Note that the last neuron of the RCC network (which does not possess a successor) is linear; hence, no substitution has to be done at this point. Note furthermore that outgoing connections include recurrent connections. (We do not have to deal with cycles because the integration of a and b is done only once for each connection of the network.) Hence, the initial values for the empty sequence x_i(l_0) are affected by this change, too. We account for this fact by resetting the term to (x_i(l_0) − d)/a. Hence, RCC networks constitute universal approximators for sequences.

3.2 Recursive CC. Referring to the above proof, the essential steps to show approximation completeness for RecCC networks for input tree structures are (I) the construction of a unique linear code for tree structures over
a finite set L and (II) the construction of a network with possibly multiplicative neurons and appropriate activation functions that can compute these codes. Then construction steps III to V can be transferred from the previous case. The linear representation of trees is slightly more difficult than the representation of sequences, since trees are usually represented as abstract nonlinear data types. The prefix notation of trees offers one possibility, which we use in the following. Assume L is a finite set that does not contain the symbols "(", ")", and ",". Assume ξ and f are additional symbols not contained in L. Define for each tree t over L with root vertex v and subtrees t_1 and t_2 the term representation rep(t) recursively over the structure as follows:

rep(ξ_t) = ξ,
rep(t) = f(l(v), rep(t_1), rep(t_2)).

This representation is obviously unique, and hence step I is achieved. Step II can be done as follows: assume L = {1, …, r}, assume n_f and n_ξ are numbers not contained in L, assume the maximum length of these numbers is L, and assume the numbers are extended by leading entries 0 to occupy exactly L positions. Denote by nrep(t) the number that we obtain if we substitute in rep(t) the symbol f by n_f and ξ by n_ξ and extend the numbers from L by leading entries 0. Then nrep(t) is unique, too. Note that it is not necessary to introduce symbols for "(", ")", and ",", because we deal with only one function symbol. We now show that two multiplicative hidden neurons x_1 and x_2 of a RecCC network with the identity as activation function can be constructed that compute x_1(t) = 0.1^{length(nrep(t))} and x_2(t) = 0.nrep(t), where length denotes the length of the number including leading entries 0. We set

x_1(ξ_t) = 0.1^L,
x_2(ξ_t) = 0.n_ξ.
Assume the root vertex of a tree t is v and the two subtrees are t_1 and t_2. The functional dependence of x_1 is x_1(t) = τ_1(l(v), x_1(t_1), x_1(t_2)), where by structural induction, x_1(t_1) and x_1(t_2) represent the terms
0.1^{length(nrep(t_1))} and 0.1^{length(nrep(t_2))}. Hence, the choice τ_1(a_1, a_2, a_3) = 0.1^{2L} · a_2 · a_3 gives the appropriate length. The functional dependence of x_2 is x_2(t) = τ_2(l(v), x_1(t), x_1(t_1), x_2(t_1), x_1(t_2), x_2(t_2)), where by structural assumption, x_2(t_1) and x_2(t_2) constitute 0.nrep(t_1) and 0.nrep(t_2). Hence, the function τ_2(a_1, …, a_6) = 0.1^L · n_f + 0.1^{2L} · a_1 + 0.1^{2L} · a_4 + 0.1^{2L} · a_3 · a_6 gives 0.nrep(t). Note that these functions can be computed by multiplicative neurons of order 2 with the linear activation function. Thus, in contrast to the case of sequences, we deal with multiplicative neurons in these approximation results. Now we can proceed with steps III, IV, and V just as in the case of RCC networks, and we obtain as a result that theorems 1 through 3 also hold for tree structures instead of sequences, that is, for functions f : L_2* → R, whereby the RecCC network has multiplicative neurons. Due to the construction, we use multiplicative neurons of order 2 in this case. In practical applications, usually standard neurons instead of multiplicative neurons are used. One can directly substitute the multiplication if, instead of single neurons, a feedforward network is attached in each recursive construction step; that is, τ_i may be computed by a feedforward network with a fixed number of neurons. Note that an activation function for which some point x_1 exists such that σ is twice continuously differentiable in a neighborhood of x_1 with nonvanishing second derivative σ″(x_1) ≠ 0 can be used to approximate the function x² in the following way:

(σ(x_1 + ε · x) + σ(x_1 − ε · x) − 2 · σ(x_1)) / (ε² · σ″(x_1)) → x²   (ε → 0),
with uniform convergence on compacta. We refer to this property as σ being locally C². In addition, multiplication can be substituted by the square using the identity x · y = ((x + y)² − (x − y)²)/4. Note that the property locally C² implies σ being locally C¹. Analogous to theorem 3, we can thus substitute each function computed by a multiplicative neuron τ_i by a feedforward network, which in this case contains only one hidden layer with eight hidden neurons and activation function σ.⁷

⁷ We do not spell out this remark for trees but formalize the result later for the case of graph structures.
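The prefix coding of trees used in this construction can be sketched in a few lines of Python (a rough illustration; the dictionary of digit codes and the helper names are our own choices, not part of the construction):

```python
def rep(t):
    # prefix term representation of a binary tree;
    # t is None (the empty tree, symbol xi) or a triple (label, left, right)
    if t is None:
        return "xi"
    label, left, right = t
    return "f({},{},{})".format(label, rep(left), rep(right))

def nrep(t, codes, width):
    # numeric representation: every symbol is replaced by a fixed-width
    # digit code (padded with leading zeros to exactly `width` positions)
    if t is None:
        return str(codes["xi"]).zfill(width)
    label, left, right = t
    return (str(codes["f"]).zfill(width) + str(codes[label]).zfill(width)
            + nrep(left, codes, width) + nrep(right, codes, width))
```

Because there is a single function symbol and all codes have fixed width, the number nrep(t) can be decoded unambiguously, mirroring the uniqueness argument above.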
RecCC networks are usually not complete if they deal with DPAGs as inputs that are first transformed to a tree structure. However, some specific settings can be identified where uniqueness of representation is ensured and hence approximation is possible, such as the case of DPAGs in which no two vertices contain the same label. In this case, the label can also serve as an identifier of the vertex, and the original graph can easily be reconstructed from the tree.

We have considered only supersource transductions so far. For tree structures, IO isomorphic transductions, as a generalization of the simpler supersource transductions, are interesting. This means that an input tree is mapped to an output tree of the same structure, whereby the labels are substituted by values computed by the transduction. Thereby, the output label l′(v) of a vertex v might depend on the vertex v and the whole tree structure. We can use RecCC networks for IO isomorphic transductions by substituting the label at each vertex by the output computed with the network for the subtree rooted at this vertex. It is an immediate consequence of the principled dynamics of RecCC networks and our previous approximation results that exactly those (measurable) IO isomorphic transductions can be approximated for which a strict causality assumption applies: the output label of a vertex can depend on the subtree rooted at the vertex but not on any other vertex of the tree structure. Since the approximation of functions that fulfill this causality assumption can be reduced to the problem of approximating a supersource transduction for any given subtree of the considered trees, the approximation completeness is obvious in this case. Conversely, the RecCC dynamics is based on this causality assumption with respect to functional dependencies of vertices. Thus, no functions violating the causality assumption can be approximated by a RecCC network.

3.3 Contextual RecCC.
CRecCC networks integrate contextual information, since hidden neurons have access to the activations of previous neurons at the parents. This possibility extends the functions that can be addressed with these networks to functions on graphs and IO isomorphic transductions, as we will see. However, the situation is more complicated for DPAGs than for tree structures, because it is not clear how to represent DPAGs in a linear code based on the information available within a contextual recursive neuron. We now discuss this issue in detail and develop a code for a subset of DPAGs. The context integrated in CRecCC neurons is of crucial importance. The available context of vertex v at neuron i can, for example, be measured in terms of the vertices of the DPAG that contribute directly or indirectly to x_i(v). This context is precisely computed in Micheli et al. (2003). As demonstrated in Micheli et al. (in press) and Micheli (2003), this contextual information can in principle be used to compute certain transductions on graphs that require global knowledge of the structure and hence cannot be computed with simple recursive cascade correlation networks.
Figure 2: Example of two graph structures that cannot be differentiated by RecCC networks if the labels of the vertices at each layer are identical.
Hence, the functions that can be computed by RecCC networks on DPAGs form a proper subset of the functions that can be computed by CRecCC networks; for example, the two structures depicted in Figure 2 cannot be differentiated by a RecCC network, but they can be differentiated by a CRecCC network. RecCC networks constitute universal approximators for tree structures, and RecCC networks and CRecCC networks do not cover different function classes if restricted to tree structures of limited height. For DPAGs, however, CRecCC networks are strictly more powerful. Here we go a step further and identify a subset of DPAGs that fulfill a canonical structural constraint, and we show that CRecCC networks possess the universal approximation property for these structures. In addition, we construct a simple counterexample showing that there exist structures outside this class that cannot be distinguished even by CRecCC networks, regardless of the chosen functions of the neurons. As before, we restrict ourselves to DPAGs where the fan-in and fan-out are bounded by two.

3.3.1 A Counterexample. The question arises whether CRecCC networks can differentiate all possible DPAG structures. The following example shows that this is not the case. However, this example requires a highly symmetrical structure where different parts of the DPAG cannot be differentiated by the structure or the labels of the vertices. Hence, the example can be regarded as an artificial one.

Theorem 4. There exist graphs in DPAG that cannot be distinguished by any CRecCC network.

The counterexample is shown in Figure 3. The formal proof that these two structures cannot be differentiated by a CRecCC network is in the appendix.
Figure 3: Example of two graphs mapped to identical values with CRecCC networks for a noncompatible enumeration of parents. The enumeration of parents and children is indicated by boxes at the vertices; the left box is number one and the right box number two.

Note that this construction would not have been possible if a different enumeration of children and parents had been chosen: consider the edges (0, 4) and (0′, 4′); 4 and 4′ constitute the second child of 0 and 0′, respectively. Assume we had enumerated the parents in such a way that 0 and 0′ also constituted the second parent of 4 and 4′, respectively. Then the proof for the counterexample in theorem 4 would no longer hold: the vertices 1 and 4 could then be differentiated by an appropriate choice of the weights and activation functions of the neurons, and the same would hold for 1′ and 4′. This can be seen from the activation of the second hidden neuron, whose functional dependence is

x_2(1) = τ_2(l, x_1(1), x_1(2), x_2(2), x_1(3), x_2(3), x_1(0), 0_p^1)
       ≠ τ_2(l, x_1(1), x_1(2), x_2(2), x_1(3), x_2(3), 0_p^1, x_1(0)) = x_2(4).

This inequality holds for appropriate τ_2 because the inputs 0_p^1 and x_1(0) are given at different positions. The possibility of differentiating these two vertices leads to the possibility of differentiating all vertices i in one of the structures, and finally to the possibility of differentiating the two structures, since the vertices are connected in a different way in the two structures. This example leads to a simple structural condition ensuring that different structures can be differentiated by CRecCC networks.

3.3.2 Identification of a Subclass of DPAGs. Motivated by the previous example, we now identify a subclass of DPAGs for which such an example cannot be found and which allows a linear encoding of the structures. The following property constitutes one sufficient condition to ensure this; it is not a necessary condition. However, it can easily be tested and often does not put severe constraints on the respective application.
Definition 1. Assume D is a DPAG. We say that the ordering of the parents and children is compatible if for each edge (u, v) of D the equality P_v(u) = C_u(v) holds.

This property states that if vertex u is enumerated as parent number i of v, then v is enumerated as child number i of u, and vice versa. We call the subset of DPAGs with compatible positional ordering rooted directed bipositional acyclic graphs (DBAGs). Note that the above counterexample used elements in DPAG\DBAG. A different enumeration as a DBAG would have been possible in this case, such that the structures could be distinguished. We now investigate in which sense the requirement of a compatible enumeration restricts the considered structures. By definition, DPAGs with compatible positioning necessarily have the same fan-in and fan-out. However, we can always extend a DPAG with fan-in k_1 ≠ fan-out k_2 by empty children and parents, respectively, such that the fan-in and fan-out coincide. Moreover, the following theorem shows that any given DPAG with fan-in and fan-out limited by two can always be transformed into a DBAG via reenumeration of children and parents.

Theorem 5. Given a DPAG D with fan-in and fan-out limited by two, we can find an enumeration P_v and C_v of parents and children for each vertex v such that the resulting DPAG constitutes a DBAG with respect to the enumerations P_v and C_v
for vertices v of the structure.

The proof is in the appendix. It holds only for fan-in and fan-out two. For larger fan-in and fan-out, an analogous result can be stated if, in addition, the vertices may be expanded by empty children and parents.

Theorem 6. Assume D is a DPAG with fan-in and fan-out limited by at most k. Then we can find a DBAG that arises from D by an expansion of the vertices of D with empty children and parents to fan-in and fan-out 2k − 1 and a reenumeration of children and parents.

The proof is in the appendix. We conjecture that D can also be transformed into a DBAG via reenumeration of children and parents with fan-in and fan-out k and without expanding. However, we could find neither a proof nor a counterexample for this conjecture. In applications such as chemistry, the positioning of children and parents is often done only by convention, without referring to semantic implications, as in the case of directed acyclic graphs (without prior positioning). Hence, restricting to DBAGs does not limit the applicability of CRecCC networks in these cases. In alternative scenarios such as logic, the positioning might be crucial, and it might not be possible to change the positioning
without affecting the semantics. A simple example of this fact is the term implies(loves(Bill,Mary),smiles at(Bill,Mary)), or, as a whole sentence, "If Bill loves Mary then Bill smiles at Mary." This cannot be transferred into a DBAG unless "Bill" becomes the first child of the first symbol "loves" and the second child of the second symbol "smiles at", or vice versa. That is, we would obtain "If Bill loves Mary then Mary smiles at Bill" or "If Mary loves Bill then Bill smiles at Mary," both nice sentences but with a different meaning from the original one. We will see that alternative guarantees for the universal approximation ability can be found and that a large number of structures in DPAG\DBAG can be appropriately processed as well. This alternative characterization is not as straightforward as the restriction to DBAGs, which constitutes a purely structural property and can easily be tested.

3.3.3 Encoding of DBAGs and Certain DPAGs. Next we introduce a recursive mapping of vertices in a DPAG to terms that describe information contained in the DPAG. The purpose of this transformation is to obtain linear representations of DPAGs. It will turn out that these linear representations characterize DBAGs up to a certain height uniquely. Naturally, the representation also characterizes a large number of DPAGs uniquely, and approximation completeness will hold for these DPAGs, too. However, whether this holds for a given set of DPAGs needs to be tested for each application; for DBAGs, it is always valid because of the structural restrictions. Assume L is a finite set that does not contain the symbols "(", ")", and ",". Assume c, p, and f are additional symbols not contained in L.
Define for each i ≥ 0 and vertex v in a given DPAG the term representation rep_i(v) recursively over i and the structure as follows:

rep_i(ξ_c) = c,
rep_i(ξ_p) = p,
rep_1(v) = f(l(v), rep_1(ch_1(v)), rep_1(ch_2(v))),
rep_i(v) = f(l(v), rep_i(ch_1(v)), rep_i(ch_2(v)), rep_{i−1}(pa_1(v)), rep_{i−1}(pa_2(v))), for i > 1.

This representation contains all the information available to a CRecCC network, since the following result holds:

Theorem 7. Assume D and D′ are DPAGs that can be differentiated by hidden neuron h of some CRecCC network. Then rep_h(root(D)) and rep_h(root(D′)) are different.

Proof. The functional dependence of CRecCC neurons has the form x_i(v) = τ_i(l(v), x_1(v), …, x_{i−1}(v), x_1(ch_1(v)), …, x_i(ch_1(v)), x_1(ch_2(v)), …, x_i(ch_2(v)), x_1(pa_1(v)), …, x_{i−1}(pa_1(v)), x_1(pa_2(v)), …, x_{i−1}(pa_2(v))).
Hence, all available information for neuron i is contained in the term that we obtain if we evaluate l(v) to the respective label and all other terms to symbols in the corresponding Herbrand algebra. We denote the evaluation of l(v) by the same symbol for simplicity; x_i(ξ_p) is evaluated to the constant 0_p^i, and x_i(ξ_c) is evaluated to the constant 0_c^i. We denote the resulting term for x_i(v) by rep̄_i(v). Two DPAGs that can be differentiated by hidden neuron h lead to different representations rep̄_h(root(D)) and rep̄_h(root(D′)). We now show that if, for vertices v of D and v′ of D′ and some i, the terms rep̄_i(v) and rep̄_i(v′) differ, then so do rep_i(v) and rep_i(v′). This is done by simultaneous induction over i and the structure of the terms: if one of the vertices v or v′ is an empty child or parent, we get by definition 0_p^i or 0_c^i for the representation rep̄_i. However, rep_i also gives a unique identifier p or c for the empty parent or child, respectively. So we can assume that v and v′ are not empty.

i = 1: We obtain rep̄_1(v) = τ_1(l(v), rep̄_1(ch_1(v)), rep̄_1(ch_2(v))) and rep̄_1(v′) = τ_1(l(v′), rep̄_1(ch_1(v′)), rep̄_1(ch_2(v′))). Hence, one of the three subterms is different: l(v) ≠ l(v′), or rep̄_1(ch_1(v)) ≠ rep̄_1(ch_1(v′)), or rep̄_1(ch_2(v)) ≠ rep̄_1(ch_2(v′)). In the latter two cases, we can assume by structural induction that rep_1(ch_1(v)) ≠ rep_1(ch_1(v′)) or rep_1(ch_2(v)) ≠ rep_1(ch_2(v′)) holds. rep_1 evaluates to rep_1(v) = f(l(v), rep_1(ch_1(v)), rep_1(ch_2(v))) and rep_1(v′) = f(l(v′), rep_1(ch_1(v′)), rep_1(ch_2(v′))), and therefore also gives two different terms.

Inductive step for i: We find
rep̄_i(v) = τ_i(l(v), rep̄_1(v), …, rep̄_{i−1}(v),
rep̄_1(ch_1(v)), …, rep̄_i(ch_1(v)), rep̄_1(ch_2(v)), …, rep̄_i(ch_2(v)),
rep̄_1(pa_1(v)), …, rep̄_{i−1}(pa_1(v)), rep̄_1(pa_2(v)), …, rep̄_{i−1}(pa_2(v)))
and a similar term for rep̄_i(v′). If these terms are different, one of the subterms differs. Hence, one of the following holds:
1. l(v) ≠ l(v′).
2. rep̄_j(ch_1(v)) ≠ rep̄_j(ch_1(v′)) for some j ≤ i.
3. rep̄_j(ch_2(v)) ≠ rep̄_j(ch_2(v′)) for some j ≤ i.
4. rep̄_j(pa_1(v)) ≠ rep̄_j(pa_1(v′)) for some j < i.
5. rep̄_j(pa_2(v)) ≠ rep̄_j(pa_2(v′)) for some j < i.

In cases 2 and 3, we can assume by structural assumption that rep_j gives different values for the respective terms; in cases 4 and 5, we can assume this fact because of the inductive hypothesis. If rep_j yields different values for two terms, then so does rep_{j′} for all j′ ≥ j, as we will show in lemma 2. Hence we find that the terms rep_i(v) = f(l(v), rep_i(ch_1(v)), rep_i(ch_2(v)), rep_{i−1}(pa_1(v)), rep_{i−1}(pa_2(v))) and rep_i(v′) = f(l(v′), rep_i(ch_1(v′)), rep_i(ch_2(v′)), rep_{i−1}(pa_1(v′)), rep_{i−1}(pa_2(v′))) are also different.

We postponed in theorem 7 the proof of the following lemma:

Lemma 2. Assume D and D′ are DPAGs with vertex v in D and v′ in D′, respectively. Assume rep_i(v) ≠ rep_i(v′) for some i. Then rep_j(v) ≠ rep_j(v′) for all j ≥ i.

Proof. The proof is by simultaneous induction over j and the structure. For j = 1, i = 1 also holds, and the result is obvious. Inductive step for j: If one of the vertices, say, v, is empty, the result is obvious, because rep_j(v) yields p or c, respectively, and rep_j(v′) yields a different value by definition for all j unless v′ also constitutes an empty child or parent. Assume both vertices are not empty. Assume without loss of generality that j > i and rep_i(v) ≠ rep_i(v′). The representations are of the form rep_i(v) = f(l(v), rep_i(ch_1(v)), rep_i(ch_2(v)), rep_{i−1}(pa_1(v)), rep_{i−1}(pa_2(v))) and rep_i(v′) = f(l(v′), rep_i(ch_1(v′)), rep_i(ch_2(v′)), rep_{i−1}(pa_1(v′)), rep_{i−1}(pa_2(v′))).
(The last two subterms are missing for i = 1.) Hence, one of the subterms differs:
1. l(v) ≠ l(v′),
2. rep_i(ch_1(v)) ≠ rep_i(ch_1(v′)),
3. rep_i(ch_2(v)) ≠ rep_i(ch_2(v′)),
4. rep_{i−1}(pa_1(v)) ≠ rep_{i−1}(pa_1(v′)), or
5. rep_{i−1}(pa_2(v)) ≠ rep_{i−1}(pa_2(v′)).

In cases 2 and 3, we can assume by structural assumption that the same holds for the index j. In the last two cases, we can assume by the inductive hypothesis that the same also holds for the index j − 1. In all cases, we obtain at least one subterm of rep_j(v) that is different from the respective subterm of rep_j(v′); hence, rep_j(v) ≠ rep_j(v′) holds.

Therefore, CRecCC networks can constitute universal approximators at most on DPAGs for which rep_i yields a unique representation for some i, by definition of the functional dependencies. We will now show that the mapping that maps DBAGs of height at most h to rep_{h+1}(v), v denoting the root of the respective DBAG, is injective. As a consequence, all the information available in a given DBAG is provided to a CRecCC network via the defined dynamics. For this purpose, we describe an algorithm that allows recovering the original DBAG from the respective linear representation. Note that each term rep_i(v) can be uniquely decomposed into its subterms. The reconstruction proceeds in several steps.

Definition 2. Assume a path of the DBAG D that starts at the root is given and consecutively visits the vertices v_{i_0}, …, v_{i_h}. A linear path representation consists of a string s = s_1 … s_h ∈ {1, 2}*, where s_j = C_{v_{i_{j−1}}}(v_{i_j}) for all j = 1, …, h.

Hence a linear path representation collects which child of the previously visited vertex is visited at each position in the path: s_j = 1 indicates that the left child of the actual vertex in the path is visited; s_j = 2 indicates that the next vertex is given by the right child. We assume the convention that the empty string represents the path of length 0, which stays at the root.
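Linear path representations can be computed directly. A small Python sketch (the graph encoding and helper names are our own choices): a structure is given by a children map, and for every vertex we collect the strings over {1, 2} that describe all rooted paths to it.

```python
def all_paths(children, root):
    # collect, for every vertex, the set of linear path representations of
    # all root-to-vertex paths: "1" = left child taken, "2" = right child
    reps = {}
    def walk(v, s):
        reps.setdefault(v, set()).add(s)
        for pos, child in enumerate(children[v], start=1):
            if child is not None:
                walk(child, s + str(pos))
    walk(root, "")  # the empty string is the length-0 path staying at the root
    return reps
```

For a diamond-shaped structure (root 0 with children 1 and 2, both pointing to vertex 3), vertex 3 receives the two path representations "11" and "21", and no path string occurs for more than one vertex, in line with the injectivity observation above.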
Obviously, the mapping of rooted paths to linear path representations is injective.

Lemma 3. For any vertex v of height at most i in a DBAG D, it is possible to recover the linear path representations of all paths from the root of D to v from rep_{i+1}(v).

Proof. The proof is via induction. If i = 1, v can only be the root of the DBAG, with the linear path representation of the path of length 0. If i > 1, then we find rep_{i+1}(v) = f(…, rep_i(pa_1(v)), rep_i(pa_2(v))).
Note that we can decide whether the left or right parent is empty, since p is a symbol not contained in L. If a parent is not empty, we can recover the linear path representations of all paths from the root to pa_1(v) and pa_2(v) from rep_i(pa_1(v)) or rep_i(pa_2(v)), respectively, by the inductive hypothesis, because the heights of pa_1(v) and pa_2(v) are smaller than i. Note that the DBAG has a compatible positioning of children and parents. Hence, we obtain the linear path representations of all paths from the root to v in the following way: we extend the linear path representations of the paths to pa_1(v) (if this parent exists) by the letter 1. To this set we add the linear path representations of the paths to pa_2(v) (if existing), extended by the letter 2.

Lemma 3 tells us that we can recover all paths from the root to a vertex from the given linear representation. Note that the linear path representation of a rooted path uniquely determines the vertex it leads to. The next lemma extends this result to the extraction of all vertices and their respective paths from the linear representation.

Lemma 4. For a DBAG D of height at most h, it is possible to extract from rep_{h+1}(root(D)) the set S = {(l(v), B) | v ∈ vert(D), B ⊂ {1, 2}*, B is ordered lexicographically, B contains the linear path representations of all paths from the root to v in D}.

Proof. For an empty DBAG, the term rep_{h+1}(root(D)) equals c, and we simply extract the empty set. Otherwise, we can start with the empty set S and iteratively add tuples (l(v), B) to S that collect all vertices v of the DBAG and the sets of linear path representations of the vertices. This starts from the root and the given term rep_{h+1}(root(D)), and recursively extracts vertices and subterms that describe all paths for these vertices until we arrive at the leaves. The procedure is as follows. Note that all vertices in the DBAG have height at most h by assumption.
Given the term rep_{h+1}(v) = f(l(v), rep_{h+1}(ch_1(v)), rep_{h+1}(ch_2(v)), . . .), we recover the vertex v: its label l(v) can be obtained from the term rep_{h+1}(v). The linear path representations of all paths from the root to v can be recovered because of lemma 3. We sort the linear path representations lexicographically, obtaining B, and add the tuple (l(v), B) to the set S. Note that this tuple might already be contained in S if vertex v can be reached by more than one path from the root. In this case, S is not changed, since we take the union and since the linear path representation of a path to a vertex characterizes the vertex uniquely. We can decide, given rep_{h+1}(v), whether a left and/or right child of v exists, because empty children are represented by the unique symbol c as the second or third subterm of rep_{h+1}(v). If nonempty children exist, we continue with each of the two children and the representations
1142
B. Hammer, A. Micheli, and A. Sperduti
rep_{h+1}(ch_1(v)) and/or rep_{h+1}(ch_2(v)), respectively. Both representations can be obtained from rep_{h+1}(v). This iterative step yields all vertices that are direct or indirect successors of the previous vertex v. Thus, the entire structure is visited during this iterative process, starting from the root of D.

Hence, we finally obtain the uniqueness of the linear representation for DBAGs:

Theorem 8. The mapping that maps DBAGs D over L of height at most h to rep_{h+1}(root(D)) is injective.

Proof. Because of lemma 4, we can recover from the given linear representation, for any DBAG D of height at most h and each vertex of D, a tuple consisting of the label of the vertex and the linear path representations of all paths from the root to the respective vertex. This information uniquely characterizes D because the DBAG can easily be reconstructed from it. We introduce a vertex for each tuple and assign the given label to it. The difficulty is that the linear path representations of the paths do not include identifiers for the vertices we have just introduced. However, the root is uniquely characterized by the fact that the only path to the root is the empty one. We can then iteratively add all edges of the DBAG, together with the positioning of children and parents implicitly contained in the linear path representations, as follows: we start with the disconnected vertices and iteratively add connections from the root toward the leaves. In each iterative step, we consider all paths starting from the root in the already constructed part, and we choose a shortest path among the linear path representations of vertices for which not all edges have been included yet (this property can be tested, since we can identify the root and afterward follow the successors given by the identifiers of the linear path representation). Because we choose a shortest such path, only the last edge of the path, say (u, v), is not yet included in the DBAG.
We can uniquely identify u via the first part of the linear path representation. The identifier of v is known (since the path is contained in the tuple that describes v), and hence we can connect u and v in the DBAG, that is, introduce the edge and assign the positioning indicated in the linear path representation to u and v. Since DBAGs are considered, the positioning of children, which is given in the linear path representation, also determines the positioning of parents.

The representation rep_i preserves all information available to a CRecCC network, as we have shown in theorem 7. Hence, CRecCC networks with an appropriate choice of weights and activation functions (such that different terms rep_i are mapped to different values also by the CRecCC network) can exactly encode all DPAGs for which rep_i yields a unique representation for large enough i. We will later show that multiplicative CRecCC neurons with a standard activation function can perform this task.
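To make the encoding concrete, here is a small sketch in Python (our own toy version, not code from the article): it realizes rep_i(v) as nested strings via the recursion rep_1(v) = f(l(v), rep_1(ch_1(v)), rep_1(ch_2(v))) and rep_{i+1}(v) = f(l(v), rep_{i+1}(ch_1(v)), rep_{i+1}(ch_2(v)), rep_i(pa_1(v)), rep_i(pa_2(v))), with empty children and parents encoded by the symbols c and p, and checks that rep_{h+1}(root(D)) separates two small DBAGs, in the spirit of theorem 8. All names here (`make_rep`, the adjacency dictionaries) are our own.

```python
# Toy realization of rep_i(v) as nested strings; None marks an empty child/parent.
def make_rep(label, ch, pa):
    def rep(i, v, as_parent=False):
        if v is None:
            return "p" if as_parent else "c"
        if i <= 1:
            # rep_1(v) = f(l(v), rep_1(ch1(v)), rep_1(ch2(v)))
            return f"f({label[v]},{rep(1, ch[v][0])},{rep(1, ch[v][1])})"
        # rep_i(v) = f(l(v), rep_i(ch1), rep_i(ch2), rep_{i-1}(pa1), rep_{i-1}(pa2))
        return (f"f({label[v]},{rep(i, ch[v][0])},{rep(i, ch[v][1])},"
                f"{rep(i - 1, pa[v][0], True)},{rep(i - 1, pa[v][1], True)})")
    return rep

# Two DBAGs over the same label multiset: a "V" shape and a chain.
rep1 = make_rep({0: "a", 1: "a", 2: "a"},
                {0: (1, 2), 1: (None, None), 2: (None, None)},
                {0: (None, None), 1: (0, None), 2: (0, None)})
rep2 = make_rep({0: "a", 1: "a", 2: "a"},
                {0: (1, None), 1: (2, None), 2: (None, None)},
                {0: (None, None), 1: (0, None), 2: (1, None)})

# Both DBAGs have height at most 2, so rep_3(root) must separate them.
assert rep1(3, 0) != rep2(3, 0)
```

The recursion terminates because the child recursion stays on a finite acyclic structure and the parent recursion strictly decreases the index i.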
Since DBAGs can be uniquely reconstructed from rep_i, CRecCC networks are complete with respect to DBAGs. Note that the reconstruction of a DPAG from the encoding rep_i is in general not possible, because forward paths in a DPAG together with the respective positioning of the vertices (i.e., linear path representations) cannot be obtained from the backward connections from the children toward their parents, as was done in lemma 3. Here, backward connections are connections from a vertex to a parent (together with the respective positioning), and forward connections are connections from a vertex to a child (together with the respective positioning). Nevertheless, even for a DPAG we can extract all paths that start at the root, follow a certain number of forward and/or backward connections, and end up at some vertex. This is possible because a forward connection starting from v can be obtained from rep_i(ch_1(v)) and rep_i(ch_2(v)), while a backward connection can be obtained from rep_{i−1}(pa_1(v)) and rep_{i−1}(pa_2(v)), whereby full information is available for large enough i. Hence, for large enough i, rep_i can differentiate between any two DPAGs such that at least one path starting from the root exists (following forward and/or backward connections) that leads to a vertex with a different label or a different number of children or parents in the two DPAGs. Theorem 4 shows that this is not always possible. However, we claim that the capacity of CRecCC networks to differentiate between these DPAGs (and to approximate functions on these DPAGs, as we will see later) is sufficient for practical applications, because such highly symmetrical structures usually do not occur in practice.

3.3.4 Approximation with a CRecCC Network. We will now show that specific CRecCC networks can "compute" the representation rep_i in some form. Hence, they can uniquely encode DBAGs (and those DPAGs for which the encoding is unique for some i) into a real vector.
This gives us step II of the general proof of approximation completeness. We restrict the presentation in the next section to DBAGs for convenience, keeping in mind that similar constructions can be done for certain DPAGs, too.

Lemma 5. Assume L = {1, . . . , r}. Assume h ∈ N. Then a CRecCC network with multiplicative neurons of order at most 4 and linear activation functions with 2(h + 1) hidden neurons can be found such that the output x_{2(h+1)}(root(D)) yields unique representations of DBAGs D over L of height at most h.

Proof. Fix pairwise different numbers n_f, n_,, n_(, n_), n_p, and n_c not contained in {1, . . . , r}. Assume that 0 is not contained in this set of numbers. Assume L is the maximum length of the digital representation of the above symbols (including the elements of L). We assume in the following that all numbers are extended by leading entries 0 such that they occupy precisely L places. We can then uniquely represent the term rep_i(v) by a number: we set n_c for c and n_p for p. A general term rep_i(v) is represented by the number that we obtain by substituting in rep_i(v) the symbol "f" by n_f, "(" by n_(, "," by n_,, ")" by
n_), l(v) by the respective element of L, and the contained references to rep by the recursively constructed numbers (always including leading entries 0). We refer to the number that represents rep_i(v) by the symbol nrep_i(v). We will now iteratively construct hidden neurons x_{2i−1} and x_{2i} for i ≥ 1 such that

x_{2i−1}(v) = exp_{0.1}(length(nrep_i(v)))

and

x_{2i}(v) = 0.nrep_i(v).

The latter term denotes the decimal number smaller than 1 with digits nrep_i(v), and length(nrep_i(v)) denotes the length of the term, including leading entries 0. Since the representation rep_{h+1}(root(D)) is unique for DBAGs of height at most h, the output x_{2(h+1)}(root(D)) of hidden neuron 2(h + 1) takes unique values on these DBAGs.

According to our definition, x_i(ξ_p) = 0_p^i and x_i(ξ_c) = 0_c^i are fixed values for all i. We choose 0_p^i = 0_c^i = (0.1)^L for odd i and 0_p^i = 0.n_p and 0_c^i = 0.n_c for even i. Assume in the following that v is nonempty.

For the first hidden neuron, we find x_1(v) = τ_1(l(v), x_1(ch_1(v)), x_1(ch_2(v))). Assume l_1 and l_2, respectively, denote the lengths of nrep_1 of the first and second child of v. Since the term rep_1(v) = f(l(v), rep_1(ch_1(v)), rep_1(ch_2(v))) consists of six symbols and two subterms that represent ch_1(v) and ch_2(v), the length of rep_1(v) is given by 6L + l_1 + l_2. Hence, the choice τ_1(a_1, a_2, a_3) = (0.1)^{6L} · a_2 · a_3 gives the desired output, since via structural induction, the latter two arguments can be identified with a_2 = (0.1)^{l_1} and a_3 = (0.1)^{l_2}. Note that this function can be computed by a multiplicative neuron of order 2 with linear activation function.

The functional dependence of the second hidden neuron is x_2(v) = τ_2(l(v), x_1(v), x_1(ch_1(v)), x_2(ch_1(v)), x_1(ch_2(v)), x_2(ch_2(v)), x_1(pa_1(v)), x_1(pa_2(v))). By structural induction, we can assume that x_2(ch_1(v)) and x_2(ch_2(v)) contain the number representations of rep_1 of the two children. As shown previously, x_1(ch_1(v)) and x_1(ch_2(v)) represent the lengths of rep_1(ch_1(v)) and rep_1(ch_2(v)), respectively. Hence, the function

τ_2(a_1, . . . , a_8) = (0.1)^L · n_f + (0.1)^{2L} · n_( + (0.1)^{3L} · a_1 + (0.1)^{4L} · n_, + (0.1)^{4L} · a_4 + (0.1)^{5L} · a_3 · n_, + (0.1)^{5L} · a_3 · a_6 + (0.1)^{6L} · a_3 · a_5 · n_)

computes the desired number representing nrep_1(v).
Thereby, the concatenation of the digits representing the symbols in rep_1(v) is achieved by summing the appropriately shifted numbers. Note that the lengths of the computed subnumbers representing the children are available. This function τ_2 can be computed by a multiplicative neuron of order 2 with linear activation function.
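The shifting trick behind τ_2 can be checked numerically. The following sketch (a toy version with our own symbol codes, not the authors' construction) builds the decimal 0.nrep_1(v) for a vertex with label a and two empty children by summing appropriately shifted terms, using exact fractions to avoid floating-point trouble; the concrete codes and the width L = 2 are arbitrary choices of ours.

```python
from fractions import Fraction

L = 2  # width of each symbol code (toy choice)
n_f, n_lpar, n_rpar, n_comma, n_c = 90, 91, 92, 93, 94  # hypothetical codes
label_a = 11

def shift(k):
    return Fraction(1, 10 ** k)  # (0.1)^k, exactly

# An empty child is encoded by n_c: x2 = 0.n_c, x1 = (0.1)^L.
x2_ch1 = x2_ch2 = Fraction(n_c, 10 ** L)
x1_ch1 = x1_ch2 = shift(L)

# tau_2-style sum building 0.nrep_1(v), i.e. the digit string of f(a,c,c):
num = (shift(L) * n_f + shift(2 * L) * n_lpar + shift(3 * L) * label_a
       + shift(4 * L) * n_comma + shift(4 * L) * x2_ch1
       + shift(5 * L) * x1_ch1 * n_comma + shift(5 * L) * x1_ch1 * x2_ch2
       + shift(6 * L) * x1_ch1 * x1_ch2 * n_rpar)

expected = "90" "91" "11" "93" "94" "93" "94" "92"  # n_f n_( a n_, n_c n_, n_c n_)
assert num == Fraction(int(expected), 10 ** len(expected))
```

Each summand places one fixed-width code behind the digits already emitted; the factors (0.1)^{length} carried by the odd-numbered neurons supply exactly the extra shift needed behind a child's subnumber.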
Assume we have constructed hidden neurons up to number 2i. It is

rep_{i+1}(v) = f(l(v), rep_{i+1}(ch_1(v)), rep_{i+1}(ch_2(v)), rep_i(pa_1(v)), rep_i(pa_2(v))).

The functional dependence of hidden neuron number 2i + 1 is

x_{2i+1}(v) = τ_{2i+1}(l(v), x_1(v), . . . , x_{2i}(v), x_1(ch_1(v)), . . . , x_{2i+1}(ch_1(v)), x_1(ch_2(v)), . . . , x_{2i+1}(ch_2(v)), x_1(pa_1(v)), . . . , x_{2i}(pa_1(v)), x_1(pa_2(v)), . . . , x_{2i}(pa_2(v))).

By induction, x_{2i−1}(∗) yields the term exp_{0.1}(length(nrep_i(∗))), where ∗ is one of ch_1(v), ch_2(v), pa_1(v), and pa_2(v). By structural induction, the inputs x_{2i+1}(ch_1(v)) and x_{2i+1}(ch_2(v)), respectively, give the terms exp_{0.1}(length(nrep_{i+1}(ch_1(v)))) and exp_{0.1}(length(nrep_{i+1}(ch_2(v)))). Since rep_{i+1} consists of eight symbols and contains the subterms that describe the representations of children and parents, the function

τ_{2i+1}(a_1, . . . , a_{10i+3}) = (0.1)^{8L} · a_{4i+2} · a_{6i+3} · a_{8i+2} · a_{10i+2}

computes exp_{0.1}(length(nrep_{i+1}(v))). Note that this function can be computed by a multiplicative neuron of order 4 with linear activation function. The functional dependence of hidden neuron 2i + 2 is

x_{2i+2}(v) = τ_{2i+2}(l(v), x_1(v), . . . , x_{2i+1}(v), x_1(ch_1(v)), . . . , x_{2i+2}(ch_1(v)), x_1(ch_2(v)), . . . , x_{2i+2}(ch_2(v)), x_1(pa_1(v)), . . . , x_{2i+1}(pa_1(v)), x_1(pa_2(v)), . . . , x_{2i+1}(pa_2(v))).

By induction, x_{2j−1}(∗) yields the term exp_{0.1}(length(nrep_j(∗))) for j ≤ i + 1 and ∗ ∈ {ch_1(v), ch_2(v), pa_1(v), pa_2(v)}; x_{2j}(∗) yields the term 0.nrep_j(∗) for j ≤ i and ∗ ∈ {pa_1(v), pa_2(v)}; and x_{2j}(∗) gives the term 0.nrep_j(∗) for j ≤ i + 1 and ∗ ∈ {ch_1(v), ch_2(v)}. Hence, the function

τ_{2i+2}(a_1, . . . , a_{10i+8}) = (0.1)^L · n_f + (0.1)^{2L} · n_( + (0.1)^{3L} · a_1 + (0.1)^{4L} · n_, + (0.1)^{4L} · a_{4i+4} + (0.1)^{5L} · a_{4i+3} · n_, + (0.1)^{5L} · a_{4i+3} · a_{6i+6} + (0.1)^{6L} · a_{4i+3} · a_{6i+5} · n_, + (0.1)^{6L} · a_{4i+3} · a_{6i+5} · a_{8i+6} + (0.1)^{7L} · a_{4i+3} · a_{6i+5} · a_{8i+5} · n_, + (0.1)^{7L} · a_{4i+3} · a_{6i+5} · a_{8i+5} · a_{10i+7} + (0.1)^{8L} · a_{4i+3} · a_{6i+5} · a_{8i+5} · a_{10i+6} · n_)
computes the representation 0.nrep_{i+1}(v). Note that this function can be computed by a multiplicative neuron of order 4 with linear activation function.

A similar construction is possible if fan-in and fan-out k ≥ 2 are considered. Then products of at most 2k factors have to be dealt with; hence, CRecCC networks with multiplicative neurons of order at most 2k can in this case compute a unique representation.

It follows in the same way as in lemma 1 that the integration of feedforward networks into a CRecCC network is possible. In analogy to theorem 2, this can be extended to real-valued labels, and, mimicking the proof of theorem 3, we can substitute specific activation functions by standard ones. Hence, CRecCC networks with multiplicative neurons constitute universal approximators for structures, in analogy to theorem 3. This result has been proved for DBAGs with fan-in and fan-out limited by two. It can immediately be generalized to larger fan-in and fan-out k, giving us approximation completeness for these DBAGs for CRecCC networks with multiplicative neurons of order 2k.

In practical applications, usually standard neurons instead of multiplicative neurons are used. In analogy to the case of trees, we can substitute multiplicative neurons by networks of standard neurons in the following way:

Theorem 9. Assume L = R^n. Assume P is some probability measure on finite DBAGs over L. Assume f : DBAG_L^2 → R is a Borel-measurable function. Assume ε > 0, δ > 0. Assume σ : R → R is a continuous and locally C^2 squashing function. Then there exists a CRecCC network that has a linear activation function for the output neuron, and in which the functions τ_i of the other hidden neurons are computed by feedforward networks with at most two hidden layers and a universally bounded number of neurons with activation function σ. This CRecCC network computes a function g such that P(x ∈ DBAG_L^2 | | f(x) − g(x)| > δ) < ε.

Proof.
The network constructed in theorem 3 uses multiplicative neurons because of the computation of the linear codes of DBAGs as constructed in lemma 5. The multiplicative neurons used in this construction contain multiplications of order at most 4. We can substitute one multiplication x · y by the term ((x + y)^2 − (x − y)^2)/4. The function x → x^2 can in turn be approximated uniformly by the term (σ(x_1 + ε · x) + σ(x_1 − ε · x) − 2 · σ(x_1))/(ε^2 · σ''(x_1)) for small ε > 0, where x_1 is a point with σ''(x_1) ≠ 0. Hence, we can approximate the multiplicative neurons arbitrarily well by networks of standard neurons with activation function σ. The same argumentation as in theorem 3 shows that the approximation can be done uniformly up to inputs of arbitrarily small probability. Since the multiplicative neuron with maximum complexity in lemma 5
includes two multiplicative terms of order 2, two terms of order 3, and two terms of order 4, a feedforward network with at most two hidden layers and 49 neurons suffices to approximate the computation of one multiplicative neuron.

3.3.5 Some Implications. Note that the key point of the construction is given by theorem 8 and lemma 5. Theorem 8 defines a linear code that represents DBAGs up to a given height h over a finite alphabet L uniquely, and lemma 5 shows that this code can be computed in an appropriate way by a CRecCC network. The other ingredients are steps III to V of the construction for RCC networks, which can be transferred directly to trees and DBAGs.

Note that we can perform the above construction steps formally for DPAGs instead of DBAGs as well. The only problem is that the linear codes provided by rep_i might not encode all DPAGs uniquely. We have already seen one example of DPAGs (see theorem 4) that are not uniquely encoded by any rep_i. If such situations are excluded, universal approximation can be guaranteed for DPAGs as well:

Theorem 10. Assume L = R^n. Assume P is some probability measure on finite DPAGs over L. Assume f : DPAG_L^2 → R is a Borel-measurable function. Assume ε > 0, δ > 0. Assume σ : R → R is a continuous and locally C^1 squashing function. Assume the set {D ∈ DPAG_L^2 | ∃D′ ∈ DPAG_L^2 : f(D) ≠ f(D′), ∀i rep_i(root(D)) = rep_i(root(D′))} has probability at most δ/2. Then there exists a CRecCC network with multiplicative neurons with activation functions σ and a linear activation function for the last neuron of the network, which computes a function g such that P(x ∈ DPAG_L^2 | | f(x) − g(x)| > δ) < ε.

Proof. By assumption, we can disregard the set of DPAGs that are not uniquely encoded by rep_i. For the remaining part, we can construct an approximating CRecCC network as in the previous case.
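The substitution used in the proof of theorem 9 is easy to check numerically: x · y = ((x + y)^2 − (x − y)^2)/4 holds exactly, and the second difference (σ(x_1 + ε·x) + σ(x_1 − ε·x) − 2σ(x_1))/(ε^2 σ''(x_1)) approximates x^2 for small ε. The sketch below uses the logistic function as σ and x_1 = 1; these concrete choices are ours, not the article's.

```python
import math

def mul_via_squares(x, y):
    # exact identity: x*y = ((x + y)^2 - (x - y)^2) / 4
    return ((x + y) ** 2 - (x - y) ** 2) / 4

def sigma(t):                      # logistic squashing function (our choice)
    return 1.0 / (1.0 + math.exp(-t))

def sigma_pp(t):                   # its second derivative, sigma''(t)
    s = sigma(t)
    return s * (1.0 - s) * (1.0 - 2.0 * s)

def approx_square(x, x1=1.0, eps=1e-4):
    # second difference of sigma around x1; requires sigma''(x1) != 0
    return ((sigma(x1 + eps * x) + sigma(x1 - eps * x) - 2.0 * sigma(x1))
            / (eps ** 2 * sigma_pp(x1)))

assert mul_via_squares(3.0, 7.0) == 21.0
for x in (-2.0, 0.5, 3.0):
    assert abs(approx_square(x) - x * x) < 1e-3
```

The Taylor expansion of σ around x_1 shows that the odd-order terms cancel in the symmetric difference, leaving x^2 plus an O(ε^2) remainder.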
Note that the encoding has to be done only for DPAGs of limited height with discretized labels, that is, for a finite number of structures, such that a maximum number i that differentiates all these structures can be found. The remaining steps are analogous to those for DBAGs, whereby the encoding network computes rep_i for the above i. In practice, this allows us to approximate reasonable mappings on DPAGs as well: rep_i(D) = rep_i(D′) for two DPAGs D ≠ D′ and all i requires that, for every vertex in the two given DPAGs and every path following the connections toward children or parents, only vertices with the same label and the same number of children and parents are reached in D and D′. Hence, this requires highly symmetrical situations such that
CRecCC networks can be expected to perform quite well in practical applications even for general DPAGs.

We would like to add a remark on the form of the functions to be approximated. So far we have considered supersource transductions, that is, functions that map a structure to a single value, the recursively computed output of the supersource of the structure. IO isomorphic transductions refer to more general functions, which map a given input DPAG to a DPAG of the same structure in which each vertex label l(v) of the input DPAG is substituted by a value l′(v). Thereby, the value l′(v) of an output vertex might depend on the entire input DPAG. Thus, IO isomorphic transductions are functions f : DPAG_L^2 → DPAG_Y^2 with label sets L and Y such that f(x) has the same structure as x for every x ∈ DPAG_L^2. As before, we assume that L and Y are subsets of a real vector space. We call such a function f measurable if, for every fixed input structure and every fixed vertex in the structure, the mapping that maps the input DPAG to the output value of the vertex is Borel measurable. The distance of two output DBAGs of the same structure is measured by taking the maximum distance of the labels of vertices at the same position.

We can obviously obtain IO isomorphic transductions from CRecCC networks by setting the output label of a vertex v of a given input structure to the output of the CRecCC network after processing the vertex v. The question now is whether all measurable IO isomorphic transductions on DBAGs can be approximated by some CRecCC network. To address this question, we can use the same steps as before, whereby the only difference occurs for step I. We have to compute an appropriate output after each presentation of a vertex; thus, we have to find unique linear codes for each vertex in each possible DBAG. If this is possible, we can solve the problem of mapping each encoded vertex of a DBAG to the desired output as before.
Thus, steps II to V can be transferred from the previous case. Assume a DBAG D and a vertex v in D are given. Consider the representation rep_h(v). If the height of v is smaller than h, we can reconstruct the actual vertex v, including its label l(v), and the linear path representations of all paths from the root to v from this term because of lemma 3. In addition, we can mimic the construction of lemma 4 and collect all direct or indirect successors of v together with the linear path representations of all paths from the root to these vertices. For the predecessors of v, however, we have to add an argument: note that rep_h(v) has the form f(l(v), rep_h(ch_1(v)), rep_h(ch_2(v)), rep_{h−1}(pa_1(v)), rep_{h−1}(pa_2(v))). Thus, if the DBAG has height at most h − 2, we can also reconstruct the two parents of v and all linear path representations from rep_{h−1}(pa_1(v)) and rep_{h−1}(pa_2(v)). Iterating this argument yields the result that the vertices and all linear path representations of the DBAG can be reconstructed from rep_h(v) as in lemma 4 if the DBAG has height at most h/2 − 1. Note that for each reconstruction step, the index i of rep_i(v) is decreased by one, so that we have to start with twice the height. In other words, rep_{2h+1}(v) is a unique
linear representation of v with respect to the vertex v and the entire DBAG if the height of the DBAG is at most h. As a consequence, we can conclude that CRecCC networks are also universal approximators for IO isomorphic transductions on DBAGs.

4 Discussion

We have considered various cascade correlation models enhanced by recurrent or recursive connections for structured input data. These models have been proposed in the literature based on recurrent or recursive neural networks, in analogy to simple cascade correlation. Recurrent and recursive networks constitute well-established machine learning tools for dealing with sequences or tree structures as input, respectively. They have been applied successfully to various classification and regression problems on structured data, ranging from picture processing to chemistry (Frasconi et al., 1998). However, it is well known that training recurrent networks poses severe problems (Bengio et al., 1994), and the generalization ability might be considerably worse compared to standard feedforward networks (Hammer, 2001). Moreover, classification tasks for sequences and tree structures are likely to be more difficult than standard learning problems for feedforward networks because of the increased complexity of the data. Hence, constructive learning algorithms that divide the problem into several subtasks and automatically find an appropriate (preferably small) architecture are particularly useful within structured domains. Cascade correlation for structures has shown superior performance for several problems in chemistry (Bianucci et al., 2000) and seems particularly appropriate due to its very low training effort (the weights of only one neuron are trained at a time) and its automatically growing structure.
Unlike the modification of standard feedforward networks to feedforward cascade correlation networks, the specific training scheme of CC restricts the possible dynamics of recurrent cascade architectures to only partially recurrent systems. As a consequence, the class of recurrent cascade architectures constitutes a proper subclass of the class of recurrent network architectures. This might restrict the capability of these types of networks compared to the general model, as demonstrated, for example, in Giles et al. (1995). Hence, the question was whether the restriction of the recurrence in cascade models poses severe restrictions on the applicability of the models, because possibly only very simple functions can be represented with these restricted architectures. We have shown in this article that in terms of the universal approximation capability of these models, no restriction can be observed: CC architectures can approximate every measurable function with sequences or tree structures as input, respectively, arbitrarily well up to a set of arbitrarily small probability in spite of their restricted recurrence. So if fundamental differences concerning the capacity of these
systems in comparison to their fully recurrent counterparts exist (as demonstrated in Giles et al., 1995), then these fundamental differences will not manifest themselves for any given finite set of training data or for any given finite time horizon or height of the trees, respectively, provided enough neurons are available.

The restriction of recurrence in cascade models offers a very natural possibility to enlarge the dynamics: not only the information provided by the successors of the vertices of a graph or tree structure can be taken into account, but also the contextual information provided by the predecessors of the vertices. These contextual recursive models have been proposed in Micheli et al. (2002, in press). It has been demonstrated that this additional information is actually used by the contextual models, so that tasks involving global information of the structure can be solved more easily with these models. Note also that for standard recurrent and recursive networks, several models that integrate global information for restricted forms of structured data such as sequences or grids have been proposed in the literature and successfully applied, for example, in bioinformatics, where spatial data, instead of temporal data, often have to be dealt with (Baldi et al., 1999; Micheli et al., 2000; Pollastri et al., 2002; Wakuya & Zurada, 2001). However, unlike contextual cascade models, structure processing is there restricted to special subclasses, and the integration of contextual information is done on top of the basic recursive networks. This is necessary because of the full recurrence of these networks: context integration in the form of structural induction would yield cyclic, ill-defined dynamics. The restriction of the recurrence in cascade models allows integrating the contextual information directly into the model in an elegant and efficient way. As already demonstrated in Micheli et al.
(in press), this enlarges the class of structures the models can deal with from tree structures to acyclic rooted positional graphs with limited fan-in and fan-out. We went a step further in this article by showing that contextual cascade models in fact constitute universal approximators for the class of functions defined on acyclic directed graphs as input. More precisely, we have identified a subclass of DPAGs, which we called DBAGs, where enumeration of children and parents is compatible, and we have shown the universal approximation capability for this subclass. Moreover, we have shown that any DPAG can be transformed into a DBAG in a natural way via possible expansion by empty children and parents and reenumeration of children and parents. Hence, acyclic graph structures where the positioning is not relevant can easily be dealt with in this way. In addition, the approximation capability is extended to IO isomorphic transductions; CRecCC networks constitute one of the first models for which a proof exists that they can also approximate functions with structured outputs. Thus, the universal approximation capability of cascade architectures with restricted recurrence and context integration is enlarged to general data structures such as rooted directed acyclic graphs with limited fan-in and fan-out.
Appendix

A.1 Proof of Theorem 4. Assume a stationary CRecCC network is given. Consider the two graphs depicted in Figure 3. We refer to the vertices by the numbers indicated in the figure. Thereby, the enumeration of children and parents is done from left to right as they appear in the figure: the upper left box of a vertex refers to parent number one and the upper right box to parent number two, and the lower left box indicates child number one and the lower right box child number two. The labels of all vertices are identical, say l. Note that the structures are highly symmetrical: each pair of vertices i and i′ has the same number of children and parents and the same label. Moreover, i and i′ point to vertices that have the same structure in the left and the right DPAG. We point out one aspect of these structures: consider the edges (0, 4) and (0′, 4′). 4 and 4′ constitute the respective second child of the root, whereas 0 and 0′ constitute the first parent of 4 and 4′, respectively. For all other edges, the positioning of the respective parent coincides with the positioning of the respective child.

As a first step, we show by induction over i that the following equalities hold for the hidden neurons' activations: x_i(1) = x_i(4), x_i(2) = x_i(5), x_i(3) = x_i(6), x_i(1′) = x_i(4′), x_i(2′) = x_i(5′), x_i(3′) = x_i(6′). This indicates that a large symmetry can already be found within each structure.

i = 1:

x_1(3) = τ_1(l, 0_c^1, 0_c^1) = x_1(6),
x_1(2) = τ_1(l, x_1(3), 0_c^1) = τ_1(l, x_1(6), 0_c^1) = x_1(5),
x_1(1) = τ_1(l, x_1(2), x_1(3)) = τ_1(l, x_1(5), x_1(6)) = x_1(4)

and
x_1(3′) = τ_1(l, 0_c^1, 0_c^1) = x_1(6′),
x_1(2′) = τ_1(l, x_1(3′), 0_c^1) = τ_1(l, x_1(6′), 0_c^1) = x_1(5′),
x_1(1′) = τ_1(l, x_1(2′), x_1(6′)) = τ_1(l, x_1(5′), x_1(3′)) = x_1(4′).
Induction step for i:

x_i(3) = τ_i(l, x_1(3), . . . , x_{i−1}(3), 0_c^1, . . . , 0_c^i, 0_c^1, . . . , 0_c^i, x_1(2), . . . , x_{i−1}(2), x_1(1), . . . , x_{i−1}(1))
      = τ_i(l, x_1(6), . . . , x_{i−1}(6), 0_c^1, . . . , 0_c^i, 0_c^1, . . . , 0_c^i, x_1(5), . . . , x_{i−1}(5), x_1(4), . . . , x_{i−1}(4)) = x_i(6)
x_i(2) = τ_i(l, x_1(2), . . . , x_{i−1}(2), x_1(3), . . . , x_i(3), 0_c^1, . . . , 0_c^i, x_1(1), . . . , x_{i−1}(1), 0_p^1, . . . , 0_p^{i−1})
      = τ_i(l, x_1(5), . . . , x_{i−1}(5), x_1(6), . . . , x_i(6), 0_c^1, . . . , 0_c^i, x_1(4), . . . , x_{i−1}(4), 0_p^1, . . . , 0_p^{i−1}) = x_i(5)

x_i(1) = τ_i(l, x_1(1), . . . , x_{i−1}(1), x_1(2), . . . , x_i(2), x_1(3), . . . , x_i(3), x_1(0), . . . , x_{i−1}(0), 0_p^1, . . . , 0_p^{i−1})
      = τ_i(l, x_1(4), . . . , x_{i−1}(4), x_1(5), . . . , x_i(5), x_1(6), . . . , x_i(6), x_1(0), . . . , x_{i−1}(0), 0_p^1, . . . , 0_p^{i−1}) = x_i(4)

and

x_i(3′) = τ_i(l, x_1(3′), . . . , x_{i−1}(3′), 0_c^1, . . . , 0_c^i, 0_c^1, . . . , 0_c^i, x_1(2′), . . . , x_{i−1}(2′), x_1(4′), . . . , x_{i−1}(4′))
       = τ_i(l, x_1(6′), . . . , x_{i−1}(6′), 0_c^1, . . . , 0_c^i, 0_c^1, . . . , 0_c^i, x_1(5′), . . . , x_{i−1}(5′), x_1(1′), . . . , x_{i−1}(1′)) = x_i(6′)

x_i(2′) = τ_i(l, x_1(2′), . . . , x_{i−1}(2′), x_1(3′), . . . , x_i(3′), 0_c^1, . . . , 0_c^i, x_1(1′), . . . , x_{i−1}(1′), 0_p^1, . . . , 0_p^{i−1})
       = τ_i(l, x_1(5′), . . . , x_{i−1}(5′), x_1(6′), . . . , x_i(6′), 0_c^1, . . . , 0_c^i, x_1(4′), . . . , x_{i−1}(4′), 0_p^1, . . . , 0_p^{i−1}) = x_i(5′)

x_i(1′) = τ_i(l, x_1(1′), . . . , x_{i−1}(1′), x_1(2′), . . . , x_i(2′), x_1(6′), . . . , x_i(6′), x_1(0′), . . . , x_{i−1}(0′), 0_p^1, . . . , 0_p^{i−1})
       = τ_i(l, x_1(4′), . . . , x_{i−1}(4′), x_1(5′), . . . , x_i(5′), x_1(3′), . . . , x_i(3′), x_1(0′), . . . , x_{i−1}(0′), 0_p^1, . . . , 0_p^{i−1}) = x_i(4′).
Next, we show by induction over i that x_i(n) = x_i(n′) holds for all n ∈ {0, . . . , 6}. In particular, any CRecCC network yields the same output for both graphs, independent of the precise form of the network functions τ_i.

i = 1:

x_1(3) = x_1(6) = τ_1(l, 0_c^1, 0_c^1) = x_1(3′) = x_1(6′)
x_1(2) = x_1(5) = τ_1(l, x_1(3), 0_c^1) = τ_1(l, x_1(3′), 0_c^1) = x_1(2′) = x_1(5′)
x_1(1) = x_1(4) = τ_1(l, x_1(2), x_1(3)) = τ_1(l, x_1(2′), x_1(6′)) = x_1(1′) = x_1(4′)
x_1(0) = τ_1(l, x_1(1), x_1(4)) = τ_1(l, x_1(1′), x_1(4′)) = x_1(0′)
Induction step for i:

x_i(3) = τ_i(l, x_1(3), . . . , x_{i−1}(3), 0_c^1, . . . , 0_c^i, 0_c^1, . . . , 0_c^i, x_1(2), . . . , x_{i−1}(2), x_1(1), . . . , x_{i−1}(1))
      = τ_i(l, x_1(3), . . . , x_{i−1}(3), 0_c^1, . . . , 0_c^i, 0_c^1, . . . , 0_c^i, x_1(2), . . . , x_{i−1}(2), x_1(4), . . . , x_{i−1}(4))
      = τ_i(l, x_1(3′), . . . , x_{i−1}(3′), 0_c^1, . . . , 0_c^i, 0_c^1, . . . , 0_c^i, x_1(2′), . . . , x_{i−1}(2′), x_1(4′), . . . , x_{i−1}(4′))
      = x_i(3′) = x_i(6′) = x_i(6)

x_i(2) = τ_i(l, x_1(2), . . . , x_{i−1}(2), x_1(3), . . . , x_i(3), 0_c^1, . . . , 0_c^i, x_1(1), . . . , x_{i−1}(1), 0_p^1, . . . , 0_p^{i−1})
      = τ_i(l, x_1(2′), . . . , x_{i−1}(2′), x_1(3′), . . . , x_i(3′), 0_c^1, . . . , 0_c^i, x_1(1′), . . . , x_{i−1}(1′), 0_p^1, . . . , 0_p^{i−1})
      = x_i(2′) = x_i(5′) = x_i(5)

x_i(1) = τ_i(l, x_1(1), . . . , x_{i−1}(1), x_1(2), . . . , x_i(2), x_1(3), . . . , x_i(3), x_1(0), . . . , x_{i−1}(0), 0_p^1, . . . , 0_p^{i−1})
      = τ_i(l, x_1(1), . . . , x_{i−1}(1), x_1(2), . . . , x_i(2), x_1(6), . . . , x_i(6), x_1(0), . . . , x_{i−1}(0), 0_p^1, . . . , 0_p^{i−1})
      = τ_i(l, x_1(1′), . . . , x_{i−1}(1′), x_1(2′), . . . , x_i(2′), x_1(6′), . . . , x_i(6′), x_1(0′), . . . , x_{i−1}(0′), 0_p^1, . . . , 0_p^{i−1})
      = x_i(1′) = x_i(4′) = x_i(4)

x_i(0) = τ_i(l, x_1(0), . . . , x_{i−1}(0), x_1(1), . . . , x_i(1), x_1(4), . . . , x_i(4), 0_p^1, . . . , 0_p^{i−1}, 0_p^1, . . . , 0_p^{i−1})
      = τ_i(l, x_1(0′), . . . , x_{i−1}(0′), x_1(1′), . . . , x_i(1′), x_1(4′), . . . , x_i(4′), 0_p^1, . . . , 0_p^{i−1}, 0_p^1, . . . , 0_p^{i−1})
      = x_i(0′).

Hence, the structures cannot be differentiated by any CRecCC network, regardless of the chosen weights of the network and the form of the neurons.
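The argument can also be replayed mechanically. The sketch below is our own reconstruction: the adjacency tables of the two graphs are read off from the τ_i argument lists above (Figure 3 itself is not reproduced here, so these tables are an inference), and τ_i is an arbitrary but fixed deterministic function; the simulation confirms that every hidden neuron takes identical values on corresponding vertices of the two DPAGs.

```python
E = None  # marker for an empty child or parent

# children[v] = (ch1, ch2), parents[v] = (pa1, pa2); our reading of Figure 3
A_children = {0: (1, 4), 1: (2, 3), 2: (3, E), 3: (E, E),
              4: (5, 6), 5: (6, E), 6: (E, E)}
A_parents = {0: (E, E), 1: (0, E), 2: (1, E), 3: (2, 1),
             4: (0, E), 5: (4, E), 6: (5, 4)}
B_children = {0: (1, 4), 1: (2, 6), 2: (3, E), 3: (E, E),
              4: (5, 3), 5: (6, E), 6: (E, E)}
B_parents = {0: (E, E), 1: (0, E), 2: (1, E), 3: (2, 4),
             4: (0, E), 5: (4, E), 6: (5, 1)}

def tau(i, args):
    # an "arbitrary" but fixed network function; any deterministic map works
    return hash((i,) + tuple(args)) % 10 ** 6

def run(children, parents, layers, label=7):
    order = (3, 6, 2, 5, 1, 4, 0)   # leaves first, so x_i of children is ready
    zc, zp = 11, 13                 # initial contexts 0_c^j and 0_p^j
    x = {v: [] for v in children}   # x[v][j-1] holds x_j(v)
    for i in range(1, layers + 1):
        for v in order:
            args = [label] + x[v][: i - 1]
            for c in children[v]:   # children contribute x_1 .. x_i
                args += x[c][:i] if c is not E else [zc] * i
            for p in parents[v]:    # parents contribute x_1 .. x_{i-1}
                args += x[p][: i - 1] if p is not E else [zp] * (i - 1)
            x[v].append(tau(i, args))
    return x

xa = run(A_children, A_parents, layers=5)
xb = run(B_children, B_parents, layers=5)
assert xa == xb                     # no tau_i can tell the two DPAGs apart
```

Since `hash` on tuples of integers is deterministic in CPython, the comparison is meaningful; replacing `tau` by any other deterministic function leaves the equality intact, exactly as the proof asserts.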
Figure 4: Example of the transformation of a graph (left), for which a compatible positioning has to be found, to the graph where each vertex is split into a parent vertex denoted by 0, . . . , 3 and a child vertex denoted by 0′, . . . , 3′ (middle). Thereby, connections to empty children and connections from empty parents are dropped. If the connections are considered as undirected connections, the dual graph can be constructed (right). Now an enumeration of the vertices in the dual graph that corresponds to a compatible positioning of children and parents in the original DPAG can be constructed.
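The split-graph and dual-graph construction used in the proof below can be sketched directly from an edge list. This is a toy version with our own names, not the authors' code; vertices u_p and v_c are modeled as tagged pairs.

```python
def split_graph(edges):
    # split each vertex into a parent copy ('p', u) and a child copy ('c', v)
    return [(('p', u), ('c', v)) for (u, v) in edges]

def dual_graph(split_edges):
    # one dual vertex per split edge; connect edges sharing an endpoint,
    # labeled 'i' (via a common child) or 'o' (via a common parent)
    dual_edges = []
    for a in range(len(split_edges)):
        for b in range(a + 1, len(split_edges)):
            (up1, vc1), (up2, vc2) = split_edges[a], split_edges[b]
            if vc1 == vc2:                  # connected via a child: both ingoing
                dual_edges.append((a, b, 'i'))
            if up1 == up2:                  # connected via a parent: both outgoing
                dual_edges.append((a, b, 'o'))
    return dual_edges

# small DPAG: root 0 with children 1 and 2, both pointing to vertex 3
edges = [(0, 1), (0, 2), (1, 3), (2, 3)]
du = dual_graph(split_graph(edges))
assert du == [(0, 1, 'o'), (2, 3, 'i')]
```

Edges 0 and 1 leave the same parent copy of vertex 0 (label o), while edges 2 and 3 enter the same child copy of vertex 3 (label i), matching the construction described in the proof.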
A.2 Proof of Theorem 5. Assume a DPAG D is given. If D is empty, the result is obvious. Assume D is not empty. The first construction steps can be done for any limited fan-in and fan-out at most k; thus we describe the steps for general k. Given D, consider the following related (split) graph D′ where the role of parents and children is explicitly assigned to the vertices: the vertices of D′ are {u_p | ∃(u, v) ∈ edge(D)} ∪ {v_c | ∃(u, v) ∈ edge(D)}, and the edges of D′ are {(u_p, v_c) | ∃(u, v) ∈ edge(D)}. See Figure 4 (middle) for an example. This graph can be transformed into its dual graph D∨ with edges labeled in {i, o}, which we obtain as follows: the dual graph introduces a vertex v_e for each edge e in D′ and an edge (v_e, v_f) if a vertex in D′ exists such that e and f are both connected to this vertex, that is, e = (u_p, v_c) and f = (u′_p, v_c) (the edges are connected via a child) or e = (u_p, v_c) and f = (u_p, v′_c) (the edges are connected via a parent) for some vertices u_p, v_c, u′_p, v′_c in D′. An edge (v_e, v_f) in D∨ is labeled with i if e and f are connected via a child in D′; both are ingoing edges in D′. An edge (v_e, v_f) in D∨ is labeled with o if e and f are connected via a parent in D′; both are outgoing edges in D′. See Figure 4 (right) for an example. Now we can relate positionings of D to positionings of D′ and positionings of D′ to enumerations of D∨ to prove the theorem. First, consider D and D′. D′ has fan-in and fan-out k like D. Since the positioning of the children of a fixed vertex can be done independent of the positioning of the parents
Figure 5: Example of the structure of the dual graph (right) for a given DPAG (left). It decomposes into linear subgraphs and ring structures with an even number of vertices.
of this vertex, the following holds: there is a one-to-one connection between compatible positionings (P′_v, C′_v) of the vertices v in D′ and (P_v, C_v) of the vertices in D. Thereby, D′ is no longer a rooted DPAG; however, the notion of a compatible positioning in the above form can also be applied to unrooted DPAGs. The one-to-one connection is given by the identities P′_{v_c}(u_p) = P_v(u) and C′_{u_p}(v_c) = C_u(v). This holds because the positioning of the (empty) children of v_c and of the (empty) parents of u_p can be done arbitrarily. Note that any compatible positioning of D′ corresponds to the assignment of a number in {1, …, k} to each edge in the graph such that edges connected to the same vertex have different numbers. An edge (u_p, v_c) is assigned the number P′_{v_c}(u_p) = C′_{u_p}(v_c); these values coincide for a compatible positioning. Since the numbers come from a positioning, edges connected to the same vertex have different numbers. Conversely, any assignment of numbers {1, …, k} to the edges of D′ such that edges connected to the same vertex have different numbers can be transformed into a compatible positioning. If a value i is assigned to edge (u_p, v_c), we set P′_{v_c}(u_p) = C′_{u_p}(v_c) = i. Since edges connected to the same vertex have different numbers, this defines a valid positioning. Now consider the dual graph D∨. Assigning numbers to the edges in D′ such that edges in D′ connected to the same vertex have different numbers means assigning numbers to the vertices in D∨ such that vertices that are connected in D∨ have different numbers. We have already argued that such an assignment for D′ leads to a compatible positioning of the graph D. We now use the restriction on k. Since D′ has limited fan-in and fan-out k = 2, each vertex in D∨ has at most one connection labeled with i and one connection labeled with o. Thus, each vertex in D∨ has at most two edges. Hence, D∨ decomposes into connected components that have a very simple structure: the components either constitute a linear graph or have a ring structure (see Figure 5). Thereby, each ring necessarily has an even number of vertices because each vertex is connected by one edge labeled
with i and one edge labeled with o. The assignment of values in {1, 2} can be done independently for each connected component of D∨. For a linear structure, we can simply assign consecutively alternating values to the vertices. For a ring, we can do the same starting at some arbitrary position because the number of vertices in a ring is even. Thus, an appropriate assignment of numbers to the vertices of D∨ such that a compatible positioning of D arises is always possible.

A.3 Proof of Theorem 6. As before, we can construct the graph D′ from D with vertices {u_p | ∃(u, v) ∈ edge(D)} ∪ {v_c | ∃(u, v) ∈ edge(D)} and edges {(u_p, v_c) | ∃(u, v) ∈ edge(D)}, and its dual graph D∨. Each vertex in D∨ has at most 2(k − 1) connections, since each edge in D′ is connected to a child with at most k − 1 further ingoing edges in D′ and a parent with at most k − 1 further outgoing edges in D′. Hence, we can obviously assign numbers {1, …, 2k − 1} to the vertices in D∨ such that no two vertices connected in D∨ have the same numbers assigned: even if all 2(k − 1) neighbors of a vertex are already enumerated, there is still one number available. These numbers correspond to a compatible positioning of children and parents in the original DPAG D with positions in {1, …, 2k − 1}. Hence, filling the unassigned positions for children and parents with empty children and parents, we obtain a DPAG from D with compatible positioning and fan-in and fan-out 2k − 1, thereby expanding with empty children and parents.

Acknowledgments

This work has been partially supported by MIUR grant 2002093941 004. We thank two anonymous referees for profound and valuable comments on an earlier version of the manuscript.
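The constructions used in these proofs (the split graph D′, its dual D∨, and the greedy assignment of positions in {1, …, 2k − 1}) can be illustrated in code. The following sketch is ours, not from the paper; the example graph and all helper names are invented for illustration:

```python
from itertools import count

def split_graph(edges):
    # D': each edge (u, v) of the DPAG D becomes an edge between a
    # dedicated parent copy u_p and a child copy v_c.
    return [((u, "p"), (v, "c")) for (u, v) in edges]

def dual_graph(split_edges):
    # D^v: one vertex per edge of D'; two vertices are adjacent when the
    # corresponding edges of D' share an endpoint (a common child gives an
    # i-connection, a common parent an o-connection).
    adj = {i: set() for i in range(len(split_edges))}
    for i in range(len(split_edges)):
        for j in range(i + 1, len(split_edges)):
            if set(split_edges[i]) & set(split_edges[j]):
                adj[i].add(j)
                adj[j].add(i)
    return adj

def greedy_positioning(adj):
    # Greedily number the vertices of D^v so that adjacent vertices differ;
    # a vertex with d already-numbered neighbors always has a free number
    # in {1, ..., d + 1}, which gives the 2k - 1 bound of Theorem 6.
    colors = {}
    for v in sorted(adj):
        used = {colors[u] for u in adj[v] if u in colors}
        colors[v] = next(c for c in count(1) if c not in used)
    return colors

# A small DPAG with fan-in and fan-out at most k = 2 (a "diamond").
edges = [(0, 1), (0, 2), (1, 3), (2, 3)]
adj = dual_graph(split_graph(edges))
colors = greedy_positioning(adj)
```

Here each vertex of D∨ has degree at most 2(k − 1) = 2, so the greedy numbering uses at most 2k − 1 = 3 values, matching the bound in the proof.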
Received January 13, 2004; accepted August 10, 2004.
LETTER
Communicated by Peter Bartlett
SVM Soft Margin Classifiers: Linear Programming versus Quadratic Programming

Qiang Wu
[email protected]
Ding-Xuan Zhou
[email protected]
Department of Mathematics, City University of Hong Kong, Kowloon, Hong Kong, China
Support vector machine (SVM) soft margin classifiers are important learning algorithms for classification problems. They can be stated as convex optimization problems and are suitable for a large data setting. Linear programming SVM classifiers are especially efficient for very large size samples. But little is known about their convergence, compared with the well-understood quadratic programming SVM classifier. In this article, we point out the difficulty and provide an error analysis. Our analysis shows that the convergence behavior of the linear programming SVM is almost the same as that of the quadratic programming SVM. This is implemented by setting a stepping-stone between the linear programming SVM and the classical 1-norm soft margin classifier. An upper bound for the misclassification error is presented for general probability distributions. Explicit learning rates are derived for deterministic and weakly separable distributions, and for distributions satisfying some Tsybakov noise condition. 1 Introduction Support vector machines (SVMs) form an important subject in learning theory. They are very efficient for many applications, especially for classification problems. The classical SVM model, the so-called 1-norm soft margin SVM, was introduced with polynomial kernels by Boser, Guyon, and Vapnik (1992) and with general kernels by Cortes and Vapnik (1995). Since then, many different forms of SVM algorithms have been introduced for different purposes (e.g., Niyogi & Girosi, 1996; Vapnik, 1998). Among them, the linear programming (LP) SVM (Bradley & Mangasarian, 2000; Kecman & Hadzic, 2000; Niyogi & Girosi, 1996; Pedroso & Murata, 2001; Vapnik, 1998) is important because of its linearity and flexibility for large data setting. The term linear programming means that the algorithm is based on linear programming optimization. Correspondingly, the 1-norm soft margin SVM is Neural Computation 17, 1160–1187 (2005)
© 2005 Massachusetts Institute of Technology
also called quadratic programming (QP) SVM, since it is based on quadratic programming optimization (Vapnik, 1998). Many experiments demonstrate that LP-SVM is efficient and performs even better than QP-SVM for some purposes: it is capable of solving huge sample size problems (Bradley & Mangasarian, 2000), improving computational speed (Pedroso & Murata, 2001), and reducing the number of support vectors (Kecman & Hadzic, 2000). While the convergence of QP-SVM has become well understood because of recent work (Steinwart, 2002; Zhang, 2004; Wu & Zhou, 2004; Scovel & Steinwart, in press; Wu, Ying, & Zhou, in press), little is known about LP-SVM. The purpose of this article is to point out the main difficulty and then provide error analysis for LP-SVM. Consider the binary classification setting. Let (X, d) be a compact metric space and Y = {1, −1}. A binary classifier is a function f : X → Y, which labels every point x ∈ X with some y ∈ Y. Both LP-SVM and QP-SVM considered here are kernel-based classifiers. A function K : X × X → R is called a Mercer kernel if it is continuous, symmetric, and positive semidefinite, that is, for any finite set of distinct points {x_1, …, x_ℓ} ⊂ X, the matrix (K(x_i, x_j))_{i,j=1}^ℓ is positive semidefinite. Let z = {(x_1, y_1), …, (x_m, y_m)} ⊂ (X × Y)^m be the sample. Motivated by reducing the number of support vectors of the 1-norm soft margin SVM, Vapnik (1998) introduced the LP-SVM algorithm associated with a Mercer kernel K. It is based on the following linear programming optimization problem:
min_{α ∈ R^m_+, b ∈ R}  (1/m) Σ_{i=1}^m ξ_i + (1/C) Σ_{i=1}^m α_i
subject to  y_i ( Σ_{j=1}^m α_j y_j K(x_i, x_j) + b ) ≥ 1 − ξ_i,
            ξ_i ≥ 0,  i = 1, …, m.   (1.1)

Here α = (α_1, …, α_m), and the ξ_i are slack variables. The trade-off parameter C = C(m) > 0 depends on m and is crucial. If (α_z = (α_{1,z}, …, α_{m,z}), b_z) solves the optimization problem, equation 1.1, the LP-SVM classifier is given by sgn(f_z) with

f_z(x) = Σ_{i=1}^m α_{i,z} y_i K(x, x_i) + b_z.   (1.2)
For a real-valued function f : X → R, its sign function is defined as sgn( f )(x) = 1 if f (x) ≥ 0 and sgn( f )(x) = − 1 otherwise.
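For concreteness, problem 1.1 can be handed to any off-the-shelf LP solver. Below is a minimal sketch (our own toy example, not code from the paper) using scipy.optimize.linprog with a Gaussian kernel; the free offset b is split into two nonnegative variables so that all unknowns are nonnegative:

```python
import numpy as np
from scipy.optimize import linprog

def gaussian_kernel(a, b, sigma=1.0):
    return np.exp(-((a[:, None] - b[None, :]) ** 2) / sigma**2)

def lp_svm_fit(x, y, C=10.0, sigma=1.0):
    """Solve problem 1.1; variables are [alpha (m), xi (m), b_plus, b_minus]."""
    m = len(y)
    K = gaussian_kernel(x, x, sigma)
    # Objective: (1/C) sum(alpha) + (1/m) sum(xi); b does not enter the cost.
    c = np.concatenate([np.full(m, 1.0 / C), np.full(m, 1.0 / m), [0.0, 0.0]])
    # y_i (sum_j alpha_j y_j K(x_i, x_j) + b) + xi_i >= 1, rewritten as <= -1.
    A_ub = np.zeros((m, 2 * m + 2))
    A_ub[:, :m] = -(y[:, None] * y[None, :] * K)
    A_ub[:, m:2 * m] = -np.eye(m)
    A_ub[:, 2 * m] = -y
    A_ub[:, 2 * m + 1] = y
    res = linprog(c, A_ub=A_ub, b_ub=-np.ones(m),
                  bounds=[(0, None)] * (2 * m + 2))
    alpha = res.x[:m]
    b = res.x[2 * m] - res.x[2 * m + 1]
    return alpha, b

def lp_svm_predict(x_train, y_train, alpha, b, x, sigma=1.0):
    f = gaussian_kernel(x, x_train, sigma) @ (alpha * y_train) + b
    return np.where(f >= 0, 1, -1)   # sgn(f_z), with sgn(0) = 1

x = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([-1, -1, 1, 1])
alpha, b = lp_svm_fit(x, y)
```

On this separable toy set the solver drives the slacks to zero, so the training points are classified correctly; how the trade-off parameter C = C(m) should grow with m is exactly the question the article studies.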
The QP-SVM is based on a quadratic programming optimization problem:

min_{α̃ ∈ R^m_+, b̃ ∈ R}  (1/m) Σ_{i=1}^m ξ_i + (1/2C̃) Σ_{i,j=1}^m α̃_i y_i K(x_i, x_j) α̃_j y_j
subject to  y_i ( Σ_{j=1}^m α̃_j y_j K(x_i, x_j) + b̃ ) ≥ 1 − ξ_i,
            ξ_i ≥ 0,  i = 1, …, m.   (1.3)

Here, C̃ = C̃(m) > 0 is also a trade-off parameter depending on the sample size m. If (α̃_z = (α̃_{1,z}, …, α̃_{m,z}), b̃_z) solves the optimization problem 1.3, then the 1-norm soft margin classifier is defined by sgn(f̃_z) with

f̃_z(x) = Σ_{i=1}^m α̃_{i,z} y_i K(x, x_i) + b̃_z.   (1.4)
Observe that both the LP-SVM classifier (see equation 1.1) and the QP-SVM classifier (see equation 1.3) are implemented by convex optimization problems. Compared with this, neural network learning algorithms are often performed by nonconvex optimization problems. The reproducing kernel property of Mercer kernels ensures nice approximation power of SVM classifiers. Recall that the reproducing kernel Hilbert space (RKHS) H_K associated with a Mercer kernel K is defined (Aronszajn, 1950) to be the closure of the linear span of the set of functions {K_x := K(x, ·) : x ∈ X} with the inner product ⟨·, ·⟩_K satisfying ⟨K_x, K_y⟩_K = K(x, y). The reproducing property is given by

⟨f, K_x⟩_K = f(x),   ∀x ∈ X, f ∈ H_K.   (1.5)
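The Mercer property can be sanity-checked numerically: the Gram matrix of a Gaussian kernel on any finite point set should be symmetric and positive semidefinite (a small sketch; the sample points are arbitrary choices of ours):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 2))                      # arbitrary points in R^2
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-d2 / 1.0)                            # K(x, y) = exp(-|x - y|^2 / sigma^2)

assert np.allclose(K, K.T)                       # symmetric
assert np.linalg.eigvalsh(K).min() > -1e-10      # positive semidefinite
```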
The QP-SVM is well understood. It has attractive approximation properties (see section 2) because the learning scheme can be represented as a Tikhonov regularization (Evgeniou, Pontil, & Poggio, 2000) (modified by an offset) associated with the RKHS,

f̃_z = arg min_{f = f* + b ∈ H_K + R} { (1/m) Σ_{i=1}^m (1 − y_i f(x_i))_+ + (1/2C̃) ‖f*‖_K² },   (1.6)

where (t)_+ = max{0, t}. Set H̄_K := H_K + R. For a function f = f_1 + b_1 ∈ H̄_K, we denote f* = f_1 and b_f = b_1. Write b_{f̃_z} as b̃_z. It turns out that equation 1.6 is the same as 1.3 together with 1.4. To see this, we first note that f̃*_z must lie in the span of {K_{x_i}}_{i=1}^m according
to the representation theorem (Wahba, 1990). Next, the dual problem of equation 1.6 shows (Vapnik, 1998) that the coefficient of K_{x_i}, namely α̃_i y_i, has the same sign as y_i. Finally, the definition of the H_K norm yields ‖f̃*_z‖_K² = ‖Σ_{i=1}^m α̃_i y_i K_{x_i}‖_K² = Σ_{i,j=1}^m α̃_i y_i K(x_i, x_j) α̃_j y_j. The rich knowledge on Tikhonov regularization schemes and the idea of bias-variance trade-off developed in the neural network literature provide a mathematical foundation for the QP-SVM. In particular, the convergence is well understood due to the work done within the past few years. Here the form 1.6 illustrates some advantages of the QP-SVM: the minimization is taken over the whole space H̄_K, so we expect the QP-SVM to have good approximation power, similar to the approximation error of the space H_K. Things are totally different for LP-SVM. Set

H_{K,z} = { Σ_{i=1}^m α_i y_i K(x, x_i) : α = (α_1, …, α_m) ∈ R^m_+ }.
Then the LP-SVM scheme, equation 1.1, can be written as

f_z = arg min_{f = f* + b ∈ H_{K,z} + R} { (1/m) Σ_{i=1}^m (1 − y_i f(x_i))_+ + (1/C) Ω(f*) }.   (1.7)
Here we have denoted Ω(f*) = ‖α‖_1 = Σ_{i=1}^m α_i for f* = Σ_{i=1}^m α_i y_i K_{x_i} with α_i ≥ 0. It plays the role of a norm of f* in some sense. This is not a Hilbert space norm, which raises the technical difficulty for mathematical analysis. More serious, the hypothesis space H_{K,z} depends on the sample z. The "centers" x_i of the basis functions in H_{K,z} are determined by the sample z; they are not free. One might consider regularization schemes in the space of all linear combinations with free centers, but whether the minimization can be reduced to a convex optimization problem of size m, like equation 1.1, is unknown. Also, it is difficult to relate the corresponding optimum (in a ball with radius C) to f*_z with respect to the estimation error. Thus, separating the error for LP-SVM into two terms of sample error and approximation error is not as immediate as for the QP-SVM or neural network methods (Niyogi & Girosi, 1996), where the centers are free. In this article, we overcome this difficulty by setting a stepping-stone. Turn to the error analysis. Let ρ be a Borel probability measure on Z := X × Y and (X, Y) be the corresponding random variable. The prediction power of a classifier f is measured by its misclassification error, that is, the probability of the event f(X) ≠ Y:

R(f) = Prob{f(X) ≠ Y} = ∫_X P(Y ≠ f(x)|x) dρ_X.   (1.8)
Here ρ_X is the marginal distribution, and ρ(·|x) is the conditional distribution of ρ. The classifier minimizing the misclassification error is called the Bayes rule f_c. It takes the form

f_c(x) = 1 if P(Y = 1|x) ≥ P(Y = −1|x), and f_c(x) = −1 if P(Y = 1|x) < P(Y = −1|x).

If we define the regression function of ρ as

f_ρ(x) = ∫_Y y dρ(y|x) = P(Y = 1|x) − P(Y = −1|x),   x ∈ X,
then f_c = sgn(f_ρ). Note that for a real-valued function f, sgn(f) gives a classifier, and its misclassification error will be denoted by R(f) for abbreviation. Although the Bayes rule exists, it cannot be found directly since ρ is unknown. Instead, we have in hand a set of samples z = {z_i}_{i=1}^m = {(x_1, y_1), …, (x_m, y_m)} (m ∈ N). Throughout the article, we assume {z_1, …, z_m} are independently and identically distributed according to ρ. A classification algorithm constructs a classifier f_z based on z. Our goal is to understand how to choose the parameter C = C(m) in algorithm 1.1 so that the LP-SVM classifier sgn(f_z) can approximate the Bayes rule f_c with satisfactory convergence rates (as m → ∞). Our approach provides clues to studying learning algorithms with penalty functionals different from the RKHS norm (Niyogi & Girosi, 1996; Evgeniou et al., 2000). It can be extended to schemes with general loss functions (Rosasco, De Vito, Caponnetto, Piana, & Verri, 2004; Lugosi & Vayatis, 2004; Wu et al., in press).

2 Main Results

In this article, we investigate learning rates: the decay of the excess misclassification error R(f_z) − R(f_c) as m and C(m) become large. Consider the QP-SVM classification algorithm f̃_z defined by equation 1.3. Steinwart (2002) showed that R(f̃_z) − R(f_c) → 0 (as m and C̃ = C̃(m) → ∞) when H_K is dense in C(X), the space of continuous functions on X with the norm ‖·‖_∞. Lugosi and Vayatis (2004) found that for the exponential loss, the excess misclassification error of regularized boosting algorithms can be estimated by the excess generalization error. An important result on the relation between the misclassification error and generalization error for a convex loss function is due to Zhang (2004). (See Bartlett, Jordan, & McAuliffe, 2003, and Wu, Ying, & Zhou, in press, for extensions to general
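As a concrete check of these definitions, take X uniform on [0, 1] with P(Y = 1|x) = x (our own toy distribution, not one from the paper). Then f_ρ(x) = 2x − 1, f_c = sgn(f_ρ), and R(f_c) = ∫_0^1 min(x, 1 − x) dx = 1/4, which a simulation reproduces:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
x = rng.uniform(0.0, 1.0, n)                  # rho_X : uniform on [0, 1]
y = np.where(rng.uniform(size=n) < x, 1, -1)  # P(Y = 1 | x) = x

f_rho = 2.0 * x - 1.0                         # regression function
f_c = np.where(f_rho >= 0, 1, -1)             # Bayes rule sgn(f_rho)
bayes_risk = np.mean(f_c != y)                # Monte Carlo estimate of R(f_c)
```

With 200,000 samples the estimate lands within about 0.01 of the true value 1/4.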
loss functions.) Here we consider the hinge loss V(y, f(x)) = (1 − y f(x))_+. The generalization error is defined as

E(f) = ∫_Z V(y, f(x)) dρ.

Note that f_c is a minimizer of E(f). Then Zhang's results assert that

R(f) − R(f_c) ≤ E(f) − E(f_c),   ∀f : X → R.   (2.1)
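Inequality 2.1 is easy to probe by simulation. Continuing the toy distribution P(Y = 1|x) = x on [0, 1] (our own running example), we compare the excess misclassification error and the excess generalization error of an arbitrary real-valued candidate f:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
x = rng.uniform(0.0, 1.0, n)
y = np.where(rng.uniform(size=n) < x, 1, -1)     # P(Y = 1 | x) = x

def hinge(y, fx):
    return np.maximum(0.0, 1.0 - y * fx)

f = 2.0 * x - 0.6                                # an arbitrary candidate function
f_c = np.where(2.0 * x - 1.0 >= 0, 1.0, -1.0)    # Bayes rule

excess_R = np.mean(np.where(f >= 0, 1, -1) != y) - np.mean(f_c != y)
excess_E = np.mean(hinge(y, f)) - np.mean(hinge(y, f_c))
```

For this f the true excess misclassification error is 0.04 while the excess generalization error is about 0.204, so the inequality holds with room to spare.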
Thus, the excess misclassification error R(f_z) − R(f_c) can be bounded by the excess generalization error E(f_z) − E(f_c), and the following error decomposition (Wu & Zhou, 2004) holds:

E(f_z) − E(f_c) ≤ {E(f_z) − E_z(f_z) + E_z(f_{K,C}) − E(f_{K,C})} + 𝒟(C).   (2.2)

Here, E_z(f) = (1/m) Σ_{i=1}^m V(y_i, f(x_i)). The function f_{K,C} depends on C and is defined as

f_{K,C} := arg min_{f ∈ H̄_K} { E(f) + (1/2C) ‖f*‖_K² },   C > 0.   (2.3)
The decomposition, equation 2.2, makes the error analysis for QP-SVM easy, similar to that in Niyogi and Girosi (1996). The second term of equation 2.2 measures the approximation power of H_K for ρ.

Definition 1. The regularization error of the system (K, ρ) is defined by

D(C) := inf_{f ∈ H̄_K} { E(f) − E(f_c) + (1/2C) ‖f*‖_K² }.   (2.4)

The regularization error for a regularizing function f_{K,C} ∈ H̄_K is defined as

𝒟(C) := E(f_{K,C}) − E(f_c) + (1/2C) ‖f*_{K,C}‖_K².   (2.5)
In Wu and Zhou (2004) we showed that E(f) − E(f_c) ≤ ‖f − f_c‖_{L¹_{ρ_X}}. Hence, the regularization error can be estimated by the approximation in a weighted L¹ space, as done in Smale and Zhou (2003) and Chen et al. (2004).

Definition 2. We say that the probability measure ρ can be approximated by H_K with exponent 0 < β ≤ 1 if there exists a constant c_β such that

H1:   D(C) ≤ c_β C^{−β},   ∀C > 0.
The first term of equation 2.2 is called the sample error. It has been well understood in learning theory by concentration inequalities (Vapnik, 1998; Devroye, Györfi, & Lugosi, 1997; Niyogi, 1998; Cucker & Smale, 2001; Bousquet & Elisseeff, 2002). The approaches developed in Barron (1990), Bartlett (1998), Niyogi and Girosi (1996), and Zhang (2004) separate the regularization error and the sample error concerning f̃_z. In particular, for the QP-SVM, Zhang (2004) proved that

E_{z∈Z^m}{E(f̃_z)} ≤ inf_{f ∈ H̄_K} { E(f) + (1/2C̃) ‖f*‖_K² } + 2C̃/m.   (2.6)
It follows that E_{z∈Z^m}{E(f̃_z) − E(f_c)} ≤ D(C̃) + 2C̃/m. When assumption H1 holds, Zhang's bound in connection with equation 2.1 yields E_{z∈Z^m}{R(f̃_z) − R(f_c)} = O(C̃^{−β}) + 2C̃/m. This is similar to some well-known bounds for the neural network learning algorithms (see, e.g., theorem 3.1 in Niyogi & Girosi, 1996). The best learning rate derived from equation 2.6, by choosing C̃ = m^{1/(β+1)}, is

E_{z∈Z^m}{R(f̃_z) − R(f_c)} = O(m^{−α}),   α = β/(β + 1).   (2.7)
Observe that the sample error bound 2C̃/m in equation 2.6 is independent of the kernel K or the distribution ρ. If some information about K or ρ is available, the sample error, and hence the excess misclassification error, can be improved. The information we need about K is the capacity measured by covering numbers.

Definition 3. Let F be a subset of a metric space. For any ε > 0, the covering number N(F, ε) is defined to be the minimal integer ℓ ∈ N such that there exist ℓ balls with radius ε covering F.

In this article, we use only the uniform covering number. Covering numbers measured by empirical distances are also used in the literature (van der Vaart & Wellner, 1996). (For comparisons, see Pontil, 2003.) Let B_R = {f ∈ H_K : ‖f‖_K ≤ R}. It is a subset of C(X), and the covering number is well defined. We denote the covering number of the unit ball B_1 as

N(ε) := N(B_1, ε),   ε > 0.   (2.8)
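For intuition, the covering number of a simple set can be computed exactly. Covering [0, 1] with balls of radius ε (a toy calculation of ours, not the RKHS ball of the text) needs ⌈1/(2ε)⌉ balls, so here N(F, ε) grows like 1/ε:

```python
import math

def interval_covering_number(eps):
    # Balls of radius eps in R are intervals of length 2*eps; placing their
    # centers 2*eps apart covers [0, 1], and no fewer intervals suffice.
    return math.ceil(1.0 / (2.0 * eps))
```

Definitions 4 below then classify kernels by how fast log N(ε) grows as ε shrinks.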
Definition 4. The RKHS H_K is said to have logarithmic complexity exponent s ≥ 1 if there exists a constant c_s > 0 such that

H2:   log N(ε) ≤ c_s (log(1/ε))^s.

It has polynomial complexity exponent s > 0 if there is some c_s > 0 such that

H2′:   log N(ε) ≤ c_s (1/ε)^s.
The uniform covering number has been extensively studied in learning theory. In particular, we know that for the gaussian kernel K(x, y) = exp{−|x − y|²/σ²} with σ > 0 on a bounded subset X of R^n, assumption H2 holds with s = n + 1 (see Zhou, 2002); if K is C^r with r > 0 (Sobolev smoothness), then assumption H2′ is valid with s = 2n/r (see Zhou, 2003). The information we need about ρ is a Tsybakov noise condition (Tsybakov, 2004).

Definition 5. Let 0 ≤ q ≤ ∞. We say that ρ has Tsybakov noise exponent q if there exists a constant c_q > 0 such that

H3:   P_X({x ∈ X : |f_ρ(x)| ≤ c_q t}) ≤ t^q,   ∀t > 0.
All distributions have at least noise exponent 0. Deterministic distributions (which satisfy |f_ρ(x)| ≡ 1) have the noise exponent q = ∞ with c_∞ = 1. Using the above conditions about K and ρ, Scovel and Steinwart (in press) showed that when assumptions H1, H2′, and H3 hold, for every ε > 0 and every δ > 0, with confidence 1 − δ,

R(f̃_z) − R(f_c) = O(m^{−α}),   α = 4β(q + 1) / ((2q + sq + 4)(1 + β)) − ε.   (2.9)
When no conditions are assumed for the distribution (i.e., q = 0) or s = 2 for the kernel (the worst case when empirical covering numbers are used; see van der Vaart and Wellner, 1996), the rate is reduced to α = β/(β + 1) − ε, arbitrarily close to Zhang's rate (2004) (see equation 2.7). Recently, Wu et al. (in press) improved the rate (see equation 2.9) and showed that under the same assumptions (H1, H2′, and H3), for every ε, δ > 0, with confidence 1 − δ,

R(f̃_z) − R(f_c) = O(m^{−α}),   α = min{ β(q + 1) / (β(q + 2) + (q + 1 − β)s/2) − ε,  2β/(β + 1) }.   (2.10)
When some condition is assumed for the kernel but not for the distribution, that is, s < 2 but q = 0, the rate 2.10 has power α = min{ β/(2β + (1 − β)s/2) − ε, 2β/(β + 1) }. This is better than equation 2.7 or 2.9 (or the rates given in Bartlett et al., 2003, and Blanchard, Bousquet, & Massart, 2004; see Chen et al., 2004, and Wu et al., in press, for detailed comparisons) if β < 1. This improvement is possible due to the projection operator.

Definition 6. The projection operator π is defined on the space of measurable functions f : X → R as

π(f)(x) = 1 if f(x) > 1;  π(f)(x) = −1 if f(x) < −1;  π(f)(x) = f(x) if −1 ≤ f(x) ≤ 1.

The idea of projections appeared in margin-based bound analysis, for example, Bartlett (1998), Lugosi and Vayatis (2004), Zhang (2002), and Anthony and Bartlett (1999). We used the projection operator for the purpose of bounding misclassification and generalization errors in Chen et al. (2004). It helped us get sharper bounds on the sample error: probability inequalities are applied to random variables involving the functions π(f_z) (bounded by 1), not to f_z (the corresponding bound increases to infinity as C becomes large). In this article, we apply the projection operator to the LP-SVM. Let us turn to our main goal: the LP-SVM classification algorithm f_z defined by equation 1.1. To our knowledge, the convergence of the algorithm has not been verified, even for distributions strictly separable by a universal kernel. What is the main difficulty in the error analysis? One difficulty lies in the error decomposition: nothing like equation 2.2 exists for LP-SVM in the literature. Bounds for the regularization or approximation error independent of z are not available. We do not know whether it can be bounded by a norm in the whole space H_K or a norm similar to those in Niyogi and Girosi (1996). In this article, we overcome the difficulty by means of a stepping-stone from QP-SVM to LP-SVM. Then we can provide error analysis for general distributions.
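The projection operator is just clipping to [−1, 1]. A short check (our own illustration) confirms the two properties that make it useful: it leaves the induced classifier sgn(f) unchanged, and it never increases the hinge loss:

```python
import numpy as np

def project(f):
    # pi(f): clip real-valued predictions to [-1, 1]
    return np.clip(f, -1.0, 1.0)

f = np.array([2.3, 0.4, -0.9, -4.0])   # arbitrary real-valued predictions
y = np.array([1, -1, -1, 1])           # arbitrary labels
pf = project(f)
```

Because π(f) is bounded by 1 while f_z can be arbitrarily large for large C, concentration inequalities applied to π(f_z) give sharper sample error bounds.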
In particular, explicit learning rates will be presented. To this end, we first make an error decomposition.

Theorem 1. Let C > 0, 0 < η ≤ 1, and f_{K,C} ∈ H̄_K. There holds

R(f_z) − R(f_c) ≤ 2ηR(f_c) + S(m, C, η) + 2𝒟(ηC),

where S(m, C, η) is the sample error defined by

S(m, C, η) := {E(π(f_z)) − E_z(π(f_z))} + (1 + η){E_z(f_{K,C}) − E(f_{K,C})}.   (2.11)
SVM Soft Margin Classifiers
1169
Theorem 1 will be proved in section 4. The term D(ηC) is the regularization error (Smale & Zhou, 2004), defined for an arbitrarily chosen regularizing function f_{K,C} by equation 2.5. In Chen et al. (2004), we showed that

D(C) ≥ κ̃²/(2C),   (2.12)

where κ̃ := E₀/(1 + κ), κ := sup_{x∈X} √K(x, x), and E₀ := inf_{b∈R} {E(b) − E(f_c)}. Also, κ̃ = 0 only for very special distributions. Hence, the decay of D(C) cannot be faster than O(1/C) in general. Thus, to have satisfactory convergence rates, C cannot be too small, and it usually takes the form m^τ for some τ > 0. The constant κ is the norm of the inclusion H_K ⊂ C(X):

‖f‖_∞ ≤ κ‖f‖_K,  ∀f ∈ H_K.   (2.13)
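The inclusion bound 2.13 can be checked numerically: for a gaussian kernel, κ = sup_x √K(x, x) = 1, and any f = Σ_i c_i K(·, x_i) ∈ H_K satisfies |f(x)| = |⟨f, K_x⟩_K| ≤ ‖f‖_K √K(x, x). A small sketch (the centers, coefficients, and bandwidth below are arbitrary illustrative choices, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
centers = rng.uniform(-1, 1, 8)
coeffs = rng.normal(size=8)
k = lambda a, b: np.exp(-((a[:, None] - b[None, :]) ** 2) / 0.2)  # gaussian kernel, K(x,x)=1

K = k(centers, centers)
rkhs_norm = np.sqrt(coeffs @ K @ coeffs)              # ||f||_K for f = sum_i c_i K(., x_i)
grid = np.linspace(-1, 1, 2001)
sup_norm = np.max(np.abs(k(grid, centers) @ coeffs))  # ||f||_inf approximated on a dense grid

kappa = 1.0  # sup_x sqrt(K(x,x)) for the gaussian kernel
assert sup_norm <= kappa * rkhs_norm + 1e-9
print(sup_norm, rkhs_norm)
```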
Next we focus on analyzing the learning rates. Since a uniform rate is impossible for all probability distributions, as shown in theorem 7.2 of Devroye et al. (1997), we need to consider subclasses. The choice of η is important in the upper bound in theorem 1. If the distribution is deterministic, that is, R(f_c) = 0, we may choose η = 1. When R(f_c) > 0, we must choose η = η(m) → 0 as m → ∞ in order to get a convergence rate. Of course, the latter choice may lead to a slightly worse rate. Thus, we will consider these two cases separately. The following proposition gives the bound for deterministic distributions.

Proposition 1. Suppose R(f_c) = 0. If f_{K,C} is a function in H_K satisfying ‖V(y, f_{K,C}(x))‖_∞ ≤ M, then for every 0 < δ < 1, with confidence 1 − δ, there holds

R(f_z) ≤ 32ε_{m,C} + (20M log(2/δ))/(3m) + 8D(C),

where, with constants c_s and c'_s depending on κ and s, ε_{m,C} is given by

ε_{m,C} = (22/m)[ log(2/δ) + c_s ( log(CM log(2/δ)) + log(mCD(C)) )^s ],  if assumption H2 holds;

ε_{m,C} = (35 log(2/δ))/(3m^{1/(1+s)}) + (c'_s + 1/3)^{1/(1+s)} ( (32CD(C))^{s/(1+s)} / m^{1/(1+s)} )( 1 + (C/(Mm))^{s/(1+s)} ),  if assumption H2′ holds.
1170
Q. Wu and D.-X. Zhou
Proposition 1 will be proved in section 6. As corollaries, we obtain learning rates for strictly separable distributions and for weakly separable distributions.

Definition 7. We say that ρ is strictly separable by H_K with margin γ > 0 if there is some function f_γ ∈ H_K such that ‖f_γ^*‖_K = 1 and yf_γ(x) ≥ γ almost everywhere.

For QP-SVM, the strictly separable case is well understood (see, e.g., Vapnik, 1998, and Cristianini & Shawe-Taylor, 2000). For LP-SVM, we have:

Corollary 1. If ρ is strictly separable by H_K with margin γ > 0 and assumption H2 holds, then

R(f_z) ≤ (704/m)[ log(2/δ) + c_s ( log m + log(1/(2γ²)) )^s ] + 4/(Cγ²).

In particular, this yields the learning rate O((log m)^s/m) by taking C = m/γ².

Proof. Take f_{K,C} = f_γ/γ. Then V(y, f_{K,C}(x)) ≡ 0, and D(C) equals (1/(2C))‖f_γ^*/γ‖_K² = 1/(2Cγ²). The conclusion follows from proposition 1 by choosing M = 0.

Remark 1. For strictly separable distributions, we verify the optimal rate when assumption H2 holds. Similar rates are true for more general kernels, but we omit the details here.

Definition 8. We say that ρ is (weakly) separable by H_K if there is some function f_sp ∈ H_K, called the separating function, such that ‖f_sp^*‖_K = 1 and yf_sp(x) > 0 almost everywhere. It has separating exponent θ ∈ (0, ∞] if for some γ_θ > 0, there holds

ρ_X( 0 < |f_sp(x)| < γ_θ t ) ≤ t^θ  for all t > 0.   (2.14)
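Condition 2.14 can be checked directly in simple cases. For the example discussed later in this section (X = [−1/2, 1/2], ρ_X the Lebesgue measure, and f_sp(x) = x up to normalization), one has ρ_X(0 < |f_sp(x)| < γ_θ t) = t with γ_θ = 1/2, so the separating exponent is θ = 1. A Monte Carlo sketch (uniform sampling stands in for ρ_X; sample sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-0.5, 0.5, 1_000_000)   # draws from rho_X = Lebesgue measure on [-1/2, 1/2]
f_sp = x                                 # separating function (up to normalization)
gamma, theta = 0.5, 1.0

for t in (0.2, 0.5, 0.9):
    mass = np.mean((0 < np.abs(f_sp)) & (np.abs(f_sp) < gamma * t))
    assert mass <= t ** theta + 1e-2     # condition 2.14, up to Monte Carlo error
    print(t, mass)                       # here the mass is in fact close to t itself
```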
Corollary 2. Suppose that ρ is separable by H_K with equation 2.14 valid.

(i) If assumption H2 holds, then

R(f_z) = O( (log m + log C)^s/m + C^{−θ/(θ+2)} ).

This gives the learning rate O((log m)^s/m) by taking C = m^{(θ+2)/θ}.
(ii) If assumption H2′ holds, then

R(f_z) = O( C^{2s/((1+s)(θ+2))}/m^{1/(1+s)} + ( C^{2s/((1+s)(θ+2))}/m^{1/(1+s)} )( C/m )^{s/(1+s)} + C^{−θ/(θ+2)} ).

This yields the learning rate O(m^{−θ/(sθ+2s+θ)}) by taking C = m^{(θ+2)/(sθ+2s+θ)}.

Proof. Take f_{K,C} = C^{1/(θ+2)} f_sp/γ_θ. By the definition of f_sp, we have yf_{K,C}(x) ≥ 0 almost everywhere. Hence, 0 ≤ V(y, f_{K,C}(x)) ≤ 1. Moreover,

E(f_{K,C}) = ∫_X ( 1 − C^{1/(θ+2)} |f_sp(x)|/γ_θ )₊ dρ_X ≤ ρ_X( 0 < |f_sp(x)| < γ_θ C^{−1/(θ+2)} ),

which is bounded by C^{−θ/(θ+2)}. Therefore, D(C) ≤ (1 + 1/(2γ_θ²)) C^{−θ/(θ+2)}. Then the conclusion follows from proposition 1 by choosing M = 1.

Example. Let X = [−1/2, 1/2] and ρ be the Borel probability measure on Z such that ρ_X is the Lebesgue measure on X and

f_ρ(x) = −1, if −1/2 ≤ x < 0;  1, if 0 < x < 1/2.

If we take the linear kernel K(x, y) = x · y, then θ = 1 and γ_θ = 1/2. Since assumption H2 is satisfied with s = 1, the learning rate is O(log m/m) by taking C = m³.

Remark 2. Condition 2.14 with θ = ∞ is exactly the definition of a strictly separable distribution, and γ_θ is the margin.

The choice of f_{K,C} and the regularization error play essential roles in getting our error bounds. They influence the strategy of choosing the regularization parameter (model selection) and determine the learning rates. For weakly separable distributions, we chose f_{K,C} to be a multiple of a separating function in corollary 2. For the general case, it can be the function defined by equation 2.3. We analyze learning rates for distributions having polynomially decaying regularization error, that is, assumption H1 with β ≤ 1. This is reasonable because of equation 2.12.

Theorem 2. Suppose that R(f_c) = 0 and assumptions H1 and H2′ hold with 0 < s < ∞ and 0 < β ≤ 1, respectively. Take C = m^ζ with ζ := min{ 1/(s+β), 2/(1+β) }.
Then for every 0 < δ < 1, there exists a constant c depending on s, β, and δ such that with confidence 1 − δ,

R(f_z) ≤ c m^{−α},  α = min{ 2β/(1+β), β/(s+β) }.
Next we consider general distributions satisfying the Tsybakov condition (Tsybakov, 2004).

Theorem 3. Assume assumptions H1, H2′, and H3 with 0 < s < ∞, 0 < β ≤ 1, and 0 ≤ q ≤ ∞. Take C = m^ζ with

ζ := min{ 2/(β+1), (q+1)(β+1)/( s(q+1) + β(q+2+qs+s) ) }.

For every ε > 0 and every 0 < δ < 1, there exists a constant c̃ depending on s, q, β, δ, and ε such that with confidence 1 − δ,

R(f_z) − R(f_c) ≤ c̃ m^{−α},  α = min{ 2β/(β+1), β(q+1)/( s(q+1) + β(q+2+qs+s) ) } − ε.

Remark 3. Since R(f_c) is usually small for a meaningful classification problem, the upper bound in theorem 1 tells us that the performance of LP-SVM is similar to that of QP-SVM. However, to have convergence rates, we need to choose η = η(m) → 0 as m becomes large. This makes our rate worse than that of QP-SVM. This is the case when the capacity index s is large. When s is very small, the rate is O(m^{−α}) with α close to min{ (q+1)/(q+2), 2β/(β+1) }, which coincides with rate 2.10 and is better than rates 2.7 or 2.9 for QP-SVM. As any C^∞ kernel satisfies assumption H2′ for an arbitrarily small s > 0 (Zhou, 2003), this is the case for the polynomial and gaussian kernels usually used in practice.

Remark 4. Here we use a stepping-stone from QP-SVM to LP-SVM, so the derived learning rates for the LP-SVM are essentially no worse than those of QP-SVM. It would be interesting to introduce different tools to get learning rates for the LP-SVM that are better than those of QP-SVM. Also, the choice of the trade-off parameter C in theorem 3 depends on the indices β (approximation), s (capacity), and q (noise condition). This gives a rate that is optimal for our approach. One can take other choices ζ > 0 (for C = m^ζ) independent of β, s, and q, and then derive learning rates according to the proof of theorem 3. But the derived rates are worse than the one stated in theorem 3. It would be of importance to give some methods for choosing C adaptively.
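The interplay between the indices in theorem 3 is easy to explore numerically. The helper below (a hypothetical function written for this note, not part of the paper) evaluates the exponent α of theorem 3, ignoring the arbitrarily small ε, and confirms the small-s limit quoted in remark 3:

```python
def alpha(s, beta, q):
    """Rate exponent alpha of theorem 3 (without the arbitrarily small epsilon)."""
    return min(2 * beta / (beta + 1),
               beta * (q + 1) / (s * (q + 1) + beta * (q + 2 + q * s + s)))

# As s -> 0 the second term tends to (q+1)/(q+2), as stated in remark 3.
for beta, q in [(0.5, 1.0), (1.0, 0.0), (0.8, 3.0)]:
    limit = min(2 * beta / (beta + 1), (q + 1) / (q + 2))
    assert abs(alpha(0.0, beta, q) - limit) < 1e-12
```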
Remark 5. When empirical covering numbers are used, the capacity index can be restricted to s ∈ [0, 2]. Similar learning rates can be derived, as done in Blanchard et al. (2004) and Wu et al. (in press).

3 Stepping-Stone

Recall that in equation 1.7, the penalty term Ω(f^*) is usually not a norm. This makes the scheme difficult to analyze. Since the solution f_z of the LP-SVM has a representation similar to f̂_z in QP-SVM, we expect close relations between these schemes. Hence, the latter may play a role in the analysis of the former. To this end, we need to estimate Ω(f̂_z^*), the ℓ¹-norm of the coefficients of the solution f̂_z to equation 1.4.

Lemma 1. For every Ĉ > 0, the function f̂_z defined by equations 1.3 and 1.4 satisfies

Ω(f̂_z^*) = Σ_{i=1}^m α̂_{i,z} ≤ Ĉ E_z(f̂_z) + ‖f̂_z^*‖_K².

Proof. The dual problem of the 1-norm soft margin SVM (Vapnik, 1998) tells us that the coefficients α̂_{i,z} in the expression 1.4 of f̂_z satisfy

0 ≤ α̂_{i,z} ≤ Ĉ/m  and  Σ_{i=1}^m α̂_{i,z} y_i = 0.   (3.1)

The definition of the loss function V implies that 1 − y_i f̂_z(x_i) ≤ V(y_i, f̂_z(x_i)). Then

Σ_{i=1}^m α̂_{i,z} − Σ_{i=1}^m α̂_{i,z} y_i f̂_z(x_i) ≤ Σ_{i=1}^m α̂_{i,z} V(y_i, f̂_z(x_i)).

Applying the upper bound for α̂_{i,z} in equation 3.1, we can bound the right side above as

Σ_{i=1}^m α̂_{i,z} V(y_i, f̂_z(x_i)) ≤ (Ĉ/m) Σ_{i=1}^m V(y_i, f̂_z(x_i)) = Ĉ E_z(f̂_z).

Applying the second relation in equation 3.1 yields Σ_{i=1}^m α̂_{i,z} y_i b̂_z = 0. It follows that

Σ_{i=1}^m α̂_{i,z} y_i f̂_z(x_i) = Σ_{i=1}^m α̂_{i,z} y_i ( f̂_z^*(x_i) + b̂_z ) = Σ_{i=1}^m α̂_{i,z} y_i f̂_z^*(x_i).

But f̂_z^*(x_i) = Σ_{j=1}^m α̂_{j,z} y_j K(x_i, x_j). We have

Σ_{i=1}^m α̂_{i,z} y_i f̂_z(x_i) = Σ_{i,j=1}^m α̂_{i,z} y_i α̂_{j,z} y_j K(x_i, x_j) = ‖f̂_z^*‖_K².
Hence, the bound for Ω(f̂_z^*) follows.¹

4 Error Decomposition

In this section, we estimate R(f_z) − R(f_c). Since sgn(π(f)) = sgn(f), we have R(f) = R(π(f)). Applying equation 2.1 to π(f), we obtain

R(f) − R(f_c) = R(π(f)) − R(f_c) ≤ E(π(f)) − E(f_c).   (4.1)

It is easy to see that V(y, π(f)(x)) ≤ V(y, f(x)). Hence,

E(π(f)) ≤ E(f)  and  E_z(π(f)) ≤ E_z(f).   (4.2)
We are in a position to prove theorem 1, which, by equation 4.1, is an easy consequence of the following result.

Proposition 2. Let C > 0, 0 < η ≤ 1, and f_{K,C} ∈ H_K. Then

E(π(f_z)) − E(f_c) + (1/C)Ω(f_z^*) ≤ 2ηR(f_c) + S(m, C, η) + 2D(ηC),

where S(m, C, η) is defined by equation 2.11.

Proof. Take f̂_z to be the solution of equation 1.4 with Ĉ = ηC. We see from the definition of f_z and equation 4.2 that

E_z(π(f_z)) + (1/C)Ω(f_z^*) − { E_z(f̂_z) + (1/C)Ω(f̂_z^*) } ≤ 0.

This enables us to decompose E(π(f_z)) + (1/C)Ω(f_z^*) as

E(π(f_z)) + (1/C)Ω(f_z^*) ≤ { E(π(f_z)) − E_z(π(f_z)) } + E_z(f̂_z) + (1/C)Ω(f̂_z^*).

Lemma 1 gives Ω(f̂_z^*) ≤ Ĉ E_z(f̂_z) + ‖f̂_z^*‖_K². But Ĉ = ηC. Hence,

E(π(f_z)) + (1/C)Ω(f_z^*) ≤ { E(π(f_z)) − E_z(π(f_z)) } + (1 + η) E_z(f̂_z) + (1/C)‖f̂_z^*‖_K².

Next, we use the function f_{K,C} to analyze the second part of the above bound and get

E_z(f̂_z) + (1/((1+η)C))‖f̂_z^*‖_K² ≤ E_z(f̂_z) + (1/(2Ĉ))‖f̂_z^*‖_K² ≤ E_z(f_{K,C}) + (1/(2Ĉ))‖f_{K,C}^*‖_K².

This bound can be written as

{ E_z(f_{K,C}) − E(f_{K,C}) } + { E(f_{K,C}) + (1/(2Ĉ))‖f_{K,C}^*‖_K² }.

Combining the above two steps, we find that E(π(f_z)) − E(f_c) + (1/C)Ω(f_z^*) is bounded by

{ E(π(f_z)) − E_z(π(f_z)) } + (1 + η){ E_z(f_{K,C}) − E(f_{K,C}) } + (1 + η){ E(f_{K,C}) − E(f_c) + (1/(2ηC))‖f_{K,C}^*‖_K² } + ηE(f_c).

By the fact E(f_c) = 2R(f_c) and the definition of D(C), we draw our conclusion.

¹ Yiming Ying pointed out to us that equality actually holds in lemma 1. This follows from the KKT conditions. But we need only the inequality here.

5 Probability Inequalities

In this section we give some probability inequalities. They modify the Bernstein inequality and extend our previous work in Chen et al. (2004), which was motivated by sample error estimates for the square loss (Barron, 1990; Bartlett, 1998; Cucker & Smale, 2001; Mendelson, 2002).

Recall the Bernstein inequality: let ξ be a random variable on Z with mean μ and variance σ², and suppose |ξ − μ| ≤ M. Then

Prob{ | μ − (1/m) Σ_{i=1}^m ξ(z_i) | > ε } ≤ 2 exp{ − mε² / ( 2(σ² + Mε/3) ) }.
The one-side Bernstein inequality holds without the leading factor 2.
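A quick Monte Carlo check of the two-sided Bernstein inequality for a bounded variable (a Bernoulli variable here; the sample size, threshold, and trial count are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)
m, mu, eps, M = 200, 0.3, 0.1, 1.0
sigma2 = mu * (1 - mu)                     # variance of Bernoulli(mu)
trials = rng.binomial(1, mu, size=(20_000, m)).mean(axis=1)
empirical = np.mean(np.abs(mu - trials) > eps)
bound = 2 * np.exp(-m * eps**2 / (2 * (sigma2 + M * eps / 3)))
assert empirical <= bound
print(empirical, bound)   # the empirical tail sits well below the Bernstein bound
```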
Proposition 3. Let ξ be a random variable on Z satisfying μ ≥ 0, |ξ − μ| ≤ M almost everywhere, and σ² ≤ cμ^τ for some 0 ≤ τ ≤ 2. Then for every ε > 0, there holds

Prob{ ( μ − (1/m) Σ_{i=1}^m ξ(z_i) ) / (μ^τ + ε^τ)^{1/2} > ε^{1−τ/2} } ≤ exp{ − mε^{2−τ} / ( 2(c + Mε^{1−τ}/3) ) }.

Proof. The one-side Bernstein inequality tells us that

Prob{ ( μ − (1/m) Σ_{i=1}^m ξ(z_i) ) / (μ^τ + ε^τ)^{1/2} > ε^{1−τ/2} } ≤ exp{ − mε^{2−τ}(μ^τ + ε^τ) / ( 2[ σ² + (M/3) ε^{1−τ/2} (μ^τ + ε^τ)^{1/2} ] ) }.

Since σ² ≤ cμ^τ, we have

σ² + (M/3) ε^{1−τ/2} (μ^τ + ε^τ)^{1/2} ≤ cμ^τ + (M/3) ε^{1−τ} (μ^τ + ε^τ) ≤ (μ^τ + ε^τ)( c + Mε^{1−τ}/3 ).

This yields the desired inequality.

Note that f_z depends on z and thus runs over a set of functions as z changes. We need a probability inequality concerning uniform convergence. Denote E g := ∫_Z g(z) dρ.

Lemma 2. Let 0 ≤ τ ≤ 1, M > 0, c ≥ 0, and let G be a set of functions on Z such that for every g ∈ G, E g ≥ 0, |g − E g| ≤ M, and E g² ≤ c(E g)^τ. Then for every ε > 0,

Prob{ sup_{g∈G} ( E g − (1/m) Σ_{i=1}^m g(z_i) ) / ( (E g)^τ + ε^τ )^{1/2} > 4ε^{1−τ/2} } ≤ N(G, ε) exp{ − mε^{2−τ} / ( 2(c + Mε^{1−τ}/3) ) }.

Proof. Let {g_j}_{j=1}^N ⊂ G with N = N(G, ε) be such that for every g ∈ G there is some j ∈ {1, …, N} satisfying ‖g − g_j‖_∞ ≤ ε. Then proposition 3 and a standard procedure (Cucker & Smale, 2001; Mukherjee, Rifkin, & Poggio, 2002; Chen et al., 2004) lead to the conclusion.
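Proposition 3 can also be spot-checked by simulation. A Bernoulli(μ) variable has σ² = μ(1 − μ) ≤ μ, so it satisfies the hypothesis with c = 1, τ = 1, and M = 1 (the sample size and ε below are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(6)
m, mu, eps = 200, 0.3, 0.05
c, tau, M = 1.0, 1.0, 1.0                  # sigma^2 = mu(1-mu) <= c * mu^tau
means = rng.binomial(1, mu, size=(50_000, m)).mean(axis=1)
lhs = (mu - means) / np.sqrt(mu**tau + eps**tau)
empirical = np.mean(lhs > eps ** (1 - tau / 2))
bound = np.exp(-m * eps ** (2 - tau) / (2 * (c + M * eps ** (1 - tau) / 3)))
assert empirical <= bound
print(empirical, bound)
```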
Remark 6. Various forms of probability inequalities using empirical covering numbers can be found in the literature. For simplicity, we give the current form in lemma 2, which is enough for our purpose.

Let us find the hypothesis space covering f_z when z runs over all possible samples. This is implemented in the following two lemmas. By the idea of bounding the offset from Wu and Zhou (2004) and Chen et al. (2004), we can prove the following.

Lemma 3. For any C > 0, m ∈ N, and z ∈ Z^m, we can find a solution f_z of equation 1.7 satisfying min_{1≤i≤m} |f_z(x_i)| ≤ 1. Hence, |b_z| ≤ 1 + ‖f_z^*‖_∞.

We shall always choose f_z as in lemma 3. In fact, the only restriction we need to make for the minimizer f_z is to choose α_i = 0 and b_z = y^*, that is, f_z(x) ≡ y^*, whenever y_i = y^* for all 1 ≤ i ≤ m with some y^* ∈ Y.

Lemma 4. For every C > 0, we have f_z^* ∈ H_K and ‖f_z^*‖_K ≤ κ Ω(f_z^*) ≤ κC.

Proof. It is trivial that f_z^* ∈ H_K. By the reproducing property 1.5,

‖f_z^*‖_K = ( Σ_{i,j=1}^m α_{i,z} α_{j,z} y_i y_j K(x_i, x_j) )^{1/2} ≤ κ ( Σ_{i,j=1}^m α_{i,z} α_{j,z} )^{1/2} = κ Ω(f_z^*).

Bounding the solution to equation 1.7 by the choice f = 0 + 0, we have E_z(f_z) + (1/C)Ω(f_z^*) ≤ E_z(0) + 0 = 1. This gives Ω(f_z^*) ≤ C and completes the proof.

By lemmas 3 and 4, we know that π(f_z) lies in

F_R := { π(f) : f ∈ B_R + [−(1 + κR), 1 + κR] },   (5.1)

with R = κC. The following lemma (Chen et al., 2004) gives the covering number estimate for F_R.

Lemma 5. Let F_R be given by equation 5.1 with R > 0. For any ε > 0, there holds

N(F_R, ε) ≤ ( 2(1 + κR)/ε + 1 ) N( ε/(2R) ).
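The key step in lemma 4, ‖f_z^*‖_K ≤ κ Ω(f_z^*), uses only |K(x_i, x_j)| ≤ κ² and α_{i,z} ≥ 0; it can be checked numerically with a gaussian kernel (for which κ = 1) and arbitrary nonnegative coefficients (all values below are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(-1, 1, 30)
alpha = rng.uniform(0, 1, 30)                 # nonnegative dual coefficients
y = rng.choice([-1.0, 1.0], 30)
K = np.exp(-(x[:, None] - x[None, :]) ** 2)   # gaussian kernel, kappa = 1

c = alpha * y                                  # coefficients of f* = sum_i alpha_i y_i K(., x_i)
rkhs_norm = np.sqrt(c @ K @ c)                 # ||f*||_K
omega = alpha.sum()                            # Omega(f*) = l1-norm of the coefficients
assert rkhs_norm <= 1.0 * omega + 1e-9         # lemma 4 with kappa = 1
```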
Using the function set F_R defined by equation 5.1, we set, for R > 0,

G_R := { V(y, f(x)) − V(y, f_c(x)) : f ∈ F_R }.   (5.2)

By lemma 5 and the additive property of the log function, we have:

Lemma 6. Let G_R be given by equation 5.2 with R > 0.

(i) If assumption H2 holds, then there exists a constant c_s > 0 such that

log N(G_R, ε) ≤ c_s ( log(R/ε) )^s.

(ii) If assumption H2′ holds, then there exists a constant c'_s > 0 such that

log N(G_R, ε) ≤ c'_s ( R/ε )^s.

The following lemma was proved by Scovel and Steinwart (2003) for general functions f : X → R. With the projection, f here has range [−1, 1], and a simpler proof is given.

Lemma 7. Assume assumption H3. For every function f : X → [−1, 1], there holds

E{ (V(y, f(x)) − V(y, f_c(x)))² } ≤ 8 (1/(2c_q))^{q/(q+1)} ( E(f) − E(f_c) )^{q/(q+1)}.

Proof. Since f(x) ∈ [−1, 1], we have V(y, f(x)) − V(y, f_c(x)) = y( f_c(x) − f(x) ). It follows that

E(f) − E(f_c) = ∫_X ( f_c(x) − f(x) ) f_ρ(x) dρ_X = ∫_X | f_c(x) − f(x) | | f_ρ(x) | dρ_X

and

E{ (V(y, f(x)) − V(y, f_c(x)))² } = ∫_X | f_c(x) − f(x) |² dρ_X.

Let t > 0, and separate the domain X into two sets: X_t⁺ := {x ∈ X : |f_ρ(x)| > c_q t} and X_t⁻ := {x ∈ X : |f_ρ(x)| ≤ c_q t}. On X_t⁺, we have |f_c(x) − f(x)|² ≤ 2|f_c(x) − f(x)| |f_ρ(x)|/(c_q t). On X_t⁻, we have |f_c(x) − f(x)|² ≤ 4. It follows from assumption H3 that

∫_X | f_c(x) − f(x) |² dρ_X ≤ 2( E(f) − E(f_c) )/(c_q t) + 4 ρ_X(X_t⁻) ≤ 2( E(f) − E(f_c) )/(c_q t) + 4t^q.

Choosing t = { ( E(f) − E(f_c) )/(2c_q) }^{1/(q+1)} yields the desired bound.

Take the function set G in lemma 2 to be G_R. Then a function g in G_R takes the form g(x, y) = V(y, π(f)(x)) − V(y, f_c(x)) with π(f) ∈ F_R. Obviously we have ‖g‖_∞ ≤ 2, E g = E(π(f)) − E(f_c), and (1/m) Σ_{i=1}^m g(z_i) = E_z(π(f)) − E_z(f_c). When assumption H3 is valid, lemma 7 tells us that E g² ≤ c(E g)^τ with τ = q/(q+1) and c = 8(1/(2c_q))^{q/(q+1)}. Applying lemma 2 and solving the equation

log N(G_R, ε) − mε^{2−τ} / ( 2(c + (2/3)ε^{1−τ}) ) = log δ,

we see the following corollary from lemmas 6 and 7.

Corollary 3. Let G_R be defined by equation 5.2 with R > 0, and let assumption H3 hold with 0 ≤ q ≤ ∞. For every 0 < δ < 1, with confidence at least 1 − δ, there holds

{ E(f) − E(f_c) } − { E_z(f) − E_z(f_c) } ≤ 4ε_{m,R} + 4 ε_{m,R}^{(q+2)/(2(q+1))} { E(f) − E(f_c) }^{q/(2(q+1))}

for all f ∈ F_R, where ε_{m,R} is given by

ε_{m,R} = [ 5( 8(1/(2c_q))^{q/(q+1)} + 1/3 )( log(1/δ) + c_s (log R + log m)^s ) / m ]^{(q+1)/(q+2)},  if assumption H2 holds;

ε_{m,R} = 8( 8(1/(2c_q))^{q/(q+1)} + 1/3 ) [ ( c'_s R^s / m )^{(q+1)/(q+2+qs+s)} + ( log(2/δ) / m )^{(q+1)/(q+2)} ],  if assumption H2′ holds.

6 Rate Analysis

Let us now prove the main results stated in section 2. We first prove proposition 1.

Proof of Proposition 1. Since R(f_c) = 0, V(y, f_c(x)) = 0 almost everywhere and E(f_c) = 0. Take η = 1 in proposition 2.
We first consider the random variable ξ = V(y, f_{K,C}(x)). Since 0 ≤ ξ ≤ M and Eξ = E(f_{K,C}) ≤ D(C), we have σ²(ξ) ≤ Eξ² ≤ M Eξ ≤ M D(C). Applying the one-side Bernstein inequality to ξ, we see, by solving the quadratic equation −mε²/(2(σ² + Mε/3)) = log(δ/2), that with probability 1 − δ/2,

E_z(f_{K,C}) − E(f_{K,C}) ≤ (2M log(2/δ))/(3m) + √( 2σ²(ξ) log(2/δ)/m ) ≤ (5M log(2/δ))/(3m) + D(C).   (6.1)

Next we estimate E(π(f_z)) − E_z(π(f_z)). By the definition of f_z, there holds

E_z(f_z) + (1/C)Ω(f_z^*) ≤ E_z(f̂_z) + (1/C)Ω(f̂_z^*).

According to lemma 1, the right side is bounded by 2( E_z(f̂_z) + (1/(2C))‖f̂_z^*‖_K² ). This in connection with the definition of f̂_z yields

E_z(f_z) + (1/C)Ω(f_z^*) ≤ 2( E_z(f̂_z) + (1/(2C))‖f̂_z^*‖_K² ) ≤ 2( E_z(f_{K,C}) + (1/(2C))‖f_{K,C}^*‖_K² ).

Since E(f_c) = 0, D(C) = E(f_{K,C}) + (1/(2C))‖f_{K,C}^*‖_K². It follows that

(1/C)Ω(f_z^*) ≤ 2( E_z(f_{K,C}) − E(f_{K,C}) + D(C) ).

Together with lemma 4 and equation 6.1, this tells us that with probability 1 − δ/2,

‖f_z^*‖_K ≤ κ Ω(f_z^*) ≤ R := 2κC( (5M log(2/δ))/(3m) + 2D(C) ).

As we are considering a deterministic case, assumption H3 holds with q = ∞ and c_∞ = 1. Recall the definition of G_R in equation 5.2. Corollary 3 with q = ∞ and R given as above implies that

E(π(f_z)) − E_z(π(f_z)) ≤ 4ε_{m,C} + 4√( ε_{m,C} E(π(f_z)) )

with confidence 1 − δ/2, where ε_{m,C} is defined in the statement. Putting the above two estimates into proposition 2, we have, with confidence 1 − δ,

E(π(f_z)) ≤ 4ε_{m,C} + 4√( ε_{m,C} E(π(f_z)) ) + (10M log(2/δ))/(3m) + 4D(C).

Solving the quadratic inequality for √(E(π(f_z))) leads to

E(π(f_z)) ≤ 32ε_{m,C} + (20M log(2/δ))/(3m) + 8D(C).

Then our conclusion follows from equation 4.1.

Finally, we turn to the proof of theorems 2 and 3. To this end, we need a bound for ‖f_{K,C}^*‖_K. According to the definition, (1/(2C))‖f_{K,C}^*‖_K² ≤ D(C). Then we have:

Lemma 8. For every C > 0, there hold ‖f_{K,C}^*‖_K ≤ (2CD(C))^{1/2} and ‖f_{K,C}‖_∞ ≤ 1 + 2κ(2CD(C))^{1/2}.
Proof of Theorem 2. Take f_{K,C} = f̃_{K,C}, the regularizing function given by equation 2.3, in proposition 1. Then by lemma 8, we may take M = 2 + 2κ(2CD(C))^{1/2}. Proposition 1 with assumption H2′ yields

R(f_z) ≤ c_{s,β,δ} [ C^{(1−β)s/(s+1)}/m^{1/(1+s)} + ( C^{(1−β)s/(s+1)}/m^{1/(1+s)} )( C^{(1+β)/2}/m )^{s/(1+s)} + C^{(1−β)/2}/m + C^{−β} ].

Take C = min{ m^{1/(s+β)}, m^{2/(1+β)} }. Then C^{(1+β)/2}/m ≤ 1, and the proof is complete.
Proof of Theorem 3. Denote Δ_z := E(π(f_z)) − E(f_c) + (1/C)Ω(f_z^*). Then we have Ω(f_z^*) ≤ CΔ_z. This in connection with lemma 4 yields

‖f_z^*‖_K ≤ κ Ω(f_z^*) ≤ κCΔ_z.   (6.2)

Take f_{K,C} = f̃_{K,Ĉ} with Ĉ = ηC in proposition 2. It tells us that

Δ_z ≤ 2ηR(f_c) + S(m, C, η) + 2D(ηC).

Set η = C^{−β/(β+1)}. Then Ĉ = ηC = C^{1/(β+1)}. By the fact R(f_c) ≤ 1/2 and assumption H1,

Δ_z ≤ S(m, C, η) + (1 + 2c_β) C^{−β/(β+1)}.   (6.3)

Recall expression 2.11 for S(m, C, η). Here f_{K,C} = f̃_{K,Ĉ}, so we have

S(m, C, η) = { (E(π(f_z)) − E(f_c)) − (E_z(π(f_z)) − E_z(f_c)) } + (1 + η){ (E_z(f̃_{K,Ĉ}) − E_z(f_c)) − (E(f̃_{K,Ĉ}) − E(f_c)) } + η{ E_z(f_c) − E(f_c) } =: S₁ + (1 + η)S₂ + ηS₃.

Take t ≥ 1 and C ≥ 1, to be determined later. For R ≥ 1, denote

W(R) := { z ∈ Z^m : ‖f_z^*‖_K ≤ R }.   (6.4)

For S₁, we apply corollary 3 with δ = e^{−t} ≤ 1/e. We know that there is a set V_R^{(1)} ⊂ Z^m of measure at most e^{−t} such that

S₁ ≤ c_{s,q} t [ (R^s/m)^{(q+1)/(q+2+qs+s)} + (R^s/m)^{(q+2)/(2(q+2+qs+s))} Δ_z^{q/(2(q+1))} ],  ∀z ∈ W(R) \ V_R^{(1)}.

Here c_{s,q} := 32( 8(1/(2c_q))^{q/(q+1)} + 1/3 )(c'_s + 1) ≥ 1 is a constant depending only on q and s.

To estimate S₂, consider ξ = V(y, f̃_{K,Ĉ}(x)) − V(y, f_c(x)) on (Z, ρ). By lemma 8, we have

‖f̃_{K,Ĉ}‖_∞ ≤ 1 + 2κ √(2Ĉ D(Ĉ)) ≤ 1 + 2κ √(2c_β) C^{(1−β)/(2(β+1))}.

Write ξ = ξ₁ + ξ₂, where

ξ₁ := V(y, f̃_{K,Ĉ}(x)) − V(y, π(f̃_{K,Ĉ})(x)),  ξ₂ := V(y, π(f̃_{K,Ĉ})(x)) − V(y, f_c(x)).

It is easy to check that 0 ≤ ξ₁ ≤ 2κ√(2c_β) C^{(1−β)/(2(β+1))}. Hence, σ²(ξ₁) is bounded by 2κ√(2c_β) C^{(1−β)/(2(β+1))} Eξ₁. Then the one-side Bernstein inequality with δ = e^{−t} tells us that there is a set V^{(2)} ⊂ Z^m of measure at most e^{−t} such that for every z ∈ Z^m \ V^{(2)}, there holds

(1/m) Σ_{i=1}^m ξ₁(z_i) − Eξ₁ ≤ 4κ√(2c_β) C^{(1−β)/(2(β+1))} t/(3m) + √( 2σ²(ξ₁) t/m ) ≤ 10κ√(2c_β) C^{(1−β)/(2(β+1))} t/(3m) + Eξ₁.

For ξ₂, by lemma 7,

σ²(ξ₂) ≤ 8 (1/(2c_q))^{q/(q+1)} (Eξ₂)^{q/(q+1)}.

But |ξ₂| ≤ 2. So the one-side Bernstein inequality tells us again that there is a set V^{(3)} ⊂ Z^m of measure at most e^{−t} such that for every z ∈ Z^m \ V^{(3)}, there holds

(1/m) Σ_{i=1}^m ξ₂(z_i) − Eξ₂ ≤ 4t/(3m) + √( 4σ²(ξ₂) t/m ) ≤ 4t/(3m) + ( 32(1/(2c_q))^{q/(q+1)} t/m )^{(q+1)/(q+2)} + Eξ₂.

Here we have used the following elementary inequality with b := (Eξ₂)^{q/(2q+2)} and a := ( 32(1/(2c_q))^{q/(q+1)} t/m )^{1/2}:

a·b ≤ ( (q+2)/(2q+2) ) a^{(2q+2)/(q+2)} + ( q/(2q+2) ) b^{(2q+2)/q},  ∀a, b > 0.

Combining the two estimates for ξ₁, ξ₂ with the fact that Eξ = Eξ₁ + Eξ₂ = E(f̃_{K,Ĉ}) − E(f_c) ≤ D(Ĉ) ≤ c_β C^{−β/(β+1)}, we see that

S₂ ≤ c_{q,β} t [ C^{(1−β)/(2(β+1))}/m + (1/m)^{(q+1)/(q+2)} + C^{−β/(β+1)} ],  ∀z ∈ Z^m \ V^{(2)} \ V^{(3)},

where c_{q,β} := 10κ√(2c_β)/3 + 4/3 + 32(1/(2c_q))^{q/(q+1)} + c_β is a constant depending only on q and β. The last term satisfies S₃ ≤ 1.

Putting the above three estimates for S₁, S₂, S₃ into equation 6.3, we find that for every z ∈ W(R) \ V_R^{(1)} \ V^{(2)} \ V^{(3)}, there holds

Δ_z ≤ 2c_{s,q} t (R^s/m)^{(q+1)/(q+2+qs+s)} + 8c_{q,β} t [ (1/m)^{(q+1)/(q+2)} + C^{−β/(β+1)} + (C^{1/2} + 1)/m ].   (6.5)

Here we have used another elementary inequality, for α = q/(2q+2) ∈ (0, 1) and x = Δ_z:

x ≤ a x^α + b, a, b, x > 0 ⟹ x ≤ max{ (2a)^{1/(1−α)}, 2b }.
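This self-bounding step can be verified by brute force over a grid of x values (the parameter ranges below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
for _ in range(100):
    a, b = rng.uniform(0.1, 5, 2)
    alph = rng.uniform(0.05, 0.95)
    x = np.linspace(1e-6, 1e4, 200_000)
    feasible = x[x <= a * x**alph + b]          # all grid points satisfying x <= a x^alpha + b
    cap = max((2 * a) ** (1 / (1 - alph)), 2 * b)
    assert feasible.max() <= cap                # every such x obeys the stated cap
```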
Now we can choose C to be

C := min{ m^{2/(β+1)}, m^{(q+1)(β+1)/( s(q+1)+β(q+2+qs+s) )} }.   (6.6)

It ensures that (1/m)^{(q+1)/(q+2)} ≤ C^{−β/(β+1)} and (1/m)^{(q+1)/(q+2+qs+s)} ≤ C^{−( s(q+1)+β(q+2+qs+s) )/( (β+1)(q+2+qs+s) )}. With this choice of C, equation 6.5 implies that, with a set V_R := V_R^{(1)} ∪ V^{(2)} ∪ V^{(3)} of measure at most 3e^{−t},

Δ_z ≤ C^{−β/(β+1)} [ 2c_{s,q} t ( C^{−1/(β+1)} R )^{s(q+1)/(q+2+qs+s)} + 24c_{q,β} t ],  ∀z ∈ W(R) \ V_R.   (6.7)

We shall finish our proof by using equations 6.2 and 6.7 iteratively. Start with the bound R = R^{(0)} := κC. Lemma 4 verifies W(R^{(0)}) = Z^m. At this first step, by equations 6.7 and 6.2, we have Z^m = W(R^{(0)}) ⊆ W(R^{(1)}) ∪ V_{R^{(0)}}, where

R^{(1)} := κC^{1/(β+1)} [ ( 2c_{s,q} t(κ+1) ) C^{ (β/(β+1)) · s(q+1)/(q+2+qs+s) } + 24c_{q,β} t ].

Now we iterate. For n = 2, 3, …, we derive from equations 6.7 and 6.2 that Z^m = W(R^{(0)}) ⊆ W(R^{(1)}) ∪ V_{R^{(0)}} ⊆ ⋯ ⊆ W(R^{(n)}) ∪ ∪_{j=0}^{n−1} V_{R^{(j)}}, where each set V_{R^{(j)}} has measure at most 3e^{−t} and the number R^{(n)} is given by

R^{(n)} = κC^{1/(β+1)} [ ( 2c_{s,q} t(κ+1) )^n C^{ (β/(β+1)) ( s(q+1)/(q+2+qs+s) )^n } + 24c_{q,β} t (κ+1)^n ].

Note that ε > 0 is fixed. We choose n₀ ∈ N to be large enough that

( s(q+1)/(q+2+qs+s) )^{n₀+1} ≤ ε / ( s + 2s/β + (q+2)/(q+1) ).

In the n₀th step of our iteration, we have shown that for z ∈ W(R^{(n₀)}),

‖f_z^*‖_K ≤ κC^{1/(β+1)} [ ( 2c_{s,q} t(κ+1) )^{n₀} C^{ (β/(β+1)) ( s(q+1)/(q+2+qs+s) )^{n₀} } + 24c_{q,β} t (κ+1)^{n₀} ].

This together with equation 6.5 gives

Δ_z ≤ c(s, q, β, ε) t^{n₀} max{ m^{−2β/(β+1)}, m^{−β(q+1)/( s(q+1)+β(q+2+qs+s) ) + ε} }.

This is true for z ∈ W(R^{(n₀)}) \ V_{R^{(n₀)}}. Since the set ∪_{j=0}^{n₀} V_{R^{(j)}} has measure at most 3(n₀+1)e^{−t}, we know that the set W(R^{(n₀)}) \ V_{R^{(n₀)}} has measure at least 1 − 3(n₀+1)e^{−t}. Note that E(π(f_z)) − E(f_c) ≤ Δ_z. Take t = log( 3(n₀+1)/δ ). Then the proof is finished by equation 4.1.
This together with equation 6.5 gives β(q +1) 2β z ≤ c(s, q , β, )t n0 max m− β+1 , m− s(q +1)+β(s+2+q s+s) + . This is true for z ∈ W(R(n0 ) ) \ VR(n0 ) . Since the set ∪nj 0= 0 VR( j) has measure at most 3(n0 + 1)e −t , we know that the set W(R(n0 ) ) \ VR(n0 ) has measure at least 1 − 3(n0 + 1)e −t . Note that E(π ( f z )) − E( f c ) ≤ z . Take t = log( 3(n0δ+1) ). Then the proof is finished by equation 4.1. Acknowledgments This work is partially supported by the Research Grants Council of Hong Kong (Project No. CityU 103704) and City University of Hong Kong (Project No. 7001442). References Anthony, M., & Bartlett, P. L. (1999). Neural network learning: Theoretical foundations. Cambridge: Cambridge University Press. Aronszajn, N. (1950). Theory of reproducing kernels. Trans. Amer. Math. Soc., 68, 337–404. Barron, A. R. (1990). Complexity regularization with applications to artificial neural networks. In G. Roussa (Ed.), Nonparametric functional estimation (pp. 561–576). Dordrecht: Kluwer. Bartlett, P. L. (1998). The sample complexity of pattern classification with neural networks: The size of the weights is more important than the size of the network. IEEE Trans. Inform. Theory, 44, 525–536. Bartlett, P. L., Jordan, M. I., & McAuliffe, J. D. (2003). Convexity, classification, and risk bounds. Unpublished manuscript. Blanchard, G., Bousquet, O., & Massart, P. (2004). Statistical performance of support vector machines. Unpublished manuscript. Boser, B. E., Guyon, I., & Vapnik, V. (1992). A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop of Computational Learning Theory (Vol. 5, pp. 144–152). Pittsburgh, PA: ACM.
Bousquet, O., & Elisseeff, A. (2002). Stability and generalization. J. Machine Learning Research, 2, 499–526. Bradley, P. S., & Mangasarian, O. L. (2000). Massive data discrimination via linear support vector machines. Optimization Methods and Software, 13, 1–10. Chen, D. R., Wu, Q., Ying, Y., & Zhou, D. X. (2004). Support vector machine soft margin classifiers: Error analysis. J. Machine Learning Research, 5, 1143–1175. Cortes, C., & Vapnik, V. (1995). Support-vector networks. Mach. Learning, 20, 273–297. Cristianini, N., & Shawe-Taylor, J. (2000). An introduction to support vector machines. Cambridge: Cambridge University Press. Cucker, F., & Smale, S. (2001). On the mathematical foundations of learning. Bull. Amer. Math. Soc., 39, 1–49. Devroye, L., Györfi, L., & Lugosi, G. (1997). A probabilistic theory of pattern recognition. New York: Springer-Verlag. Evgeniou, T., Pontil, M., & Poggio, T. (2000). Regularization networks and support vector machines. Adv. Comput. Math., 13, 1–50. Kecman, V., & Hadzic, I. (2000). Support vector selection by linear programming. Proc. IJCNN, 5, 193–198. Lugosi, G., & Vayatis, N. (2004). On the Bayes-risk consistency of regularized boosting methods. Ann. Statis., 32, 30–55. Mendelson, S. (2002). Improving the sample complexity using global data. IEEE Trans. Inform. Theory, 48, 1977–1991. Mukherjee, S., Rifkin, R., & Poggio, T. (2002). Regression and classification with regularization. In D. D. Denison, M. H. Hansen, C. C. Holmes, B. Mallick, & B. Yu (Eds.), Nonlinear estimation and classification (pp. 107–124). New York: Springer-Verlag. Niyogi, P. (1998). The informational complexity of learning. Norwell, MA: Kluwer. Niyogi, P., & Girosi, F. (1996). On the relationship between generalization error, hypothesis complexity, and sample complexity for radial basis functions. Neural Comp., 8, 819–842. Pedroso, J. P., & Murata, N. (2001). Support vector machines with different norms: Motivation, formulations and results.
Pattern Recognition Letters, 22, 1263–1272. Pontil, M. (2003). A note on different covering numbers in learning theory. J. Complexity, 19, 665–671. Rosasco, L., De Vito, E., Caponnetto, A., Piana, M., & Verri, A. (2004). Are loss functions all the same? Neural Comp., 16, 1063–1076. Scovel, C., & Steinwart, I. (in press). Fast rates for support vector machines. Smale, S., & Zhou, D. X. (2003). Estimating the approximation error in learning theory. Anal. Appl., 1, 17–41. Smale, S., & Zhou, D. X. (2004). Shannon sampling and function reconstruction from point values. Bull. Amer. Math. Soc., 41, 279–305. Steinwart, I. (2002). Support vector machines are universally consistent. J. Complexity, 18, 768–791. Tsybakov, A. B. (2004). Optimal aggregation of classifiers in statistical learning. Ann. Statis., 32, 135–166. van der Vaart, A. W., & Wellner, J. A. (1996). Weak convergence and empirical processes. New York: Springer-Verlag.
Vapnik, V. (1998). Statistical learning theory. New York: Wiley. Wahba, G. (1990). Spline models for observational data. Philadelphia: SIAM. Wu, Q., Ying, Y., & Zhou, D. X. (In press). Multi-kernel regularized classifiers. Wu, Q., & Zhou, D. X. (2004). Analysis of support vector machine classification. Manuscript submitted for publication. Zhang, T. (2002). Covering number bounds of certain regularized linear function classes. J. Machine Learning Research, 2, 527–550. Zhang, T. (2004). Statistical behavior and consistency of classification methods based on convex risk minimization. Ann. Statis., 32, 56–85. Zhou, D. X. (2002). The covering number in learning theory. J. Complexity, 18, 739–767. Zhou, D. X. (2003). Capacity of reproducing kernel spaces in learning theory. IEEE Trans. Inform. Theory, 49, 1743–1752.
Received April 30, 2004; accepted September 24, 2004.
LETTER
Communicated by Klaus-Robert Müller
Leave-One-Out Bounds for Support Vector Regression Model Selection Ming-Wei Chang
[email protected] Chih-Jen Lin
[email protected] Department of Computer Science and Information Engineering, National Taiwan University, Taipei 106, Taiwan
Minimizing bounds of leave-one-out errors is an important and efficient approach for support vector machine (SVM) model selection. Past research focuses on their use for classification but not regression. In this letter, we derive various leave-one-out bounds for support vector regression (SVR) and discuss the difference from those for classification. Experiments demonstrate that the proposed bounds are competitive with Bayesian SVR for parameter selection. We also discuss the differentiability of leave-one-out bounds. 1 Introduction Support vector machines (Boser, Guyon, & Vapnik, 1992; Cortes & Vapnik, 1995) have become a promising tool for classification and regression. Their success depends on the tuning of several parameters that affect the generalization error. A popular approach is to approximate the error by a bound that is a function of parameters. Then we search for parameters so that this bound is minimized. Past efforts focus on such bounds for classification, and the aim of this article is to derive bounds for regression. We first briefly introduce support vector regression (SVR). Given training vectors xi ∈ Rn , i = 1, . . . , l, and a vector y ∈ Rl as their target values, SVR solves min ∗
w,b,ξ,ξ
subject to
l l 1 T C C ξi2 + (ξ ∗ )2 w w+ 2 2 i=1 2 i=1 i
− − ξi∗ ≤ wT φ(xi ) + b − yi ≤ + ξi ,
(1.1) i = 1, . . . , l.
Data are mapped to a higher-dimensional space by the function φ, and an -insensitive loss function is used. We refer to this form as L2-SVR because a two-norm penalty term ξi2 + (ξi∗ )2 is used. As w may be a huge vector Neural Computation 17, 1188–1222 (2005)
© 2005 Massachusetts Institute of Technology
Leave-One-Out Bounds for Support Vector Regression
1189
variable after introducing the mapping function φ, in practice we solve the dual problem:

min_{α,α*}  (1/2) (α − α*)^T K̃ (α − α*) + ε Σ_{i=1}^{l} (α_i + α_i*) + Σ_{i=1}^{l} y_i (α_i − α_i*)
subject to  Σ_{i=1}^{l} (α_i − α_i*) = 0,
            0 ≤ α_i, α_i*,  i = 1, ..., l,     (1.2)

where K(x_i, x_j) = φ(x_i)^T φ(x_j) is the kernel function, K̃ = K + I/C, and I is the identity matrix. For optimal w and (α, α*), the primal-dual relationship shows

w = Σ_{i=1}^{l} (α_i* − α_i) φ(x_i),

so the approximate function is

f(x) = w^T φ(x) + b = −Σ_{i=1}^{l} K(x, x_i)(α_i − α_i*) + b.     (1.3)
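The primal-dual relationship and equation 1.3 can be checked numerically. The sketch below is not from the paper; it uses a linear kernel (φ(x) = x) and arbitrary nonnegative dual variables, and confirms that w built from Σ(α_i* − α_i)φ(x_i) gives the same prediction as the kernel form:

```python
import numpy as np

rng = np.random.default_rng(0)
l, n = 6, 3
X = rng.normal(size=(l, n))           # training vectors x_i (rows)
alpha = rng.uniform(0, 1, size=l)     # arbitrary nonnegative dual variables
alpha_star = rng.uniform(0, 1, size=l)
b = 0.7

# Primal weight vector from the primal-dual relationship:
# w = sum_i (alpha_i^* - alpha_i) phi(x_i), with phi(x) = x here.
w = (alpha_star - alpha) @ X

# Kernel form of the approximate function, equation 1.3:
# f(x) = -sum_i K(x, x_i)(alpha_i - alpha_i^*) + b, with K(x, x') = x^T x'.
x_new = rng.normal(size=n)
f_kernel = -np.dot(X @ x_new, alpha - alpha_star) + b
f_primal = w @ x_new + b
print(abs(f_kernel - f_primal))
```

The two values agree up to rounding error for any choice of (α, α*), since equation 1.3 is an identity, not a property of the optimum.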
More general information about SVR can be found in Smola and Schölkopf (2004). One difficulty for parameter selection, compared with classification, is that SVR possesses an additional parameter ε, so the search space of parameters is bigger than that for classification. Some work has tried to address SVR parameter selection. Momma and Bennett (2002) perform model selection by pattern search, so the number of parameters checked is smaller than that of a full grid search. Kwok (2001) and Smola, Murata, Schölkopf, and Müller (1998) analyze the behavior of ε and conclude that the optimal ε scales linearly with the input noise of the training data. However, this property can be applied only when the noise is known. Gao, Gunn, Harris, and Brown (2002) derive a Bayesian framework for SVR that leads to minimizing a function of the parameters. However, its performance is not very good compared to a full grid search (Lin & Weng, 2004). An improvement is in Chu, Keerthi, and Ong (2004), which modified the standard SVR formulation. This improved Bayesian SVR will be compared to our approach in this letter.

This article is organized as follows. Section 2 briefly reviews leave-one-out bounds for support vector classification. We derive various leave-one-out bounds for SVR in section 3. Implementation issues are in section 4, and experiments are in section 5. Conclusions are in section 6.
M.-W. Chang and C.-J. Lin
2 Leave-One-Out Bounds for Classification: A Review

Given training vectors x_i ∈ R^n, i = 1, ..., l, and a vector y ∈ R^l such that y_i ∈ {1, −1}, an SVM formulation for two-class classification is

min_{w,b,ξ}  (1/2) w^T w + (C/2) Σ_{i=1}^{l} ξ_i^2
subject to  y_i (w^T φ(x_i) + b) ≥ 1 − ξ_i,  ξ_i ≥ 0,  i = 1, ..., l.     (2.1)
Next, we briefly review two leave-one-out bounds.

2.1 Radius Margin Bound for Classification. By defining

w̃ ≡ (w; √C ξ),     (2.2)

Vapnik and Chapelle (2000) have shown that the following radius margin (RM) bound holds:

loo ≤ 4 R̃^2 ‖w̃‖^2 = 4 R̃^2 e^T α,     (2.3)

where loo is the number of leave-one-out errors and e is a vector of all ones. In equation 2.3, α is the solution of the following dual problem:

max  e^T α − (1/2) α^T (Q + I/C) α
subject to  0 ≤ α_i,  i = 1, ..., l,
            y^T α = 0,     (2.4)

where Q_ij = y_i y_j K(x_i, x_j). At the optimum, ‖w̃‖^2 = e^T α. Define

φ̃(x_i) ≡ (φ(x_i); e_i/√C),

where e_i is a zero vector of length l except that the ith component is one. Then R̃ in equation 2.3 is the radius of the smallest sphere containing all φ̃(x_i), i = 1, ..., l. The right-hand side of equation 2.3 is a function of the parameters, which is then minimized for parameter selection.
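The augmented map φ̃ gives a concrete way to see why K̃ = K + I/C: the extra coordinates e_i/√C contribute 1/C to the inner product only when i = j. A minimal sketch, assuming a linear feature map and small random data (not from the paper's experiments):

```python
import numpy as np

rng = np.random.default_rng(1)
l, n, C = 5, 4, 2.0
X = rng.normal(size=(l, n))

K = X @ X.T                      # K(x_i, x_j) = phi(x_i)^T phi(x_j), phi = identity
K_tilde = K + np.eye(l) / C      # the matrix K~ = K + I/C used in the dual

# Build phi~(x_i) = (phi(x_i); e_i / sqrt(C)) explicitly and compare.
Phi_tilde = np.hstack([X, np.eye(l) / np.sqrt(C)])
K_from_embedding = Phi_tilde @ Phi_tilde.T
print(np.allclose(K_tilde, K_from_embedding))
```

The same identity is what lets R̃ be computed from K̃ alone, without forming φ̃ explicitly.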
2.2 Span Bound for Classification. The span bound, another leave-one-out bound proposed in Vapnik and Chapelle (2000), is tighter than the RM bound. Define S_t^2 as the optimal objective value of the following problem:

min_λ  ‖φ̃(x_t) − Σ_{i∈F\{t}} λ_i φ̃(x_i)‖^2
subject to  Σ_{i∈F\{t}} λ_i = 1,     (2.5)

where F = {i | α_i > 0} is the index set of free components of an optimal α of equation 2.4. Under the assumption that the set of support vectors remains the same during the leave-one-out procedure, the span bound is

Σ_{t=1}^{l} α_t S_t^2.     (2.6)

Equation 2.5 indicates that S_t is smaller than 2R̃, the diameter of the smallest sphere containing all φ̃(x_i). Thus, equation 2.6 is tighter than equation 2.3. Unfortunately, S_t^2 is not a continuous function (Chapelle, Vapnik, Bousquet, & Mukherjee, 2002), so a modified span bound is proposed:

Σ_{t=1}^{l} α_t S̃_t^2,     (2.7)
where S̃_t^2 is the optimal objective value of

min_λ  ‖φ̃(x_t) − Σ_{i∈F\{t}} λ_i φ̃(x_i)‖^2 + η Σ_{i∈F\{t}} λ_i^2/α_i
subject to  Σ_{i∈F\{t}} λ_i = 1.     (2.8)

Here η is a positive parameter that controls the smoothness of the bound. From equations 2.5 and 2.8, S_t^2 ≤ S̃_t^2, so equation 2.7 is also a leave-one-out bound. Define D as an l × l diagonal matrix with D_ii = η/α_i and D_ij = 0 for i ≠ j. Define a new kernel matrix K̃ with

K̃(x_i, x_j) = φ̃(x_i)^T φ̃(x_j).

We let

D̃ = [ D_FF  0_F ; 0_F^T  0 ]   and   M = [ K̃_FF  e_F ; e_F^T  0 ],     (2.9)
where K̃_FF is the submatrix of K̃ corresponding to free support vectors and e_F (0_F) is a vector of |F| ones (zeros). By defining

M̃ = M + D̃,     (2.10)

Chapelle et al. (2002) showed that

S̃_t^2 = K̃(x_t, x_t) − h^T (M̃^t)^{−1} h = 1/(M̃^{−1})_tt − D̃_tt,     (2.11)
where M̃^t is the submatrix of M̃ with the tth column and row removed, and h is the tth column of M̃ excluding M̃_tt. Note that Chapelle et al. (2002) did not give a formal proof of the continuity of equation 2.7. We address this issue in section 4.1.

3 Leave-One-Out Bounds for Regression

First, the Karush-Kuhn-Tucker (KKT) optimality condition of equation 1.2 is listed here for further analysis: a vector α is optimal for equation 1.2 if and only if it satisfies the constraints of equation 1.2 and there is a scalar b such that

−(K̃(α − α*))_i + b = y_i + ε,  if α_i > 0,
−(K̃(α − α*))_i + b = y_i − ε,  if α_i* > 0,
y_i − ε ≤ −(K̃(α − α*))_i + b ≤ y_i + ε,  if α_i = α_i* = 0,     (3.1)

where (K̃(α − α*))_i is the ith element of K̃(α − α*). From equation 3.1, α_i α_i* = 0 when the KKT condition holds. General discussion of KKT conditions can be found in optimization books (e.g., Bazaraa, Sherali, & Shetty, 1993).

3.1 Radius Margin Bound for Regression. To study the leave-one-out error for SVR, we introduce the leave-one-out problem without the tth data point:

min_{w^t, b^t, ξ^t, ξ^{t*}}  (1/2) (w^t)^T w^t + (C/2) Σ_{i≠t} (ξ_i^t)^2 + (C/2) Σ_{i≠t} (ξ_i^{t*})^2
subject to  −ε − ξ_i^{t*} ≤ (w^t)^T φ(x_i) + b^t − y_i ≤ ε + ξ_i^t,
            i = 1, ..., t − 1, t + 1, ..., l.     (3.2)

Though ξ^t and ξ^{t*} are vectors with l − 1 elements, we define ξ_t^t = ξ_t^{t*} = 0 to
make them have l elements. The approximate function of equation 3.2 is f^t(x) = (w^t)^T φ(x) + b^t, so the leave-one-out error for SVR is defined as

loo ≡ Σ_{t=1}^{l} |f^t(x_t) − y_t|.     (3.3)
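Equation 3.3 can be evaluated by brute force: solve l SVR problems, each with one point held out. The sketch below is not the paper's implementation; it solves a simplified L2-SVR dual by projected gradient, absorbing the bias b into the kernel by adding a constant 1 (a common trick that removes the equality constraint), and uses toy synthetic data. All names here are hypothetical.

```python
import numpy as np

def rbf(Xa, Xb, sigma2=1.0):
    # RBF kernel K(x, x') = exp(-||x - x'||^2 / (2 sigma^2))
    d2 = ((Xa[:, None, :] - Xb[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma2))

def l2_svr_fit(X, y, C=10.0, eps=0.1, iters=4000):
    """Simplified L2-SVR dual (equation 1.2 with the bias absorbed into
    the kernel), minimized by projected gradient: clip alpha, alpha* at 0."""
    l = len(y)
    H = rbf(X, X) + 1.0 + np.eye(l) / C       # K~ plus the bias trick
    a, a_star = np.zeros(l), np.zeros(l)
    step = 1.0 / (2 * np.linalg.norm(H, 2))   # safe step size
    for _ in range(iters):
        d = H @ (a - a_star)
        a = np.maximum(0.0, a - step * (d + eps + y))
        a_star = np.maximum(0.0, a_star - step * (-d + eps - y))
    return a, a_star

def l2_svr_predict(X, a, a_star, Xnew):
    # f(x) = sum_i (alpha_i^* - alpha_i)(K(x, x_i) + 1)
    return (rbf(Xnew, X) + 1.0) @ (a_star - a)

# Brute-force leave-one-out error, equation 3.3.
rng = np.random.default_rng(2)
X = rng.uniform(-2, 2, size=(15, 1))
y = np.sin(X[:, 0])
loo = 0.0
for t in range(len(y)):
    mask = np.arange(len(y)) != t
    a, a_star = l2_svr_fit(X[mask], y[mask])
    loo += abs(l2_svr_predict(X[mask], a, a_star, X[t:t+1])[0] - y[t])
print(loo)
```

This l-fold refitting is exactly the cost the bounds in this section are designed to avoid: they estimate loo from a single fit.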
The leave-one-out error is well defined if the approximate function is unique. Note that though w (and w^t) is unique due to the strictly convex term w^T w (and (w^t)^T w^t), multiple b (or b^t) are possible (see, e.g., the discussion in Lin, 2001a). Therefore, we make the following assumption:

Assumption 1. Equation 1.2 and the dual problem of equation 3.2 all have free support vectors. We say a dual SVR has free support vectors if (α, α*) is optimal and there are some i such that α_i > 0 or α_i* > 0 (i.e., α_i + α_i* > 0, as α_i α_i* = 0).

Under this assumption, equations 1.3 and 3.1 imply w^T φ(x_i) + b = y_i + ε if α_i > 0 (or = y_i − ε if α_i* > 0). As the optimal w is unique, so is b. Similarly, b^t is unique as well. We then introduce a useful lemma:

Lemma 1. Under assumption 1,
1. If α_t > 0, then f^t(x_t) ≥ y_t.
2. If α_t* > 0, then f^t(x_t) ≤ y_t.

The proof is in appendix A. If α_t > 0, the KKT condition 3.1 implies that f(x_t) is also larger than or equal to y_t. Thus, this lemma reveals the relative position of f^t(x_t) with respect to f(x_t) and y_t. The next lemma gives an error bound on each individual leave-one-out test.

Lemma 2. Under assumption 1,
1. If α_t = α_t* = 0, then |f^t(x_t) − y_t| = |f(x_t) − y_t| ≤ ε.
2. If α_t > 0, then f^t(x_t) − y_t ≤ 4 R̃^2 α_t + ε.
3. If α_t* > 0, then y_t − f^t(x_t) ≤ 4 R̃^2 α_t* + ε.
The proof is in appendix B. When α_t > 0, |f^t(x_t) − y_t| = f^t(x_t) − y_t from lemma 1, and it then follows from lemma 2 that |f^t(x_t) − y_t| ≤ 4 R̃^2 α_t + ε. After extending this argument to the cases α_t* > 0 and α_t = α_t* = 0, we have:
Theorem 1. Under assumption 1, the leave-one-out error, equation 3.3, is bounded by

4 R̃^2 e^T (α + α*) + l ε.     (3.4)

We now discuss the difference between proving bounds for classification and for regression. In classification, the RM bound 2.3 comes from the following derivation: if the tth training point is wrongly classified during the leave-one-out procedure, then

1 ≤ 4 α_t R̃^2,     (3.5)

where α_t is the tth element of the optimal solution of equation 2.4. If the point is correctly classified, its leave-one-out error is zero and still smaller than 4 α_t R̃^2. Therefore, 4 R̃^2 e^T α is larger than the number of leave-one-out errors. On the other hand, since there are no "incorrectly classified" data in regression, we use lemma 2 instead of equation 3.5, and lemma 1 is required.

3.2 Span Bound for L2-SVR. Similar to the above discussion, we can obtain the span bound for L2-SVR:

Theorem 2. Under the same assumptions as theorem 1, plus the assumption that the set of support vectors remains the same during the leave-one-out procedure, the leave-one-out error of L2-SVR is bounded by

Σ_{t=1}^{l} (α_t + α_t*) S_t^2 + l ε,     (3.6)

where S_t^2 is the optimal objective value of equation 2.5 with F replaced by {i | α_i + α_i* > 0}.

The proof is in appendix C. As in the case of classification, equation 3.6 may not be continuous, so we propose a modification similar to equation 2.7:

Σ_{t=1}^{l} (α_t + α_t*) S̃_t^2 + l ε.     (3.7)
Here S̃_t^2 is the optimal objective value of

min_λ  ‖φ̃(x_t) − Σ_{i∈F\{t}} λ_i φ̃(x_i)‖^2 + η Σ_{i∈F\{t}} λ_i^2/(α_i + α_i*)
subject to  Σ_{i∈F\{t}} λ_i = 1,     (3.8)
where η is a positive parameter and F = {i | α_i + α_i* > 0}. The calculation of S̃_t^2 is the same as in equation 2.11.

3.3 LOO Bounds for L1-SVR. L1-SVR is another commonly used form for regression. It considers the following objective function,

min_{w,b,ξ,ξ*}  (1/2) w^T w + C Σ_{i=1}^{l} ξ_i + C Σ_{i=1}^{l} ξ_i*,     (3.9)

under the constraints of equation 1.1 and nonnegativity constraints on ξ, ξ*: ξ_i ≥ 0, ξ_i* ≥ 0, i = 1, ..., l. The name "L1" comes from the linear loss function. The dual problem is

min_{α,α*}  (1/2) (α − α*)^T K (α − α*) + ε Σ_{i=1}^{l} (α_i + α_i*) + Σ_{i=1}^{l} y_i (α_i − α_i*)
subject to  Σ_{i=1}^{l} (α_i − α_i*) = 0,  0 ≤ α_i, α_i* ≤ C,  i = 1, ..., l.     (3.10)
Two main differences between equations 1.2 and 3.10 are that K̃ is replaced by K and that α_i, α_i* are upper-bounded by C. To derive leave-one-out bounds, we still require assumption 1. With some modifications in the proof (details are in section D.1), lemma 1 still holds. For lemma 2, the results are different, as now α_t, α_t* ≤ C and ξ_t plays a role:

Lemma 3. 
1. If α_t = α_t* = 0, then |f^t(x_t) − y_t| = |f(x_t) − y_t| ≤ ε.
2. If α_t > 0, then f^t(x_t) − y_t ≤ 4 R^2 α_t + ξ_t + ξ_t* + ε.
3. If α_t* > 0, then y_t − f^t(x_t) ≤ 4 R^2 α_t* + ξ_t + ξ_t* + ε.
The proof is in section D.2. Note that R is now the radius of the smallest sphere containing all φ(x_i), i = 1, ..., l. Using lemmas 1 and 3, the bound is

4 R^2 e^T (α + α*) + e^T (ξ + ξ*) + l ε.

Regarding the span bound, the proof of theorem 2 still holds. However, S_t^2 is redefined as the optimal objective value of the following problem:

min_λ  ‖φ(x_t) − Σ_{i∈F\{t}} λ_i φ(x_i)‖^2
subject to  Σ_{i∈F\{t}} λ_i = 1,     (3.11)
where F = {i | 0 < α_i + α_i* < C}. Then a leave-one-out bound is

Σ_{t=1}^{l} (α_t + α_t*) S_t^2 + Σ_{t=1}^{l} (ξ_t + ξ_t*) + l ε.     (3.12)
4 Implementation Issues

In the rest of this article, we consider only leave-one-out bounds using L2-SVR.

4.1 Continuity and Differentiability. To use the bounds, α and α* must be well-defined functions of the parameters. That is, we need the uniqueness of the optimal dual solution. As K̃ contains the term I/C and hence is positive definite, α and α* are unique (Chang & Lin, 2002, lemma 4). To discuss continuity and differentiability, we make an assumption about the kernel function:

Assumption 2. The kernel function is differentiable with respect to the parameters.

For continuity, we know that the span bound is not continuous, but the others are:

Theorem 3.
1. (α, α*) and R̃^2 are continuous, and so is the radius margin bound.
2. The modified span bound, equation 3.7, is continuous.

The proof is in appendix E. To minimize leave-one-out bounds, differentiability is important, as we may have to calculate the gradient. Unfortunately, leave-one-out bounds for L2-SVR are not differentiable. An example for the radius margin bound is in appendix F. This situation is different from classification, where the radius margin bound for L2-SVM is differentiable (see more discussion in Chung, Kao, Sun, Wang, & Lin, 2003). However, we may still use gradient-based methods, as gradients exist almost everywhere.

Theorem 4. The radius margin and modified span bounds are differentiable almost everywhere. If, around a given parameter set, the zero and nonzero elements of (α, α*) remain the same, then the bounds are differentiable at this parameter set.

The proof is in appendix G. The above discussion applies to bounds for classification as well. For differentiable points, we calculate gradients in section 4.2.
4.2 Gradient Calculation. To obtain the gradient of the leave-one-out bounds, we need the gradients of α + α*, R̃^2, and S̃_t^2.

4.2.1 Gradient of α + α*. Define α̂_F ≡ α*_F − α_F and recall the definition of M in equation 2.9. For free support vectors, the KKT optimality conditions 3.1 imply that

M [α̂_F; b] = [p; 0],

where

p_i = y_i − ε  if α̂_i > 0,
p_i = y_i + ε  if α̂_i < 0.

We have

∂(α_i + α_i*)/∂θ = z_i ∂α̂_i/∂θ,     (4.1)

where

z_i = 1  if α̂_i > 0,
z_i = −1  if α̂_i < 0.

Except for ε, all other parameters relate to M but not p, so for any such parameter θ,

(∂M/∂θ) [α̂_F; b] + M ∂[α̂_F; b]/∂θ = 0.

Thus,

∂[α̂_F; b]/∂θ = −M^{−1} [ ∂K̃/∂θ  0_F ; 0_F^T  0 ] [α̂_F; b] = −M^{−1} [ (∂K̃/∂θ) α̂_F ; 0 ].     (4.2)

If θ is ε,

∂[α̂_F; b]/∂ε = M^{−1} [ ∂p/∂ε ; 0 ],

and

∂p_i/∂ε = −1  if α̂_i > 0,
∂p_i/∂ε = 1  if α̂_i < 0.
4.2.2 Gradient of S̃_t^2. Now S̃_t^2 is defined as in equation 2.11, but with D_tt = η/(α_t + α_t*) for t ∈ F. Using equation 2.11,

∂S̃_t^2/∂θ = −(1/((M̃^{−1})_tt)^2) ∂(M̃^{−1})_tt/∂θ + (η/(α_t + α_t*)^2) ∂(α_t + α_t*)/∂θ.

Note that

∂(M̃^{−1})_tt/∂θ = (∂M̃^{−1}/∂θ)_tt = −(M̃^{−1} (∂M̃/∂θ) M̃^{−1})_tt,     (4.3)

and ∂(α_t + α_t*)/∂θ can be obtained using equations 4.1 and 4.2. Furthermore, in equation 4.3,

∂M̃/∂θ = [ ∂K̃/∂θ  0_F ; 0_F^T  0 ] + ∂D̃/∂θ,

where

(∂D̃/∂θ)_ii = −(η/(α_i + α_i*)^2) ∂(α_i + α_i*)/∂θ,  i ∈ F.
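Both identities used here — the two closed forms for S̃_t^2 in equation 2.11 and the derivative of a matrix inverse in equation 4.3 — can be checked numerically. A sketch with random matrices standing in for K̃_FF and D (all values below are synthetic, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(4)
m = 6                                    # |F|, number of free support vectors
A = rng.normal(size=(m, m))
K_FF = A @ A.T + 0.1 * np.eye(m)         # stand-in for K~_FF (positive definite)
d = rng.uniform(0.5, 1.5, size=m)        # stand-in for D_ii = eta/alpha_i

e = np.ones((m, 1))
M = np.block([[K_FF, e], [e.T, np.zeros((1, 1))]])
D_tilde = np.diag(np.append(d, 0.0))
M_tilde = M + D_tilde                    # equation 2.10

t = 2                                    # a free support vector index
keep = [i for i in range(m + 1) if i != t]
M_t = M_tilde[np.ix_(keep, keep)]        # M~ with row/column t removed
h = M_tilde[keep, t]                     # column t without the diagonal entry

# Equation 2.11: the two expressions agree.
lhs = K_FF[t, t] - h @ np.linalg.solve(M_t, h)
rhs = 1.0 / np.linalg.inv(M_tilde)[t, t] - D_tilde[t, t]
print(lhs, rhs)

# Equation 4.3: d(M~^{-1})/dtheta = -M~^{-1} (dM~/dtheta) M~^{-1}.
dM = np.diag(np.append(rng.normal(size=m), 0.0))   # some dM~/dtheta
hstep = 1e-6
numeric = (np.linalg.inv(M_tilde + hstep * dM)
           - np.linalg.inv(M_tilde - hstep * dM)) / (2 * hstep)
analytic = -np.linalg.inv(M_tilde) @ dM @ np.linalg.inv(M_tilde)
print(np.max(np.abs(numeric - analytic)))
```

The first equality follows from the block-inverse (Schur complement) formula, since M̃_tt = K̃(x_t, x_t) + D̃_tt for t ∈ F.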
4.2.3 Gradient of R̃^2. R̃^2 is the optimal objective value of

max_β  Σ_{i=1}^{l} β_i K̃(x_i, x_i) − β^T K̃ β
subject to  0 ≤ β_i,  i = 1, ..., l,
            Σ_{i=1}^{l} β_i = 1     (4.4)

(see, e.g., Vapnik, 1998). From Bonnans and Shapiro (1998), it is differentiable, and the gradient is

∂R̃^2/∂θ = Σ_{i=1}^{l} β_i ∂K̃(x_i, x_i)/∂θ − β^T (∂K̃/∂θ) β.
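Problem 4.4 is a concave quadratic maximized over the simplex, so it can be solved by, for example, a few Frank-Wolfe steps with exact line search. This is one convenient solver for a sketch, not necessarily the one used in the paper:

```python
import numpy as np

def radius_squared(K_tilde, iters=200):
    """Solve max_beta sum_i beta_i K~_ii - beta^T K~ beta over the simplex
    (equation 4.4) by the Frank-Wolfe method with exact line search."""
    l = K_tilde.shape[0]
    c = np.diag(K_tilde).copy()
    beta = np.full(l, 1.0 / l)
    for _ in range(iters):
        grad = c - 2 * K_tilde @ beta
        j = int(np.argmax(grad))
        d = -beta.copy()
        d[j] += 1.0                     # move toward the simplex vertex e_j
        gap = grad @ d
        if gap <= 1e-12:                # (near-)optimal: stop
            break
        curv = d @ K_tilde @ d
        step = 1.0 if curv <= 0 else min(1.0, gap / (2 * curv))
        beta += step * d
    return c @ beta - beta @ K_tilde @ beta

# Sanity check: two points at distance 2 in feature space, so the
# smallest enclosing sphere has radius 1 and R~^2 = 1.
V = np.array([[0.0, 0.0], [2.0, 0.0]])
K = V @ V.T
print(radius_squared(K))
```

For SVR model selection one would pass K̃ = K + I/C built from the RBF kernel; the two-point case above is only a check against the known answer.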
Table 1: Data Statistics.

Problem     n     l
pyrim      27    74
triazines  60   186
mpg         7   392
housing    13   566
add10      10  1000
cpusmall   12  1000
spacega     6  1000
abalone     8  1000

Note: n is the number of features and l is the number of data instances.
5 Experiments

In this section, different parameter selection methods, including the proposed bounds, are compared. We consider the same real data used in Lin and Weng (2004); some statistics are in Table 1. To have a reliable comparison, for each data set we randomly produce 30 training-testing splits. Each training set consists of four-fifths of the data, and the rest are for testing. Parameter selection using the different methods is applied to each training file. We then report the average and standard deviation of the 30 mean squared errors (MSE) on predicting the test sets. A method with lower MSE is better. We compare the two proposed bounds with three other parameter selection methods:

1. RM (L2-SVR): the radius margin bound (equation 3.4).
2. MSP (L2-SVR): the modified span bound (equation 3.7).
3. CV (L2-SVR): a grid search of parameters using fivefold cross-validation.
4. CV (L1-SVR): the same as the previous method, but L1-SVR is considered.
5. BSVR: a Bayesian framework that improves the smoothness of the evidence function using a modified SVR (Chu et al., 2004).

All methods except BSVR use the radial basis function (RBF) kernel,

K(x_i, x_j) = e^{−‖x_i − x_j‖^2/(2σ^2)},     (5.1)

where σ^2 is the kernel parameter. BSVR implements an extension of the RBF kernel,

K(x_i, x_j) = κ_0 e^{−‖x_i − x_j‖^2/(2σ^2)} + κ_b,     (5.2)
where κ_0 and κ_b are two additional kernel parameters. Both kernels satisfy assumption 2 on differentiability. Implementation details and experimental results are in the following subsections.

5.1 Implementations of Various Model Selection Methods. RM and MSP are differentiable almost everywhere, so we implement quasi-Newton, a gradient-based optimization method, to minimize them. The parameter η in the modified span bound, equation 2.7, is set to 0.1. Section 5.3 discusses the impact of using different η. Following most earlier work on minimizing leave-one-out bounds, we consider parameters in the log scale: (ln C, ln σ^2, ln ε). Thus, if f is the function of the parameters, the gradient is calculated by

∂f/∂ ln θ = θ ∂f/∂θ

and the formulas in section 4.2. Suppose θ is the parameter vector to be determined. The quasi-Newton method is an iterative procedure to minimize f(θ). If k is the index of the loop, the kth iteration for updating θ_k to θ_{k+1} is:

1. Compute a search direction p = −H_k ∇f(θ_k).
2. Find θ_{k+1} = θ_k + λp using a line search to ensure sufficient decrease.
3. Obtain H_{k+1} by

H_{k+1} = (I − s t^T/(t^T s)) H_k (I − t s^T/(t^T s)) + s s^T/(t^T s)  if t^T s > 0,
H_{k+1} = H_k  otherwise,     (5.3)

where

s = θ_{k+1} − θ_k  and  t = ∇f(θ_{k+1}) − ∇f(θ_k).

Here, H_k serves as the inverse of an approximate Hessian of f and is set to the identity matrix in the first iteration. The sufficient decrease required by the line search usually means

f(θ_k + λp) ≤ f(θ_k) + σ_1 λ ∇f(θ_k)^T p,     (5.4)
where 0 < σ_1 < 1 is a positive constant. We find the largest value λ in a set {γ^i | i = 0, 1, ...} such that equation 5.4 holds (γ = 1/2 is used in this article). We confine the search to a fixed region, so each parameter θ_i is associated with a lower bound l_i and an upper bound u_i. If in the quasi-Newton method θ_i^k + λp_i is not in [l_i, u_i], it is projected to the interval. For ln C and ln σ^2, we set l_i = −8 and u_i = 8. For ln ε, l_i = −8, but u_i = −1. We could not use a too large u_i, as a too large ε may cause all data to be in the ε-insensitive tube and hence α = α* = 0. Then assumption 1 does not hold, and the leave-one-out bound may not be valid.

Table 2: Mean and Standard Deviation of 30 MSEs (Using 30 Training-Testing Splits).

              RM             MSP            L2 CV          L1 CV          BSVR
Problem    Mean   STD     Mean   STD     Mean   STD     Mean   STD     Mean   STD
pyrim      0.015  0.010   0.007  0.008   0.007  0.007   0.007  0.007   0.007  0.008
triazines  0.042  0.005   0.021  0.006   0.021  0.005   0.023  0.008   0.021  0.007
mpg        8.156  1.598   7.045  1.682   7.122  1.809   7.146  1.924   6.894  1.856
housing    23.14  7.774   9.191  2.733   9.318  2.957   11.26  5.014   10.40  3.950
add10      6.491  1.675   1.945  0.254   1.820  0.182   1.996  0.194   2.298  0.256
cpusmall   34.02  13.33   14.57  4.692   14.73  3.754   15.63  5.344   16.17  5.740
spacega    0.037  0.007   0.013  0.001   0.012  0.001   0.013  0.001   0.014  0.001
abalone    10.69  1.551   5.071  0.678   5.088  0.646   5.247  0.806   5.514  0.912

More discussion on the use of the quasi-Newton method is in Chung et al. (2003). The initial point of the quasi-Newton method is (ln C, ln σ^2, ln ε) = (0, 0, −3). The minimization procedure stops when

‖∇f(θ_k)‖ < (1 + f(θ_k)) × 10^{−5}   or   (f(θ_{k−1}) − f(θ_k)) / f(θ_{k−1}) < 10^{−5}     (5.5)
happens. Each function (or gradient) evaluation involves solving an SVR, which is the main computational bottleneck. We use LIBSVM (Chang & Lin, 2001) as the underlying solver. For CV, we try 2312 parameter sets with (ln C, ln σ^2, ln ε) ∈ [−8, −7, ..., 8] × [−8, −7, ..., 8] × [−8, −7, ..., −1]. As in the case of using leave-one-out bounds, here we avoid considering too large an ε. The parameter set with the lowest fivefold CV error is used to train the model for testing. For BSVR, we directly use our gradient-based implementation with the same stopping condition, equation 5.5. Note that their evidence function is not differentiable either.

5.2 Experimental Results. Table 2 presents the mean and standard deviation of the 30 MSEs. CV (L1- and L2-SVR), MSP, and BSVR are similar, but RM is worse. In classification, Chapelle et al. (2002) showed that the radius margin and span bounds perform similarly. From our experiments on the radius margin bound, parameters are more sensitive in regression than in
Table 3: Average (ln C, ln σ^2, ln ε) of 30 Runs.

              RM                  MSP                 L2 CV               L1 CV
Problem    lnC  lnσ²  lnε     lnC  lnσ²  lnε     lnC  lnσ²  lnε     lnC  lnσ²  lnε
pyrim       4.0   1.0  −2.6    1.0   2.9  −8.0    6.4   3.1  −7.3    2.1   3.4  −6.6
triazines  −1.4  −0.6  −1.8   −0.8   3.0  −8.0    3.5   4.4  −7.2    0.6   4.0  −4.7
mpg        −0.7   0.4  −1.0    4.7   1.1  −7.1    3.6   0.6  −6.6    5.6   0.4  −1.7
housing    −1.5   1.1  −1.0    5.7   1.2  −6.9    6.6   1.6  −6.1    7.5   1.4  −1.7
add10       0.9   4.6  −1.0    8.0   3.3  −8.0    8.0   3.0  −7.8    7.9   2.9  −1.2
cpusmall   −1.1  −0.5  −1.0    7.9   1.6  −5.6    7.9   2.4  −7.0    8.0   1.7  −2.1
spacega    −6.5  −6.2  −1.7    7.4   3.1  −8.0    7.3   0.0  −6.8    6.0   1.1  −4.6
abalone    −8.0   7.0  −1.0    3.7   0.6  −8.0    4.1   0.9  −6.8    6.8   1.2  −2.1
Table 4: Computational Time (in Seconds) for Parameter Selection.

Problem        RM      MSP      L2 CV       L1 CV
pyrim          0.6      0.3       84.3       99.27
triazines      2.5      5.8      310.8      440.0
mpg           10.7     63.0     1536.7      991.59
housing       23.3    159.4     2368.9     1651.5
add10        160.6    957.1     6940.2     5312.7
cpusmall     204.7    931.8   10,073.7     6087.0
spacega       53.6    771.9     4246.9   15,514.42
abalone       68.0   1030.4   12,905.0     7523.3
classification. One possible reason is that the leave-one-out error for SVR is a continuous rather than a discrete measurement. The good performance of BSVR indicates that its Bayesian evidence function is accurate, but the use of a more general kernel function may also help. On the other hand, although MSP uses only the RBF kernel, it is competitive with BSVR. Note that as CV is conducted on a discrete set of parameters, sometimes its MSE is slightly worse than that of MSP or BSVR, which consider parameters in a continuous space.

Table 3 presents the parameters obtained by the different approaches. We do not give those of BSVR, as it considers more than three parameters. Clearly, these methods obtain quite distinct parameters even though they (except RM) give similar testing errors. This observation indicates that good parameters lie in a fairly wide region; in other words, SVM is sensitive to parameters but not too sensitive. Moreover, the different regions that RM and MSP arrive at cause the two approaches to have quite different running times. More details are in Table 4 and the discussion that follows.

To see the performance gain of the bound-based methods, we compare the computational time in Table 4. The experiments were done on a Pentium IV 2.8 GHz computer using the Linux operating system. We did not compare
Table 5: Number of Function and Gradient Evaluations of the Quasi-Newton Method (Average of 30 Runs).

                 RM             MSP
Problem     FEV    GEV     FEV    GEV
pyrim      34.7   13.5    32.2   13.5
triazines  29.8   10.4    30.0   13.2
mpg        38.2    5.0    21.6   11.7
housing    40.5    5.0    33.2   13.2
add10      44.0    4.96   27.8    5.4
cpusmall   43.9    5.0    35.6    6.66
spacega    28.6    9.96   30.1    9.93
abalone    19.0    3.0    19.9   10.3
the running time of BSVR, as the code is available only on MS Windows. Clearly, using bounds saves a significant amount of time. For RM and MSP, the quasi-Newton implementation requires many fewer SVR trainings than CV. Table 5 lists the average number of function and gradient evaluations of the quasi-Newton method. Note that the number of function evaluations is the same as the number of SVRs solved.

From Table 4, MSP is slower than RM even though they have similar numbers of function and gradient evaluations. Because they do not land in the same parameter region, their respective SVR training times differ; in other words, the individual SVR training time here is related to the parameters. MSP leads to a good region with smaller testing errors, but training SVRs with parameters in this region takes more time. From Table 4, the computational time of CV using L1- and L2-SVR is not close. Because they have different formulations (e.g., C in L1-SVR and C/2 in L2-SVR), we do not expect them to be very similar.

5.3 Discussion. The smoothing parameter η of the modified span bound was simply set to 0.1 for the experiments. It is important to check how η affects the performance of the bound. Figure 1 presents the relation between η and the test error. From equation 2.8, a large η causes the modified bound to move away from the original one; thus, the performance is worse, as shown in Figure 1. However, if η is reasonably small, the performance is quite stable. Therefore, the selection of η is not difficult.

It is also interesting to investigate how tight the proposed bounds are in practice. Figure 2 compares the different bounds and the leave-one-out value. We select the best σ^2 and ε from CV and show the values of the bounds and of leave-one-out while varying C. Clearly, the span bound is a good approximation of leave-one-out, but RM is not when C is large. This situation also happens in classification (Chung et al., 2003). The reason is that S_t can be much smaller than 2R̃ under some parameters. Recall that R̃ is the radius of the smallest
Figure 1: Effect of η on testing errors: using the first training-testing split of the problems add10, mpg, housing, and cpusmall.
Figure 2: Leave-one-out, RM, and MSP. σ^2 and ε are fixed using the best parameters from CV (L2-SVR). The training file is spacega.
sphere containing φ̃(x_i), i = 1, ..., l, so R̃ is large if there are two far-away points. The span bound, however, finds a combination of x_i, i ∈ F\{t}, that is as close to x_t as possible.

6 Conclusions

In this article, we derive leave-one-out bounds for SVR and discuss their properties. Experiments demonstrate that the proposed bounds are competitive with Bayesian SVR for parameter selection. A future study will apply the proposed bounds to feature selection. We also would like to implement nonsmooth optimization techniques, as the bounds here are not really differentiable. An implementation considering L1-SVR is also interesting. Experiments demonstrate that minimizing the proposed bound is more efficient than cross-validation on a discrete set of parameters. For a model with more than two parameters, a grid search is time-consuming, so a gradient-based method may be more suitable.

Appendix A: Proof of Lemma 1

We consider the first case, α_t > 0. Let (w^t, b^t, ξ^t, ξ^{t*}) and (w, b, ξ, ξ*) be the optimal solutions of equations 3.2 and 1.1, respectively. Though ξ^t and ξ^{t*} are vectors with l − 1 elements, recall that we define ξ_t^t = ξ_t^{t*} = 0 to make ξ^t and ξ^{t*} have l elements. Note that the only difference between equations 3.2 and 1.1 is that equation 1.1 possesses the following constraint:

−ε − ξ_t* ≤ w^T φ(x_t) + b − y_t ≤ ε + ξ_t.
(A.1)
We then prove the lemma by contradiction. If the result is wrong, there are α_t > 0 and f^t(x_t) < y_t. From the KKT condition, equation 3.1, α_t > 0 implies ξ_t > 0 and f(x_t) = y_t + ε + ξ_t > y_t + ε > y_t. Then we have

f(x_t) = w^T φ(x_t) + b > y_t > f^t(x_t) = (w^t)^T φ(x_t) + b^t.

Therefore, there is 0 < p < 1 such that

(1 − p)(w^T φ(x_t) + b) + p((w^t)^T φ(x_t) + b^t) = y_t.

Using the feasibility of the two points,

(ŵ, b̂, ξ̂, ξ̂*) = (1 − p)(w, b, ξ, ξ*) + p(w^t, b^t, ξ^t, ξ^{t*})     (A.2)
is a new feasible solution of equation 3.2, without considering the tth elements ξ̂_t and ξ̂_t*. Since the objective function is convex and 0 < p < 1, we have

(1/2) ŵ^T ŵ + (C/2) Σ_{i≠t} (ξ̂_i)^2 + (C/2) Σ_{i≠t} (ξ̂_i*)^2
≤ (1 − p) [ (1/2) w^T w + (C/2) Σ_{i≠t} (ξ_i)^2 + (C/2) Σ_{i≠t} (ξ_i*)^2 ]
  + p [ (1/2) (w^t)^T w^t + (C/2) Σ_{i≠t} (ξ_i^t)^2 + (C/2) Σ_{i≠t} (ξ_i^{t*})^2 ]
≤ (1/2) w^T w + (C/2) Σ_{i≠t} (ξ_i)^2 + (C/2) Σ_{i≠t} (ξ_i*)^2.     (A.3)

The last inequality comes from the fact that (w^t, b^t, ξ^t, ξ^{t*}) is optimal for equation 3.2 and that, if the constraint A.1 of equation 1.1 is not considered, (w, b, ξ, ξ*) is feasible for equation 3.2. In this case α_t > 0, which implies ξ_t > 0 and ξ_t* = 0 from the KKT conditions 3.1. Since ξ_t* = 0, ξ̂_t* = 0 as well. As ξ̂_t and ξ̂_t* are not considered in deriving equation A.3, we can redefine ξ̂_t = ξ̂_t* = 0 such that (ŵ, b̂, ξ̂, ξ̂*) satisfies equation A.1 and thus is also a feasible solution of equation 1.1. Then, from equation A.3, ξ̂_t = 0 < ξ_t, and ξ̂_t* = ξ_t* = 0, we have

(1/2) ŵ^T ŵ + (C/2) Σ_{i=1}^{l} (ξ̂_i)^2 + (C/2) Σ_{i=1}^{l} (ξ̂_i*)^2
< (1/2) w^T w + (C/2) Σ_{i=1}^{l} (ξ_i)^2 + (C/2) Σ_{i=1}^{l} (ξ_i*)^2,

which contradicts the optimality of (w, b, ξ, ξ*) for equation 1.1. The case α_t* > 0 is similar.
Appendix B: Proof of Lemma 2

1. α_t = α_t* = 0. Here x_t is not a support vector, so the optimal solution of equation 1.1 remains optimal after x_t is removed. Thus f^t(x_t) = f(x_t), and equation 3.1 gives |f(x_t) − y_t| ≤ ε.

For the remaining cases, we write the dual 1.2 in an equivalent form in the 2l variables ᾱ = (α; α*):

min_ᾱ  (1/2) ᾱ^T Q̄ ᾱ + p^T ᾱ
subject to  z^T ᾱ = 0,  ᾱ ≥ 0,     (B.1)

where Q̄ = [ K̃  −K̃ ; −K̃  K̃ ] (so Q̄ is defined through K̃), p = (εe + y; εe − y), and z = (−e; e).

2. α_t > 0. In this case, we mainly use the formulation B.1 to represent L2-SVR, with optimal solution ᾱ. We also represent the dual of equation 3.2 in a form similar to equation B.1 and denote by ᾱ^t its optimal solution. Note that equation 3.2 has l − 1 constraints, so ᾱ^t has 2(l − 1) elements. Here, we define ᾱ_t^t = ᾱ_{t+l}^t = 0 to make ᾱ^t a vector with 2l elements. The KKT condition of equation B.1 can be rewritten as

(Q̄ᾱ)_i + b z_i + p_i = 0,  if ᾱ_i > 0,
(Q̄ᾱ)_i + b z_i + p_i > 0,  if ᾱ_i = 0.     (B.5)
Next, we follow the procedure of Vapnik and Chapelle (2000) and Joachims (2000). In equation B.1, the approximate value at the training vector x_t can be written as

f(x_t) = −Σ_{i=1}^{2l} ᾱ_i Q̄_it + b,
which equals f(x_t) defined in equation 1.3. Here, we intend to consider

f(x_t) = −Σ_{i≠t, t+l} ᾱ_i Q̄_it + b     (B.6)

as an approximation of f^t(x_t), since f and f^t do not consider the tth training vector. However, equation B.6 is not applicable, since equations B.3 and B.4 are not a feasible solution of the dual of equation 3.2 when α_t > 0. Therefore, we construct γ from ᾱ, where γ is feasible for the problem with the tth data point removed from equation B.1. Define a set F^t = {i | ᾱ_i > 0, i ≠ t, t + l}. In order to make γ a feasible solution, let γ = ᾱ − η, where η satisfies

η_i ≤ ᾱ_i,  i ∈ F^t,
η_i = 0,  i ∉ F^t, i ≠ t, i ≠ t + l,
η_i = ᾱ_i,  i = t and i = t + l,     (B.7)

and

z^T η = 0.     (B.8)

For example, we can set all η_1, ..., η_l to zero except η_t = α_t. Then, using Σ_{i=l+1}^{2l} ᾱ_i ≥ α_t, we can find η which satisfies equation B.7 and

Σ_{i=l+1}^{2l} η_i = α_t.     (B.9)
Denote by F the objective function of equation B.1. Then,

F(γ) − F(ᾱ) = (1/2)(ᾱ − η)^T Q̄ (ᾱ − η) + p^T (ᾱ − η) − (1/2) ᾱ^T Q̄ ᾱ − p^T ᾱ
            = (1/2) η^T Q̄ η − η^T (Q̄ᾱ + p).     (B.10)

From equations B.5 and B.7, (Q̄ᾱ + p)_i = −z_i b for any i ∈ F^t ∪ {t}, and η_i = 0 for any i ∉ F^t ∪ {t}. Using equation B.8, equation B.10 reduces to

F(γ) − F(ᾱ) = (1/2) η^T Q̄ η.     (B.11)
Similarly, from ᾱ^t we construct a vector δ that is a feasible solution of equation B.1. Let F̄^t = {i | ᾱ_i^t > 0, i ≠ t, t + l}. In order to make δ feasible, define δ = ᾱ^t − µ, where µ satisfies

µ_i ≤ ᾱ_i^t,  i ∈ F̄^t,
µ_i = 0,  i ∉ F̄^t, i ≠ t, i ≠ t + l,
µ_i = −ᾱ_i,  i = t and i = t + l,     (B.12)

and

z^T µ = 0.     (B.13)

The existence of µ follows easily from assumption 1: with the condition z^T ᾱ^t = 0, this assumption implies that at least one of ᾱ^t_{l+1}, ..., ᾱ^t_{2l} is positive. Thus, while the tth component is increased from zero to ᾱ_t, we can increase some positive ᾱ^t_{l+1}, ..., ᾱ^t_{2l} so that z^T δ = 0 still holds. Note that δ is a feasible solution of equation 1.2. It follows that

F(δ) − F(ᾱ^t) = (1/2)(ᾱ^t − µ)^T Q̄ (ᾱ^t − µ) + p^T (ᾱ^t − µ) − (1/2) (ᾱ^t)^T Q̄ ᾱ^t − p^T ᾱ^t
             = (1/2) µ^T Q̄ µ − µ^T (Q̄ᾱ^t + p).     (B.14)
From equation B.5,

(Q̄ᾱ^t + p)_i = −z_i b^t,  ∀i ∈ F̄^t,     (B.15)

where ᾱ^t and b^t are optimal for equation 3.2 and its dual. By equations B.12, B.13, and B.15,

µ^T (Q̄ᾱ^t + p) = −b^t Σ_{i≠t, t+l} µ_i z_i + µ_t (Q̄ᾱ^t + p)_t + µ_{t+l} (Q̄ᾱ^t + p)_{t+l}
               = −ᾱ_t (Q̄ᾱ^t + p + b^t z)_t − ᾱ_{t+l} (Q̄ᾱ^t + p + b^t z)_{t+l}
               = ᾱ_t (f^t(x_t) − y_t − ε) + ᾱ_{t+l} (y_t − f^t(x_t) − ε).

Thus, equation B.14 simplifies to

ᾱ_t (f^t(x_t) − y_t − ε) + ᾱ_{t+l} (y_t − f^t(x_t) − ε) = F(ᾱ^t) − F(δ) + (1/2) µ^T Q̄ µ.     (B.16)
Here we claim that ᾱ_{t+l} = 0 when ᾱ_t > 0, since from equation B.5,

α_t α_t* = ᾱ_t ᾱ_{t+l} = 0.     (B.17)

Note that F(δ) ≥ F(ᾱ), as ᾱ is the optimal solution of equation B.1. Similarly, F(γ) ≥ F(ᾱ^t). Combining equations B.17, B.16, and B.11,

ᾱ_t (f^t(x_t) − y_t − ε) = F(ᾱ^t) − F(δ) + (1/2) µ^T Q̄ µ
                         ≤ F(γ) − F(ᾱ) + (1/2) µ^T Q̄ µ
                         = (1/2) η^T Q̄ η + (1/2) µ^T Q̄ µ.     (B.18)

Let B_η be the set containing all feasible η, that is, all η satisfying equations B.7 and B.8. Similarly, B_µ is the set containing all feasible µ. Then,

ᾱ_t (f^t(x_t) − y_t − ε) ≤ min_{η∈B_η} (1/2) η^T Q̄ η + min_{µ∈B_µ} (1/2) µ^T Q̄ µ,     (B.19)
since equation B.18 is valid for all feasible η and µ.

Recall K̃ = K + I/C, and define g_i = η_i − η_{i+l} for any i = 1, …, l. From the definition of equations B.1 and B.2,

η^T Q̄ η = Σ_{i=1}^{l} Σ_{j=1}^{l} g_i g_j K̃_ij = g_t² K̃_tt + 2 g_t Σ_{i≠t} g_i K̃_it + Σ_{i≠t} Σ_{j≠t} g_i g_j K̃_ij.

Moreover, we can rewrite

η^T Q̄ η = g_t² ( K̃_tt − 2 Σ_{i≠t} λ_i K̃_it + Σ_{i≠t} Σ_{j≠t} λ_i λ_j K̃_ij ),   (B.20)

where

λ_i = −g_i / g_t,  i = 1, …, l, i ≠ t.   (B.21)

When α_t > 0, α_{t+l} = 0 from equation B.17. Therefore, g_t is not zero, since η_{t+l} = 0 and g_t = ᾱ_t. From equation B.8,

Σ_{i=1, i≠t}^{l} λ_i = 1.   (B.22)
Leave-One-Out Bounds for Support Vector Regression
1211
Note that

η_i η_{i+l} = 0   (B.23)

from equations B.7 and B.17. Therefore, from K̃_ij = φ̃(x_i)^T φ̃(x_j) and equation B.23, equations B.20, B.21, and B.22 imply

min_{η ∈ B_η} η^T Q̄ η = g_t² d²(φ̃(x_t), Λ_t) = ᾱ_t² d²(φ̃(x_t), Λ_t),   (B.24)

where

Λ_t = { Σ_{i=1, i≠t}^{l} λ_i φ̃(x_i) | Σ_{i≠t} λ_i = 1, λ_i ≥ −ᾱ_i/g_t if ᾱ_i ≥ 0, λ_i ≤ ᾱ_{i+l}/g_t if ᾱ_{i+l} ≥ 0 },

and d(φ̃(x_t), Λ_t) is the distance between φ̃(x_t) and the set Λ_t in the feature space. Define a subset of Λ_t as

Λ_t^+ = { Σ_{i=1, i≠t}^{l} λ_i φ̃(x_i) | Σ_{i≠t} λ_i = 1, λ_i ≥ 0, λ_i ≥ −ᾱ_i/g_t if ᾱ_i ≥ 0, λ_i ≤ ᾱ_{i+l}/g_t if ᾱ_{i+l} ≥ 0 }.
Recall that in equation B.9, one way to find a feasible ᾱ − η is to decrease some free ᾱ_{l+1}, …, ᾱ_{2l}. With g_t = η_t = ᾱ_t > 0, this is achieved by using positive λ_i: ᾱ_{i+l} − λ_i g_t. Thus, Λ_t^+ is nonempty. As Λ_t^+ is a subset of the convex hull of the φ̃(x_i), i = 1, …, l, i ≠ t,

d²(φ̃(x_t), Λ_t) ≤ d²(φ̃(x_t), Λ_t^+) ≤ max_{i≠t} ‖φ̃(x_t) − φ̃(x_i)‖² ≤ 4 R̃².   (B.25)

Equation B.24 then implies

min_{η ∈ B_η} η^T Q̄ η ≤ 4 ᾱ_t² R̃².   (B.26)

Similarly,

min_{µ ∈ B_µ} µ^T Q̄ µ ≤ 4 ᾱ_t² R̃².   (B.27)
Combining the above inequalities and equation B.19, and canceling out ᾱ_t, we have

f^t(x_t) − y_t ≤ 4 R̃² ᾱ_t + ε = 4 R̃² α_t + ε.   (B.28)
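The distance term d²(φ̃(x_t), ·) that drives these bounds can be evaluated from kernel entries alone. The sketch below is illustrative only (the random data, RBF kernel, and parameter values are hypothetical choices, not the article's experiments): it computes the squared distance from φ̃(x_t) to the affine hull of the remaining feature points under K̃ = K + I/C by solving the equality-constrained least-squares (KKT) system.

```python
import numpy as np

# Illustrative only: random data, RBF kernel, and parameters are
# hypothetical choices, not taken from the article.
rng = np.random.default_rng(1)
X = rng.standard_normal((8, 2))
C, gamma = 10.0, 0.5

sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-gamma * sq) + np.eye(len(X)) / C      # K~ = K + I/C

t = 0
F = [i for i in range(len(X)) if i != t]
KFF = K[np.ix_(F, F)]
kF = K[F, t]

# minimize ||phi_t - sum_i lam_i phi_i||^2  s.t.  sum_i lam_i = 1
# KKT system: [2*KFF  1; 1^T  0][lam; nu] = [2*kF; 1]
n = len(F)
M = np.zeros((n + 1, n + 1))
M[:n, :n] = 2 * KFF
M[:n, n] = 1.0
M[n, :n] = 1.0
rhs = np.append(2 * kF, 1.0)
lam = np.linalg.solve(M, rhs)[:n]

d2 = K[t, t] - 2 * lam @ kF + lam @ KFF @ lam     # squared distance
```

For an RBF kernel with the I/C ridge, every ‖φ̃(x_i)‖² = 1 + 1/C, so the computed distance cannot exceed a 4R̃²-style cap with R̃² = 1 + 1/C.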
3. α*_t > 0. The result can be proved through a procedure similar to that for the case of α_t > 0.

Appendix C: Proof of Theorem 2

Define η̄ = −µ̄ = ᾱ − ᾱ^t. Under the assumption that the set of support vectors remains the same during the leave-one-out procedure, η̄ ∈ B_η and µ̄ ∈ B_µ. Then equation B.18 becomes an equality:

ᾱ_t ( f^t(x_t) − y_t − ε ) = η̄^T Q̄ η̄ = min_{η ∈ B_η} η^T Q̄ η = ᾱ_t² d²(φ̃(x_t), Λ_t).
Since η̄ = ᾱ − ᾱ^t and the sets of support vectors of ᾱ and ᾱ^t are the same,

d²(φ̃(x_t), Λ_t) = d²(φ̃(x_t), Λ'_t),

where

Λ'_t = { Σ_{i≠t, α_i+α*_i>0} λ_i φ̃(x_i) | Σ_{i≠t} λ_i = 1 }.

The reason is that, using the assumption, we do not need to consider the constraints associated with free support vectors in the definition of Λ_t. Therefore, it follows that

loo ≤ Σ_{t=1}^{l} (α_t + α*_t) S_t² + l ε,
where S_t² is the optimal objective value of equation 2.5 with F replaced by {i | i ≠ t, α_i + α*_i > 0}.

Appendix D: Leave-One-Out Bounds for L1-SVR

D.1 Modifications in the Proof of Lemma 1. In lemma 1, α_t > 0 implies ξ_t > 0, which is used for the strict inequality in equation A.4. However, for L1-SVR, ξ_t may be zero even if α_t > 0. To prove the inequality, we consider
equation A.3, in which the equality becomes active only if

(1/2)(w^t)^T w^t + C Σ_{i≠t} ξ^t_i + C Σ_{i≠t} ξ^{t*}_i = (1/2) w^T w + C Σ_{i≠t} ξ_i + C Σ_{i≠t} ξ*_i.

Therefore, (w, b, ξ, ξ*) is optimal for equation 3.2 as well. Using assumption 1, (w, b) = (w^t, b^t), so f^t(x_t) = f(x_t) ≥ y_t contradicts the assumption that f^t(x_t) < y_t.

D.2 Modifications in the Proof of Lemma 2. The proof for the case of α_t = α*_t = 0 is exactly the same, so we focus on the case of α_t > 0. Similar to the L2 case, we consider a form like equation B.1, and ᾱ becomes the dual variable. Now F^t is redefined as {i | 0 < ᾱ_i < C, i ≠ t, t+l}. We claim that η still exists so that

0 ≤ ᾱ_i − η_i ≤ C, i ∈ F^t;   η_i = 0, i ∉ F^t, i ≠ t, i ≠ t+l;   η_i = ᾱ_i, i = t and i = t+l,   (D.1)

and

z^T η = 0   (D.2)

are satisfied. In order to decrease ᾱ_t to zero, one may decrease some free ᾱ_{l+1}, …, ᾱ_{2l} so that equation B.9 is satisfied. However, for L1-SVR, it is possible that after all free ᾱ_{l+1}, …, ᾱ_{2l} are decreased to 0, ᾱ_t is not zero yet. At this point we must increase some free ᾱ_1, …, ᾱ_l. Since we keep e^T ᾱ = 0 and all ᾱ_{l+1}, …, ᾱ_{2l} have been updated to zero or remain at C,

ᾱ_t + Σ_{i=1,…,l, i≠t, 0<ᾱ_i} ᾱ_i = C.   (D.3)

> 0 for all x, x′ ∈ X. Since the gaussian kernel can be shown to satisfy Mercer's condition of semipositivity (Scholkopf & Smola, 2002), it is a dot product in feature space; hence, it is the cosine of the angle between the corresponding features. The feature space is, then, an orthant of the unity sphere; hence, it is quite confined (yet infinite dimensional; Scholkopf & Smola, 2002; Micchelli, 1986). Given a training set, our objective is to maximize K_v with respect to the kernel parameters. Newton's iteration is

v(k+1) = v(k) + [ ∇²_v K_v ]^{−1} ∇_v K_v,   (3.2)
where ∇_v K_v and ∇²_v K_v are, respectively, the gradient and the Hessian of K_v with respect to v. In the case of a gaussian kernel, we have one kernel parameter, v. The first and second derivatives of the kernel with respect to v
1270
Y. Baram
are:

∂k_v(x^(i), x^(j)) / ∂v = ( 2 ‖x^(i) − x^(j)‖² / v³ ) exp( −‖x^(i) − x^(j)‖² / v² ),   (3.3)

∂²k_v(x^(i), x^(j)) / ∂v² = ( −6 ‖x^(i) − x^(j)‖² / v⁴ ) exp( −‖x^(i) − x^(j)‖² / v² ) + ( 4 ‖x^(i) − x^(j)‖⁴ / v⁶ ) exp( −‖x^(i) − x^(j)‖² / v² ).   (3.4)
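Equations 3.3 and 3.4 can be checked numerically. The sketch below (the evaluation point is arbitrary, and the polarization objective K_v of equation 3.1 is not reproduced here) writes the kernel in terms of the squared distance d2 = ‖x^(i) − x^(j)‖² and verifies both derivatives against finite differences:

```python
import numpy as np

# Gaussian kernel k_v = exp(-d2 / v^2) with d2 = ||x_i - x_j||^2.
def k(d2, v):
    return np.exp(-d2 / v**2)

def dk_dv(d2, v):                       # equation 3.3
    return (2.0 * d2 / v**3) * k(d2, v)

def d2k_dv2(d2, v):                     # equation 3.4
    return (-6.0 * d2 / v**4 + 4.0 * d2**2 / v**6) * k(d2, v)

# finite-difference check at an arbitrary point
d2, v = 1.5, 0.8
fd1 = (k(d2, v + 1e-5) - k(d2, v - 1e-5)) / 2e-5
fd2 = (k(d2, v + 1e-4) - 2 * k(d2, v) + k(d2, v - 1e-4)) / 1e-8
```

The finite-difference agreement confirms the algebra; accumulating these derivatives over the training pairs (weighted as K_v prescribes) supplies the gradient and Hessian for the Newton step of equation 3.2.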
Substituting these expressions into equation 3.2 completes the specification of the parameter optimization step.

How can we evaluate the performance of kernel sum classifiers? In learning theory, performance is often measured by risk functions (Scholkopf & Smola, 2002). In particular, given a classifier f and a cost function c(x, y, f(x)) (e.g., c(x, y, f(x)) = 0.5 |y − f(x)|), the expected risk is defined as

R[f] = E{c(x, y, f(x))} = ∫_{X×Y} c(x, y, f(x)) dP(x, y),   (3.5)

whereas the empirical risk is

R_emp[f] = (1/M) Σ_{i=1}^{M} c( x^(i), y^(i), f(x^(i)) ).   (3.6)
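As a small worked instance of equation 3.6 (the labels and predictions below are made up): for y, f(x) ∈ {−1, +1}, the cost c = 0.5 |y − f(x)| is 0 on correct points and 1 on errors, so the empirical risk reduces to the training error rate.

```python
import numpy as np

y = np.array([1, -1, 1, 1, -1])    # true labels
f = np.array([1, -1, -1, 1, -1])   # hypothetical sign-classifier outputs

# c = 0.5 * |y - f| is 0 when correct and 1 when wrong,
# so the empirical risk (eq. 3.6) is the error rate.
R_emp = np.mean(0.5 * np.abs(y - f))
```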
Rooted in VC theory (Vapnik & Chervonenkis, 1968, 1971), generalization is defined as uniform (with respect to a classifier family) convergence of the empirical risk to the actual risk as the size M of the training set grows indefinitely. When the empirical risk is zero, generalization has a particularly strong meaning, since it guarantees, asymptotically, a faultless classifier. Since for a commonly used cost per error (c = 0.5 |y(x) − f(x)|) the expected risk coincides with the error probability, we shall call such convergence consistency (this definition, which is consistent with the one used in statistics, is different from the one used by Scholkopf & Smola, 2002, which implies convergence of the expected risk to the Bayes risk). When the kernel in the kernel sum classifier is positive definite and the dimension of the corresponding feature space is finite, so is the VC dimension. Generalization is then implied by the VC bound. Unfortunately, the feature spaces associated with some useful kernels (e.g., the gaussian one) are infinite dimensional, making the VC bound unusable. Several alternative bounds on classifier performance have been proposed (e.g., Scholkopf & Smola, 2002).
Learning by Kernel Polarization
1271
We find the PAC-Bayesian margin bound particularly useful in the present context. We shall assume that polarization is complete, in the sense that the polarity coefficient, equation 3.1, attains its absolute maximum value. Then we have the following result:

Lemma. Under complete kernel polarization, the expected risk of a kernel sum classifier f(x) = sign{⟨w, φ(x)⟩} with a positive definite kernel and a feature space of dimension n ∈ N ∪ {∞} is, with probability 1 − δ, bounded as

R[f] ≤ (2/M) ( −d ln( 1 − √(1 − ρ²(f)) ) + 2 ln M − ln δ + 2 ),   (3.7)

where d = min(M, n) and

ρ(f) = min_{i∈[M]} y^(i) ⟨w, φ(x^(i))⟩ / ( ‖w‖ ‖φ(x^(i))‖ ),

where φ(x^(i)) is the feature corresponding to x^(i) and [M] = {1, 2, …, M}.

Proof. Complete kernel polarization guarantees that the two classes in the training set reduce to a point each in feature space, and these two points are different. This means that the empirical risk of a kernel sum classifier on the training set is zero. The proof now follows directly from the PAC-Bayesian margin bound (Herbrich, 2000; Scholkopf & Smola, 2002).

Consistency of kernel sum classifiers now follows for positive unity kernels, that is, kernels satisfying k(x, x′) ≥ 0 and k(x, x) = 1.

Theorem. For a positive unity kernel under complete polarization, a kernel sum classifier is consistent.

Proof. We first note that under a positive unity kernel, the feature space is the positive orthant of the unity sphere. Under complete polarization, the two classes are concentrated at two points in feature space, separated by an angle π/2. Consider a homogeneous separating hyperplane in feature space yielding a zero empirical risk and the highest possible expected risk. There may be, at most, two such hyperplanes, each passing arbitrarily close to one of the two polarized classes (one or both may possess the highest expected risk). Let us denote the normal to one of these hyperplanes w* and the associated classifier f*. Without loss of generality, suppose that this hyperplane passes arbitrarily close to the point corresponding to the polarized class labeled y = +1. Then w* forms an angle arbitrarily close to π with the vector associated with the point corresponding to the polarized class labeled y = −1. Clearly, for any other hyperplane w with zero empirical
risk (i.e., a hyperplane passing between the two points) and its associated classifier f, we have

R[f] ≤ R[f*].   (3.8)

We now have min_{i∈[M]} { y^(i) ⟨w*, φ(x^(i))⟩ } = cos π = −1 and ‖φ(x^(i))‖ = √( k_v(x^(i), x^(i)) ) = 1; hence, ρ²(f*) = 1. It follows from equation 3.7 that

R[f*] ≤ (2/M) ( 2 ln M − ln δ + 2 ).   (3.9)

Hence,

R[f] ≤ (2/M) ( 2 ln M − ln δ + 2 ).   (3.10)
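A quick numeric check (the values of M and δ are chosen arbitrarily) that the right-hand side of equations 3.9 and 3.10 indeed shrinks as M grows:

```python
import numpy as np

# right-hand side of eqs. 3.9/3.10: (2/M)(2 ln M - ln delta + 2)
def bound(M, delta=0.05):
    return (2.0 / M) * (2 * np.log(M) - np.log(delta) + 2)

vals = [bound(M) for M in (100, 1000, 10000)]
```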
It follows that R[f] → 0 as M → ∞, as asserted.

The theorem implies that a kernel sum classifier with a polarized gaussian kernel is consistent. Of course, this is also true for other kernels with similar properties. Even if complete polarization is not achieved, this result certainly motivates kernel polarization, which is further justified in the following section by numerical examples.

4 Examples

We have applied kernel polarization to the six smallest databases from UCI (2003), ranging in size from 215 to 1066 and in dimension from 5 to 20. Each database was divided into four parts, three parts serving as a training set and one part as a test set, in a fourfold cross-validation fashion. For each of the cases, we have polarized a gaussian kernel with respect to the training data. The polarized kernel was then employed by the weighted-neighbors (PKWN), the nearest-mean (PKNM), and the support vectors (PKSV) classifiers described before. We have also applied the Parzen classifier (PC) and the Euclidean versions of the nearest neighbor (NN), the nearest mean (NM), and the weighted neighbors (WN) classifiers to the same data. Performance results for the SV classifier, employing exhaustive search for the kernel and the soft margin parameter values, were obtained from R. El-Yaniv and E. Yom-Tov (personal communication, 2003). The results, in terms of the average percentage of errors over the four folds and the standard deviation of the mean (STDM), are summarized in Table 1. The average performance of each of the classifiers is given in the last row of the table, where the STDM value is calculated with respect to the results for the different databases.
Table 1: Classification Results (Error Percentage Mean ± STDM) for the Examples.

            PC           NN           WN           PKWN         NM           PKNM         SV           PKSV
Thyroid     22.3 ± 0.8   3.3 ± 0.3    30.2 ± 1.9   8.4 ± 0.41   15.2 ± 0.9   9.3 ± 0.9    5.2 ± 3.2    5.1 ± 0.1
Heart       16.7 ± 3.0   23.7 ± 0.3   41.0 ± 6.1   23.0 ± 0.6   18.4 ± 0.9   15.6 ± 0.5   15.3 ± 6.4   16.3 ± 2.0
Cancer      27.4 ± 0.1   30.7 ± 2.1   29.2 ± 0.4   28.9 ± 0.9   30.4 ± 0.5   29.9 ± 0.7   29.7 ± 2.8   27.1 ± 0.7
Diabetes    24.9 ± 1.8   28.6 ± 0     34.9 ± 2.1   25.9 ± 0.7   26.9 ± 1.2   27.3 ± 1.6   22.7 ± 2.3   22.9 ± 0.4
German      26.7 ± 0.6   30.3 ± 1.3   30.0 ± 1.4   29.4 ± 0.3   28.5 ± 0.8   28.2 ± 0.6   23.5 ± 0.8   23.7 ± 0.8
Flare       44.0 ± 2.2   44.4 ± 0.3   35.3 ± 0.5   34.2 ± 1.4   37.2 ± 1.4   33.4 ± 3.5   32.4 ± 1.8   34.0 ± 0.8
Mean        27.0 ± 3.7   26.8 ± 5.5   33.4 ± 1.9   24.9 ± 3.7   26.1 ± 3.3   24.1 ± 3.9   21.5 ± 4.1   21.6 ± 4.0
It can be seen that the kernel versions of the weighted neighbors and the nearest mean classifiers (PKWN and PKNM) produced better results than their Euclidean counterparts (WN and NM) in all cases. Furthermore, these results were better than those achieved by the NN and the PC classifiers in most cases. These relative advantages can also be seen from the average performance results given in the last row of the table. The worst results were achieved by the Euclidean WN classifier. Switching to polarized kernels, the average performance of the WN classifier improved by about 25%, exceeding those of the NN and the PC classifiers by about 10% on average. The performances of the PKWN and the PKNM classifiers are nearly the same. As for the SV classifier, the use of polarized kernels proves to be quite beneficial: the classification results are nearly the same as those obtained by exhaustive parameter search. Moreover, since polarization implies that hard classification is nearly as effective as soft classification, our PKSV design involves the kernel parameter (v) alone, and the soft margin parameter is eliminated.
5 Conclusion

We have introduced the concept of kernel polarization as a means for selecting kernel parameter values. We have shown that complete kernel polarization yields consistency in kernel sum classification. We have examined the use of polarized kernels in proximity classifiers on real-life data and found them to perform better than their Euclidean distance counterparts. Employed by a support vectors classifier, a polarized kernel was found to produce about the same classification results as a kernel optimized by exhaustive search.

Acknowledgments

I thank Ran El-Yaniv for his insightful comments and his help with the numerical examples.
References

Baram, Y. (1996). Classification by balanced representation. Neurocomputing, 13, 347–357.
Chapelle, O., Vapnik, V., Bousquet, O., & Mukherjee, S. (2002). Choosing multiple parameters for support vector machines. Machine Learning, 46, 131–159.
Cover, T. M., & Hart, P. E. (1967). Nearest neighbor pattern classification. IEEE Trans. on Information Theory, IT-13, 21–27.
Cristianini, N., & Shawe-Taylor, J. (2000). An introduction to support vector machines and other kernel-based learning methods. Cambridge: Cambridge University Press.
Cristianini, N., Shawe-Taylor, J., Elisseeff, A., & Kandola, J. (2001). On kernel-target alignment. In T. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems, 14 (pp. 367–373). Cambridge, MA: MIT Press.
Fukunaga, K. (1990). Introduction to statistical pattern recognition. Orlando, FL: Academic Press.
Haykin, S. (1998). Neural networks, a comprehensive foundation (2nd ed.). New York: Macmillan.
Herbrich, R. (2000). Learning linear classifiers. Unpublished doctoral dissertation, TU Berlin.
Micchelli, C. A. (1986). Algebraic aspects of interpolation. In Proceedings of Symposia in Applied Mathematics, 36 (pp. 81–102). Providence, RI: American Mathematical Society.
Parzen, E. (1962). On the estimation of probability density function and the mode. Ann. Math. Stat., 33, 1065–1076.
Platt, J. C., Burges, C. J. C., Swenson, S., Weare, C., & Zheng, A. (2002). Learning a gaussian process prior for automatically generating music playlists. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15 (pp. 1425–1432). Cambridge, MA: MIT Press.
Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6), 386–408.
Scholkopf, B., & Smola, A. J. (2002). Learning with kernels. Cambridge, MA: MIT Press.
UCI. (2003). Machine Learning Repository. Available online: www.ics.uci.edu/∼mlearn/MLRepository.html.
Vapnik, V. (1995). The nature of statistical learning theory. New York: Springer.
Vapnik, V., & Chervonenkis, A. (1968). Uniform convergence of frequencies of occurrence of events to their probabilities. Dokl. Akad. Nauk SSSR, 181, 915–918.
Vapnik, V., & Chervonenkis, A. (1971). On the uniform convergence of relative frequencies of occurrence of events to their probabilities. Theory of Probability and Its Applications, 16(2), 264–280.
Vapnik, V., & Lerner, A. (1963). Pattern recognition using generalized portrait method. Automation and Remote Control, 24, 774–780.
Wang, L., & Chan, K. L. (2002). Learning kernel parameters by using class separability measure. Paper presented at the NIPS kernel workshop, Whistler, Canada.
Received December 18, 2003; accepted November 29, 2004.
LETTER
Communicated by Daniel Durstewitz
A Unified Approach to Building and Controlling Spiking Attractor Networks Chris Eliasmith
[email protected] Department of Philosophy and Department of Systems Design Engineering, University of Waterloo, Waterloo, Ontario, Canada
Extending work in Eliasmith and Anderson (2003), we employ a general framework to construct biologically plausible simulations of the three classes of attractor networks relevant for biological systems: static (point, line, ring, and plane) attractors, cyclic attractors, and chaotic attractors. We discuss these attractors in the context of the neural systems that they have been posited to help explain: eye control, working memory, and head direction; locomotion (specifically swimming); and olfaction, respectively. We then demonstrate how to introduce control into these models. The addition of control shows how attractor networks can be used as subsystems in larger neural systems, demonstrates how a much larger class of networks can be related to attractor networks, and makes it clear how attractor networks can be exploited for various information processing tasks in neurobiological systems.

1 Introduction

Persistent activity has been thought to be important for neural computation at least since Hebb (1949), who suggested that it may underlie short-term memory. Amit (1989), following work on attractors in artificial neural networks (e.g., that of Hopfield, 1982), suggested that persistent neural activity in biological networks is a result of dynamical attractors in the state space of recurrent networks. Since then, attractor networks have become a mainstay of computational neuroscience and have been used in a wide variety of models (see, e.g., Zhang, 1996; Seung, Lee, Reis, & Tank, 2000; Touretzky & Redish, 1996; Laing & Chow, 2001; Hansel & Sompolinsky, 1998; Eliasmith, Westover, & Anderson, 2002). Despite a general agreement among theoretical neuroscientists that attractor networks form a large and biologically relevant class of networks, there is no general method for constructing and controlling the behavior of such networks.
In this letter, we present such a method and explore several examples of its application, significantly extending work described in Eliasmith and Anderson (2003). We argue that this framework can both unify the current use of attractor networks and show how to extend the range of applicability of attractor models. Neural Computation 17, 1276–1314 (2005)
© 2005 Massachusetts Institute of Technology
Controlling Spiking Attractor Networks
1277
Most important, perhaps, we describe in detail how complex control can be integrated with standard attractor models. This allows us to begin to answer the kinds of pressing questions now being posed by neuroscientists, including, for example, how to account for the dynamics of working memory (see, e.g., Brody, Romo, & Kepecs, 2003; Fuster, 2001; Rainer & Miller, 2002). We briefly summarize the general framework and then demonstrate its application to the construction of spiking networks that exhibit line, plane, ring, cyclic, and chaotic attractors. Subsequently, we describe how to integrate control signals into each of these models, significantly increasing the power and range of application of these networks.

2 Framework

This section briefly summarizes the methods described in Eliasmith and Anderson (2003), which we will refer to as the neural engineering framework (NEF). Subsequent sections discuss the application of these methods to the construction and control of attractor networks. The following three principles describe the approach:

1. Neural representations are defined by the combination of nonlinear encoding (exemplified by neuron tuning curves and neural spiking) and weighted linear decoding (over populations of neurons and over time).

2. Transformations of neural representations are functions of the variables represented by neural populations. Transformations are determined using an alternately weighted linear decoding.

3. Neural dynamics are characterized by considering neural representations as control-theoretic state variables. Thus, the dynamics of neurobiological systems can be analyzed using control theory.

In addition to these main principles, the following addendum is taken to be important for analyzing neural systems:

- Neural systems are subject to significant amounts of noise. Therefore, any analysis of such systems must account for the effects of noise.
Each of the next three sections describes one of the principles, in the context of the addendum, in more detail. (For detailed justifications of these principles, see Eliasmith & Anderson, 2003.) They are presented here to make clear both the terminology and assumptions in the subsequent network derivations. The successes of the subsequent models help to justify the adoption of these principles.
1278
C. Eliasmith
2.1 Representation. Consider a population of neurons whose activities a_i(x) encode some vector, x. These activities can be written as

a_i(x) = G_i[ J_i(x) ],   (2.1)

where G_i is the nonlinear function describing the neuron's response function, and J_i(x) is the current entering the soma. The somatic current is given by

J_i(x) = α_i x · φ̃_i + J_i^bias,   (2.2)
where α_i is a gain and conversion factor, x is the vector variable to be encoded, φ̃_i determines the preferred stimulus of the neuron, and J_i^bias is a bias current that accounts for background activity. This equation provides a standard description of the effects of a current arriving at the soma of neuron i as a result of presenting a stimulus x. The nonlinearity G_i that describes the neuron's activity as a result of this current can be left undefined for the moment. In general, it should be determined experimentally and thus based on the intrinsic physiological properties of the neuron(s) being modeled. The result of applying G_i to the soma current J_i(x) over the range of stimuli gives the neuron's tuning curve a_i(x). So a_i(x) defines the encoding of the stimulus into neural activity. Given this encoding, the original stimulus vector can be estimated by decoding those activities, for example,

x̂ = Σ_i a_i(x) φ_i.   (2.3)
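As a concrete sketch of equations 2.1 to 2.3 (the rectified-linear stand-in for G_i, the one-dimensional stimulus, and all parameter ranges are hypothetical choices, not the article's simulations), the decoders can be obtained by noise-regularized least squares:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 50                                   # neurons
xs = np.linspace(-1, 1, 201)             # 1-D stimulus range

# hypothetical rectified-linear tuning curves standing in for G_i[J_i(x)]
gains = rng.uniform(0.5, 2.0, N)         # alpha_i
enc = rng.choice([-1.0, 1.0], N)         # preferred directions phi~_i
bias = rng.uniform(-1.0, 1.0, N)         # J_i^bias
A = np.maximum(0.0, gains[:, None] * enc[:, None] * xs[None, :] + bias[:, None])

# least-squares decoders with a ridge term modeling noise
sigma = 0.1 * A.max()
G = A @ A.T / len(xs) + sigma**2 * np.eye(N)
U = A @ xs / len(xs)
phi = np.linalg.solve(G, U)

x_hat = phi @ A                          # linear decode, as in eq. 2.3
rmse = np.sqrt(np.mean((x_hat - xs)**2))
```

The small residual (rmse well below the stimulus range) illustrates the nonlinear-encoding / linear-decoding pairing.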
These decoding vectors, φ_i, can be found by a least-squares method (see the appendix; Salinas & Abbott, 1994; Eliasmith & Anderson, 2003). Together, the nonlinear encoding in equation 2.1 and the linear decoding in equation 2.3 define a neural population code for the representation of x. To incorporate a temporal code into this population code, we can draw on work that has shown that most of the information in neural spike trains can be extracted by linear decoding (Rieke, Warland, de Ruyter van Steveninck, & Bialek, 1997). Let us first consider the temporal code in isolation by taking the neural activities a_i(t) to be decoded spike trains, that is,

a_i(t) = Σ_n h_i(t) ∗ δ_i(t − t_n) = Σ_n h_i(t − t_n),   (2.4)

where δ_i(·) are the spikes at times t_n for neuron i, and h_i(t) are the linear decoding filters, which, for reasons of biological plausibility, we can take to be the (normalized) postsynaptic currents (PSCs) in the subsequent neuron. Elsewhere it has been shown that the information loss under this assumption is minimal and can be alleviated by increasing population size (Eliasmith
& Anderson, 2003). As before, the encoding on which this linear decoding operates is defined as in equation 2.1, where G_i is now taken to be a spiking nonlinearity. We can now combine this temporal code with the previously defined population code to give a general population temporal code for vectors:

Encoding:  δ(t − t_in) = G_i[ α_i x · φ̃_i + J_i^bias ]
Decoding:  x̂ = Σ_{i,n} h_i(t − t_n) φ_i.

2.2 Transformation. For such representations to be useful, they must be used to define transformations (i.e., functions of the vector variables). Fortunately, we can again find (least-squares optimal) decoders φ_i^{f(x)} to perform a transformation f(x). So instead of finding the optimal decoders φ_i to extract the originally encoded variable x from the encoding, we can reweight the decoding to give some function f(x) other than identity (see the appendix). To distinguish the representational decoders φ_i from φ_i^{f(x)}, we refer to the latter as transformational decoders. Given this characterization, it is a simple matter to rewrite the encoding and decoding equations for estimating some function of the vector variable:

Encoding:  δ(t − t_in) = G_i[ α_i x · φ̃_i + J_i^bias ]
Decoding:  f̂(x) = Σ_{i,n} h_i(t − t_n) φ_i^{f(x)}.

Notably, both linear and nonlinear functions of the encoded variable can be computed in this manner (see Eliasmith & Anderson, 2003, for further discussion).

2.3 Dynamics. Dynamics of neural systems can be described using the previous characterizations of representation and transformation by employing modern control theory. Specifically, we can allow the higher-level vector variables represented by a neural population to be control-theoretic state variables. Let us first consider linear time-invariant (LTI) systems. Recall that the state equation in modern control theory describing LTI dynamics is

ẋ(t) = A x(t) + B u(t).   (2.5)

The input matrix B and the dynamics matrix A completely describe the dynamics of the LTI system, given the state variables x(t) and the input u(t). Taking the Laplace transform of equation 2.5 gives

x(s) = h(s) [ A x(s) + B u(s) ],

where h(s) = 1/s. Any LTI control system can be written in this form.
In the case of neural systems, the transfer function h(s) is not 1/s but is determined by the intrinsic properties of the component cells. Because it is reasonable to assume that the dynamics of the synaptic PSC dominate the dynamics of the cellular response as a whole (Eliasmith & Anderson, 2003), it is reasonable to characterize the dynamics of neural populations based on their synaptic dynamics, that is, using h_i(t) from equation 2.4. A simple model of a synaptic PSC is given by

h′(t) = (1/τ) e^{−t/τ},   (2.6)

where τ is the synaptic time constant. The Laplace transform of this filter is

h′(s) = 1 / (1 + sτ).

Given the change in filters from h(s) to h′(s), we now need to determine how to change A and B in order to preserve the dynamics defined in the original system (i.e., the one using h(s)). In other words, letting the neural dynamics be defined by A′ and B′, we need to determine the relation between matrices A and A′ and matrices B and B′ given the differences between h(s) and h′(s). To do so, we can solve for x(s) in both cases and equate the resulting expressions for x(s). This gives

A′ = τ A + I,   (2.7)
B′ = τ B.   (2.8)
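A minimal numerical check of the mapping in equations 2.7 and 2.8, using a scalar integrator (A = 0, B = 1) and forward-Euler stepping; the time constants and step sizes are illustrative choices, not the article's simulations:

```python
import numpy as np

# Check that A' = tau*A + I, B' = tau*B preserves dx/dt = A x + B u
# when the integrator h(s) = 1/s is replaced by the synaptic filter
# h'(s) = 1/(1 + s*tau), i.e. tau * dx/dt + x = A' x + B' u.
tau, dt, T = 0.005, 1e-4, 0.5
A = np.array([[0.0]])        # integrator
B = np.array([[1.0]])
Ap = tau * A + np.eye(1)     # eq. 2.7
Bp = tau * B                 # eq. 2.8

x = np.zeros(1)
x_syn = np.zeros(1)
u = np.array([1.0])          # constant input
for _ in range(int(T / dt)):
    x = x + dt * (A @ x + B @ u)                          # ideal system
    x_syn = x_syn + dt / tau * (-x_syn + Ap @ x_syn + Bp @ u)  # synaptic system
```

Both trajectories integrate the unit input, so each state ends near T, which is exactly the invariance the substitution is designed to guarantee.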
This procedure assumes nothing about A or B, so we can construct a neurobiologically realistic implementation of any dynamical system defined using the techniques of modern control theory applied to LTI systems (constrained by the neurons’ intrinsic dynamics). Note also that this derivation is independent of the spiking nonlinearity, since that process is both very rapid and dedicated to encoding the resultant somatic voltage (not filtering it in any significant way). Importantly, the same approach can be used to characterize the broader class of time-varying and nonlinear control systems (examples are provided below). 2.4 Synthesis. Combining the preceding characterizations of representation, transformation, and dynamics results in the generic neural subsystem shown in Figure 1. With this formulation, the synaptic weights needed to implement some specified control system can be directly computed. Note that the formulation also provides for simulating the same model at various levels of description (e.g., at the level of neural populations or individual neurons, using either rate neurons or spiking neurons, and so on). This is useful for fine-tuning a model before introducing the extra computational overhead involved with modeling complex spiking neurons.
Figure 1: A generic neural population systems diagram. This figure is a combination of a higher-level (control-theoretic) and a neural-level system description (denoted by dotted lines). The solid boxes highlight the dendritic elements. The right-most solid box decomposes the synaptic weights into the relevant matrices. See the text for further discussion. (Adapted from Eliasmith & Anderson, 2003.)
In Figure 1, the solid lines highlight the dendritic components, which have been separated into postsynaptic filtering by the PSCs (i.e., temporal decoding) and the synaptic weights (population decoding and encoding). The weights themselves are determined by three matrices: (1) the decoding matrix, whose elements are the decoders φ_j^{Fβ} for some (nonlinear) function F of the signal x^β(t) that comes from a preceding population β; (2) the encoding matrix, whose elements are the encoders φ̃_i^α for this population; and (3) the generalized transformation matrix M^{αβ}, which defines the transformations necessary for implementing the desired control system. Essentially, this diagram summarizes the three principles and their interrelations. The generality of the diagram hints at the generality of the underlying framework. In the remainder of the article, however, I focus only on its application to the construction and control of attractor networks.

3 Building Attractor Networks

The most obvious feature of attractor networks is their tendency toward dynamic stability. That is, given a momentary input, they will settle on a position, or a recurring sequence of positions, in state space. This kind of stability can be usefully exploited by biological systems in a number of ways. For instance, it can help the system react to environmental changes
on multiple timescales. That is, stability permits systems to act on longer timescales than they might otherwise be able to, which is essential for numerous behaviors, including prediction, navigation, and social interaction. In addition, stability can be used as an indicator of task completion, such as in the case of stimulus categorization (Hopfield, 1982). As well, stability can make the system more robust (i.e., more resistant to undesirable perturbations). Because these networks are constructed so as to have only a certain set of stable states, random perturbations to nearby states can quickly dissipate to a stable state. As a result, attractor networks have been shown to be effective for noise reduction (Pouget, Zhang, Deneve, & Latham, 1998). Similarly, attractors over a series of states (e.g., cyclic attractors) can be used to robustly support repetitive behaviors such as walking, swimming, flying, or chewing. Given these kinds of useful computational properties and their natural analogs in biological behavior, it is unsurprising that attractor networks have become a staple of computational neuroscience. More than this, as the complexity of computational models continues to increase, attractor networks are likely to form important subnetworks in larger models. This is because the ability of attractor networks to, for example, categorize, filter noise, and integrate signals makes them good candidates for being some of the basic building blocks of complex signal processing systems. As a result, the networks described here should prove useful for a wide class of more complex models. To maintain consistency, all of the results of subsequent models were generated using networks of leaky integrate-and-fire neurons with absolute refractory periods of 1 ms, membrane time constants of 10 ms, and synaptic time constants of 5 ms. Intercepts and maximum firing rates were chosen from even distributions. 
The intercept intervals are normalized over [−1, 1] unless otherwise specified. (For a detailed discussion of the effects of changing these parameters, see Eliasmith & Anderson, 2003.) For each example presented below, the presentation focuses specifically on the construction of the relevant model. As a result, there is minimal discussion of the justification for mapping particular kinds of attractors onto various neural systems and behaviors, although references are provided. 3.1 Line Attractor. The line attractor, or neural integrator, has recently been implicated in decision making (Shadlen & Newsome, 2001; Wang, 2002), but is most extensively explored in the context of oculomotor control (Fukushima, Kaneko, & Fuchs, 1992; Seung, 1996; Aksay, Gamkrelidze, Seung, Baker, & Tank, 2001). It is interesting to note that the terms line attractor and neural integrator actually describe different aspects of the network. In particular, the network is called an integrator because the low-dimensional variable (e.g., horizontal eye position) x(t) describing the network’s output reflects the integration of the input signal (e.g., eye movement velocity) u(t) to the system. In contrast, the network is called a line attractor because in
Controlling Spiking Attractor Networks
1283
Figure 2: Line attractor network architecture. The underline denotes variables that are part of the neuron-level description. The remaining variables are part of the higher-level description.
the high-dimensional activity space of the network (where the dimension is equal to the number of neurons in the network), the organization of the system collapses network activity to lie on a one-dimensional subspace (i.e., a line). As a result, only input that moves the network along this line changes the network’s output. In a sense, then, these two terms reflect a difference between what can be called higher-level and neuron-level descriptions of the system (see Figure 2). As modelers of the system, we need a method that allows us to integrate these two descriptions. Adopting the principles outlined earlier does precisely this. Notably, the resulting derivation is simple and is similar to that presented in Eliasmith and Anderson (2003). However, all of the steps needed to generate the far more complex circuits discussed later are described here, so it is useful as an introduction (and referred to for some of the subsequent derivations). We can begin by describing the higher-level behavior as integration, which has the state equation

ẋ(t) = Ax(t) + Bu(t)  (3.1)

or, in the Laplace domain,

x(s) = (1/s)[Ax(s) + Bu(s)],  (3.2)

where A = 0 and B = 1. Given principle 3, we can determine A′ and B′, which are needed to implement this behavior in a system with neural dynamics defined by h′(t) (see equation 2.6). The result is B′ = τ, A′ = 1,
Figure 3: Sample tuning curves for a population of neurons used to implement the line attractor (firing rate in Hz plotted against x over [−1, 1]). These are the equivalent steady-state tuning curves of the spiking neurons used in this example. They are found by solving the differential equations for the leaky integrate-and-fire neuron assuming a constant input current and are described by a_j(x) = [τ_j^ref − τ_j^RC ln(1 − J^threshold/(α_j x + J_j^bias))]^−1.
where τ is the time constant of the PSC of neurons in the population representing x(t). To use this description in a neural model, we must define the representation of the state variable of the system, x(t). Given principle 1, let us define this representation using the following encoding and decoding:

a_j(t) = G_j[α_j x(t)φ̃_j + J_j^bias]  (3.3)

and

x̂(t) = Σ_j a_j(t)φ_j^x.  (3.4)
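The mapping from (A, B) to (A′, B′) used above can be sketched as follows, assuming the standard NEF result for a first-order PSC h′(t) = (1/τ)e^(−t/τ) (equation 2.6 is taken to have this form): A′ = τA + I and B′ = τB.

```python
import numpy as np

def neural_dynamics(A, B, tau):
    """Map desired dynamics x' = Ax + Bu onto recurrent and input matrices
    (A', B') for a first-order synaptic filter h(t) = exp(-t/tau)/tau."""
    A = np.atleast_2d(np.asarray(A, dtype=float))
    B = np.atleast_2d(np.asarray(B, dtype=float))
    return tau * A + np.eye(A.shape[0]), tau * B

# the integrator of equation 3.1 (A = 0, B = 1) with a 5 ms synapse
Ap, Bp = neural_dynamics(0.0, 1.0, tau=0.005)    # gives A' = 1, B' = tau
```

The same helper applies unchanged to the multidimensional attractors considered later, since A and B may be matrices.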
Note that the encoding weight φ̃_j plays the same role as the encoding vector in equation 2.2 but is simply ±1 (for on and off neurons) in the scalar case. Figure 3 shows a population of neurons with this kind of encoding. Let us also assume an analogous representation for u(t). Working in the time domain, we can take our description of the dynamics,

x(t) = h′(t) ∗ [A′x(t) + B′u(t)],
and substitute it into equation 3.3 to give

a_j(t) = G_j[α_j φ̃_j (h′(t) ∗ [A′x(t) + B′u(t)]) + J_j^bias].  (3.5)
Substituting our decoding equation 3.4 into 3.5 for both populations gives

a_j(t) = G_j[α_j φ̃_j (h′(t) ∗ [A′ Σ_i a_i(t)φ_i^x + B′ Σ_k b_k(t)φ_k^u]) + J_j^bias]  (3.6)
       = G_j[h′(t) ∗ (Σ_i ω_ji a_i(t) + Σ_k ω_jk b_k(t)) + J_j^bias],  (3.7)
where ω_ji = α_j A′ φ_i^x φ̃_j and ω_jk = α_j B′ φ_k^u φ̃_j are the recurrent and input connection weights, respectively. Note that i is used to index population activity at the previous time step,1 and G_i is a spiking nonlinearity. It is important to keep in mind that the temporal filtering is done only once, despite this notation. That is, h′(t) is the same filter as that defining the decoding of both x(t) and u(t). More precisely, this equation should be written as
Σ_n δ_j(t − t_n) = G_j[Σ_{i,n} ω_ji h_i(t) ∗ δ_i(t − t_n)  (3.8)
+ Σ_{k,n} ω_jk h_k(t) ∗ δ_k(t − t_n) + J_j^bias].  (3.9)
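The weight definitions above are outer products of encoders and decoders. A sketch with stand-in decoder values (in the model itself the decoders come from the least-squares procedure in the appendix):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, tau = 40, 30, 0.005            # x population, u population, synaptic tau

alpha = rng.uniform(0.5, 2.0, n)     # gains (illustrative values)
enc = rng.choice([-1.0, 1.0], n)     # encoders phi-tilde_j
dec_x = rng.normal(0.0, 1.0 / n, n)  # stand-in decoders phi^x_i
dec_u = rng.normal(0.0, 1.0 / m, m)  # stand-in decoders phi^u_k

Ap, Bp = 1.0, tau                    # integrator values from the derivation above

# omega_ji = alpha_j A' phi^x_i phi-tilde_j; omega_jk = alpha_j B' phi^u_k phi-tilde_j
W_rec = np.outer(alpha * enc * Ap, dec_x)   # (n, n) recurrent weights
W_in = np.outer(alpha * enc * Bp, dec_u)    # (n, m) input weights
```

Because each weight matrix is an outer product, it has rank 1 in this scalar example (rank D for a D-dimensional representation), which is what collapses the high-dimensional activity onto the low-dimensional attractor subspace.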
The dynamics of this system when h_i(t) = h_k(t) are as written in equation 3.5, which is the case of most interest, as it best approximates a true integrator. Nevertheless, the filters do not have to be equal, and they model a broader class of dynamics when this is included in the higher-level analysis. For completeness, we can write the subthreshold dynamical equations for an individual LIF neuron voltage V_j(t) in this population as follows:

dV_j/dt = −(1/τ_j^RC)[V_j − R_j(Σ_{i,n} ω_ji h_i(t) ∗ δ_i(t − t_n)  (3.10)
+ Σ_{k,n} ω_jk h_k(t) ∗ δ_k(t − t_n) + J_j^bias)],  (3.11)
1 In fact, there are no discrete time steps, since this is a continuous system. However, the PSC effectively acts as a time step, as it determines the length of time that previous information is available.
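A minimal Euler integration of the subthreshold dynamics in equations 3.10 and 3.11, with a constant input current J standing in for the filtered spike trains (R_j folded into the current units), reproduces the analytic steady-state rate:

```python
import numpy as np

def simulate_lif(J, T=1.0, dt=1e-4, tau_rc=0.010, tau_ref=0.001, v_th=1.0):
    """Euler integration of dV/dt = -(V - J)/tau_rc with threshold/reset and
    an absolute refractory period; returns the firing rate over T seconds."""
    v, refrac, spikes = 0.0, 0.0, 0
    for _ in range(int(T / dt)):
        if refrac > 0.0:
            refrac -= dt
            continue
        v += dt * (J - v) / tau_rc
        if v > v_th:
            spikes += 1
            v, refrac = 0.0, tau_ref
    return spikes / T

rate = simulate_lif(J=2.0)   # analytic rate is 1/(tau_ref - tau_rc*ln(1 - 1/2)), about 126 Hz
```

With J below threshold the voltage converges without ever spiking, matching the rectified region of the tuning curves.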
Figure 4: (A) The decoded input u(t), (B) the decoded integration x(t) of a spiking line attractor with 200 neurons under 10% noise, and (C) spike rasters of a third of the neurons in the population.
where τ jRC = R j C j , R j is the membrane resistance, and C j the membrane capacitance. As usual, the spikes are determined by choosing a threshold voltage for the LIF neuron (Vth ) and placing a spike when Vj > Vth . In our ref models, we also include a refractory time constant τ j , which captures the absolute refractory period observed in real neurons. Figure 4 shows a brief sample run for this network. To gain insight into the network’s function as both an attractor and an integrator, it is important to derive measures of the network’s behavior. This has already been done to some extent for line attractors, so I will not discuss such measures here (Seung et al., 2000; Eliasmith & Anderson, 2003). What these analyses make clear, however, is how higher-level properties, such as the effective time constant of the network, are related to neuron-level properties, such as membrane and synaptic time constants. Because the previous derivation is part of a general method for building more complex attractor networks (as I discuss next), it becomes evident how these same analyses can apply in the more complex cases. This is a significant benefit of generating models with a set of unified principles. More important from a practical standpoint, constructing this network by employing control theory makes it evident how to control some of the high-level properties, such as the effective network time constant (see section 4). It is this kind of control that begins to make clear how important such simple networks are for understanding neural signal processing. 3.2 Plane Attractor. Perhaps the most obvious generalization of a line attractor is to a plane attractor, that is, from a one-dimensional to a multidimensional attractor. In this section, I perform this generalization. However, to demonstrate the representational generality of the NEF, I consider plane
attractors in the context of function representation (rather than vector representation). This has the further benefit of demonstrating how the ubiquitous bump attractor networks relate to the methods described here. Networks that have sustained gaussian-like bumps of activity have been posited in various neural systems, including frontal working memory areas (Laing & Chow, 2001; Brody et al., 2003), the head direction system (Zhang, 1996; Redish, 1999), visual feature selection areas (Hansel & Sompolinsky, 1998), arm control systems (Snyder, Batista, & Andersen, 1997), and the path integration system (Conklin & Eliasmith, 2005). The prevalence of this kind of attractor suggests that it is important to account for such networks in a general framework. However, it is not immediately obvious how the stable representation of functions relates to either vector or scalar representation as I have so far described it. Theoretically, continuous function representation demands an infinite number of degrees of freedom. Neural systems, of course, are finite. As a result, it is natural to assume that understanding function representation in neural systems can be done using a finite basis for the function space accessible by that system. Given a finite basis, the finite set of coefficients of such a basis determines the function being represented by the system at any time. As a result, function representation can be treated as the representation of a vector of coefficients over some basis, that is,
x(ν; x) = Σ_{m=1}^{M} x_m Φ_m(ν),  (3.12)
where ν is the dimension the function is defined over (e.g., spatial position), x is the vector of M coefficients x_m, and Φ_m are the set of M orthonormal basis functions needed to define the function space. Notably, this basis does not need to be accessible in any way by the neural system itself; it is merely a way for us to conveniently write the function space that can be represented by the system. The neural representation depends on an overcomplete representation of this same space, where the overcomplete basis is defined by the tuning curves of the relevant neurons. More specifically, we can define the encoding of a function space analogous to that for a vector space by writing

a_i(x(ν; x)) = a_i(x) = G_i[α_i ⟨x(ν; x)φ̃_i(ν)⟩_ν + J_i^bias].  (3.13)
Here, the encoded function x(ν; x) (e.g., a gaussian-like bump) and the encoding function φ˜ i (ν), which is inferred from the tuning curve, determine the activity of the neuron. Because of the integration over ν in this encoding, it is only changes in the coefficients x that affect neural firing, again making
it clear that we can treat neural activity as encoding a vector of coefficients. The decoding for function representation is, as expected,

x̂(ν; x) = Σ_i a_i(x)φ_i(ν),  (3.14)
where the decoders φi (ν) can be determined like the vector decoders discussed earlier. Having defined the function representation in this way, we are in a position to explicitly relate it to an equivalent vector representation. This is important because it allows us to use the control-theoretic techniques discussed in section 2.3 to define the dynamics of the representation. Let us begin by writing the decoders φi (ν) using the orthonormal basis that defines the function space x(ν; x),
φ_i(ν) = Σ_{m=1}^{M} q_im Φ_m(ν),
where q_im are elements of the matrix of coefficients defining each of the i decoders with respect to the basis Φ_m(ν). This ensures that the representation of a given function will not lie outside the original function space. Similarly, the encoding functions φ̃_i(ν) should encode functions x(ν; x) only in such a way that they can be decoded by these decoders, so we may assume
φ̃_i(ν) = Σ_{m=1}^{M} q̃_im Φ_m(ν).  (3.15)
Together, these definitions determine the equivalent of the function representation in a vector space. In particular, the encoding is given by

a_i(x) = G_i[α_i ⟨Σ_{n,m} x_m Φ_m(ν) q̃_in Φ_n(ν)⟩_ν + J_i^bias]
       = G_i[α_i Σ_{n,m} x_m q̃_in δ_nm + J_i^bias]
       = G_i[α_i Σ_m x_m q̃_im + J_i^bias]
       = G_i[α_i ⟨x q̃_i⟩_m + J_i^bias],
Figure 5: The orthonormal bases for the representational spaces for (A) the ring attractor in the head direction system and (B) LIP working memory. Note that the former is cyclic and the latter is not.
and the decoding is given by

x̂ = Σ_i a_i(x)q_i.
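The collapse of the double sum via the orthonormality relation ⟨Φ_n(ν)Φ_m(ν)⟩_ν = δ_nm can be checked numerically. The three-term Fourier basis and the particular coefficient values below are arbitrary choices for illustration:

```python
import numpy as np

nu = np.linspace(-np.pi, np.pi, 2001)
dnu = nu[1] - nu[0]

# a small orthonormal basis on [-pi, pi] (the cyclic case, as for the ring attractor)
basis = np.stack([np.ones_like(nu) / np.sqrt(2 * np.pi),
                  np.cos(nu) / np.sqrt(np.pi),
                  np.sin(nu) / np.sqrt(np.pi)])

G = basis @ basis.T * dnu              # Gram matrix <Phi_n Phi_m>_nu, approx identity
x = np.array([0.3, -1.2, 0.7])         # coefficients of the represented function
q_tilde = np.array([0.5, 1.0, -0.4])   # coefficients of one encoding function (hypothetical)

func = x @ basis                       # x(nu; x)
enc_func = q_tilde @ basis             # phi-tilde_i(nu)
inner_integral = np.sum(func * enc_func) * dnu   # <x(nu;x) phi-tilde_i(nu)>_nu
inner_coeffs = x @ q_tilde                       # sum_m x_m q-tilde_im
```

Up to quadrature error, the integral over ν equals the dot product of the coefficient vectors, which is why neural encoding of a function reduces to encoding of a finite vector.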
Essentially, these equations simply convert the encoding and decoding functions into their equivalent encoding and decoding vectors in the function space whose dimensions are determined by m . Our description of a neural system in this space will have all the same properties as it did in the original function space. The advantage, as already mentioned, is that control theory is more easily applied to finite vector spaces. When we introduce control in section 4, we demonstrate this advantage in more detail. To see the utility of this formulation, let us consider two different neural systems in parallel: working memory in lateral intraparietal (LIP) cortex and the ring attractor in the head direction system.2 These can be considered in parallel because the dominant dynamics in both systems are the same. That is, just like the integrator, LIP working memory and the ring attractor maintain a constant value with no input: x˙ = 0. The difference between these systems is that LIP working memory is sometimes taken to have a bounded domain of representation (e.g., between ±60 degrees from midline), whereas the ring attractor has a cyclic domain of representation. Given equation 3.12, this difference will show up as a difference in the orthonormal basis (ν) that spans the representational space. Figure 5 shows this difference.
2 The first of these is considered in Eliasmith & Anderson (2003), although without the integration of the data as described here.
Figure 6: Samples of the (A) encoding functions and (B) resultant tuning curves for LIP neurons based on the data provided in Platt and Glimcher (1998).
In fact, these orthonormal basis functions can be inferred from the neural data. To do so, we first need to construct the neural encoding functions φ˜ i (ν) by looking at the experimentally determined tuning curves of the neurons. That is, we can generate a population of neurons with tuning curves that have the same distribution of widths and heights as observed in the system we are modeling and then use the encoding functions necessary for generating that population as an overcomplete basis for the representational space. We can then determine, using singular value decomposition, the orthonormal basis that spans that same space and use it as our orthonormal basis, (ν). For example, using data from Platt and Glimcher (1998), we have applied this method to get the tuning curves shown in Figure 6A, the encoders shown in Figure 6B, and thus arrive at the orthonormal basis shown in Figure 5B for LIP representation. Given an orthonormal basis, we now need to determine which set of coefficients x on that basis is relevant for the neural system of interest. The standard representations in both LIP working memory and the ring attractor are generally characterized as gaussian-like bumps, at any position. However, in LIP, there is evidence that this bump can be further parameterized by nonpositional dimensions (Sereno & Maunsell, 1998). This kind of parametric working memory has also been found in frontal systems (Romo, Brody, Hern´andez, & Lemus, 1999). So the representation in LIP will be gaussian-like bumps, but of varying heights. Ideally, we need to specify some probability density ρ(x) on the coefficients that appropriately picks out just the gaussian-like functions centered at every value of ν (and those of various heights for the LIP model). This essentially specifies the range of functions that are permissible in our function space. It is clearly undesirable to have all functions that can be represented by the orthonormal basis as candidate representations. 
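A sketch of this procedure, with gaussian-like encoding functions whose centers and widths are stand-ins for curves fit to experimental data:

```python
import numpy as np

rng = np.random.default_rng(2)
nu = np.linspace(-1, 1, 400)          # bounded domain, as for LIP

# gaussian-like encoding functions with assumed centers and widths
centers = rng.uniform(-1, 1, 100)
widths = rng.uniform(0.2, 0.5, 100)
enc_funcs = np.exp(-(nu[None, :] - centers[:, None]) ** 2 / (2 * widths[:, None] ** 2))

# SVD of the overcomplete set: the right singular vectors give an orthonormal
# basis spanning (numerically) the same function space
U, S, Vt = np.linalg.svd(enc_funcs, full_matrices=False)
k = int(np.sum(S / S[0] > 1e-3))      # keep the significant dimensions
Phi = Vt[:k]                          # orthonormal basis Phi_m(nu)
```

Projecting the encoding functions back onto Phi recovers them almost exactly, confirming that the truncated orthonormal basis spans the representational space.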
Figure 7: Simulation results for (A) LIP working memory encoding two different bump heights at two locations and (B) a head direction ring attractor. The top graph shows the decoded function representation and the bottom graph the activity of the neurons in the population. (The activity plots are calculated from spiking data using a 20 ms time window with background activity removed and are smoothed by a 50-point moving average.) Both models use 1000 spiking LIF neurons with 10% noise added.
Given ρ(x), we can
use the methods described in the appendix to find the decoders. In particular, the average over x in equation A.2 is replaced by an average over ρ(x). In practice, it can be difficult to compactly define ρ(x), so it is often convenient to use a Monte Carlo method for approximating this distribution when performing the average, which we have done for these examples. Having fully defined the representational space for the system, we can apply the methods described earlier for the line attractor to generate a fully recurrent spiking model of these systems. The resulting behaviors of these two models are shown in Figure 7. Note that although it is natural to interpret the behavior of this network as a function attractor (as demonstrated by the activity of the population of neurons), the model can also be understood as implementing a (hyper-)plane attractor in the x vector space. Being able to understand the models of two different neural systems with one approach can be very useful. This is because we can transfer analyses
of one model to the other with minimal effort. As well, it highlights possible basic principles of neural organization (in this case integration). These observations provide a first hint that the NEF stands to unify diverse descriptions of neural systems and help identify general principles underlying the dynamics and representations in those systems. 3.3 Cyclic Attractor. The kinds of attractors presented to this point are, in a sense, static because once the system has settled to a stable point, it will remain there unless perturbed. However, there is another broad class of attractors with dynamic, periodic stability. In such cases, settling into the attractor results in a cyclic progression through a closed set of points. The simplest example of this kind of attractor is the ideal oscillator. Because cyclic attractors are used to describe oscillators and many neural systems seem to include oscillatory behavior, it is natural to use cyclic attractors to describe oscillatory behavior in neural systems. Such behavior may include any repetitive motion, such as walking, swimming, flying, or chewing. The natural mapping between oscillators and repetitive behavior is at the heart of most work on central pattern generators (CPGs; Selverston, 1980; Kopell & Ermentrout, 1998). However, this work typically characterizes oscillators as interactions between only a few neighboring neurons. In contrast, the NEF can help us in understanding cyclic attractors at the network level. Comparing the results of an NEF characterization with that of the standard approach to CPGs shows that there are advantages to the higher-level NEF characterization. To effect this comparison, let us extend a previously described model of lamprey swimming (for more details on the mechanical model, see Eliasmith & Anderson 2000, 2003).3 Later, we extend this model by introducing control. When the lamprey swims, the resulting motion resembles a standing wave of one period over the lamprey’s length. 
The tensions T in the muscles needed to give rise to this motion can be described by T(z, t) = κ(sin(ωt − kz) − sin(ωt)),
(3.16)
where κ = γηωA/k, k = 2π/L, A = 1 is the wave amplitude, η = 1 is the normalized viscosity coefficient, γ = 1 is the ratio of intersegmental and vertebrae length, L = 1 is the length of the lamprey, and ω is the swimming frequency. As for the LIP model, we can define an orthogonal representation of the dynamic pattern of tensions in terms of the coefficients xn (t) and the
3 Unlike the previous model, this one includes noisy spiking neurons. Parameters for these neurons and their distribution are based on data in el Manira, Tegner, & Grillner (1994). This effectively demonstrates that the bursting observed in lamprey spinal cord is observed in the model as well.
harmonic functions Φ_n(z):

T̂(z, t; x) = κ(x_0 + Σ_{n=1}^{N} [x_{2n−1}(t) sin(2πnz) + x_{2n}(t) cos(2πnz)]).
The appropriate x coefficients are found by setting the mean square error between T̂(z, t; x) and T(z, t) to be zero. Doing so, we find that x_0(t) = − sin(ωt), x_1(t) = − cos(ωt), x_2(t) = sin(ωt), and for n > 2, x_n(t) = 0. This defines the representation in a higher-level function space, whose dynamics we can implement by describing the dynamics of the coefficients, x. In this case, it is evident that the coefficients x_0 and x_1 implement a standard oscillator. The coefficient x_2 is an additional counterphase sine wave. This additional term simply tilts the two-dimensional cyclic attractor in phase space, so we essentially have just a standard oscillator. We can write the control equations as usual, ẋ = Ax, where
A = ⎡ 0   ω   0⎤
    ⎢−ω   0   0⎥  (3.17)
    ⎣ 0  −ω   0⎦
for some frequency ω. Before we embed this control description into a neural population, it makes sense to take into account the known anatomical structure of the system we are describing. In the case of the lamprey, we know that the representation of the tension T is spread down the length of the animal in a series of 100 or so segments (Grillner, Wallén, Brodin, & Lansner, 1991). As a result, we can define a representation that is intermediate between a neural representation and the orthogonal representation that captures this structure. In particular, let us define an overcomplete representation along the length z with gaussian-like encoding functions. The encoding into this intermediate representation is thus

b_j(t) = ⟨φ̃_j(z)T(z, t)⟩_z,

and the decoding is

T̂(z, t) = Σ_j b_j(t)φ_j(z).
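Euler-integrating ẋ = Ax from the coefficients found above, x(0) = (−sin 0, −cos 0, sin 0) = (0, −1, 0), confirms that the third coefficient simply mirrors the first (the "tilt" of the cyclic attractor). Here ω = 2π (a 1 Hz swim cycle) is an arbitrary choice:

```python
import numpy as np

omega = 2 * np.pi               # assumed 1 Hz swimming frequency
A = np.array([[0.0, omega, 0.0],
              [-omega, 0.0, 0.0],
              [0.0, -omega, 0.0]])

dt, T = 1e-4, 2.0
x = np.array([0.0, -1.0, 0.0])  # (x0, x1, x2) at t = 0
xs = []
for _ in range(int(T / dt)):
    x = x + dt * (A @ x)        # Euler step of x' = Ax
    xs.append(x.copy())
xs = np.array(xs)
```

After two full periods the state returns (up to small Euler drift) to its initial value, and x_2(t) = −x_0(t) throughout, so the system is indeed just a standard oscillator tilted in phase space.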
This representation is not essential, but has one very useful property: it allows us to simulate some parts of the model at the neural level and other parts at this intermediate level, resulting in significant computational savings while selectively simplifying the model. Of course, to use this representation, we need to associate the intermediate representation to both the neural and orthogonal representations. The relation to the neural representation is defined by the standard neural representation described in section 2.2, with the encoding given by

δ_i(t − t_in) = G_i[α_i b_j φ̃_i + J_i^bias]

and the decoding by

b̂_j = Σ_{i,n} h_ij(t − t_n)φ_i.
Essentially, these equations describe how the intermediate population activity b_j is related to the actual neurons, indexed by i, in the various populations along the length of the lamprey.4 To relate the intermediate and orthogonal spaces, we can use the projection operator Ω = [φ̃]. That is, we can project the control description into the intermediate space as follows:

ẋ = Ax
Ω⁻¹ḃ = AΩ⁻¹b
ḃ = ΩAΩ⁻¹b = A_b b,

where A_b = ΩAΩ⁻¹. Having provided these descriptions, we can now selectively convert segments of the intermediate representation into spiking neurons to see how single cells perform in the context of the whole, dynamic spinal cord. Figure 8A shows single cells that burst during swimming, and Figure 8B shows the average spike rate of an entire population of neurons in a segment. Given this characterization of neural representation, these graphs reflect different levels of description of the neural activity during the same simulation. It should be clear that this model does not adopt the standard CPG approach to modeling this system. As a result, the question arises as to whether
4 Because the lamprey spinal cord is effectively continuous, assignment of neurons to particular populations is somewhat arbitrary, although constrained by the part of the lamprey over which they encode muscle tension. So the resulting model is similarly continuous.
Figure 8: Results from the lamprey modeled as a cyclic attractor. The middle segment of the lamprey was modeled at the single cell level by a population of 200 neurons under 10% noise. (A) The spikes from 20 (10 left, 10 right) single cells in the population. (B) The average rate in the two (left and right) subpopulations.
the results of this simulation match the known neuroanatomy as well as the CPG approach. While there is much that remains unknown about the lamprey, these three constraints on the anatomy are well established:
1. Connectivity is mostly local but spans several segments.
2. Connectivity is asymmetric.
3. Individual neurons code muscle tension over small regions.
By introducing the intermediate level of representation, we have enforced the third constraint explicitly. Looking at the intermediate-level weight matrix for this system shown in Figure 9, we can see that constraint 2 clearly holds and that constraint 1 is approximately satisfied. The NEF can be used to embed high-level descriptions of cyclic attractors into biologically plausible networks that are consistent with the relevant anatomical constraints. Undoubtedly, different systems will impose different anatomical constraints, but the NEF methods are clearly not determined by the particular constraints found in the lamprey. Equally important, as discussed in section 4, because the NEF allows the integration of control into the high-level description, it is straightforward to characterize (and enforce) essential high-level properties like stability and controllability in the generated models. This has often proved a daunting task for the standard bottom-up CPG approach (Marder, Kopell, & Sigvardt, 1997). So, again, it is the ease with which this model can be extended to account for important but complex behaviors that demonstrates the utility of the NEF.
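The projection between the orthogonal and intermediate spaces used above can be sketched with random full-rank entries for Ω (in the model, Ω's entries would come from the gaussian encoders over z); since Ω is not square, a pseudoinverse stands in for Ω⁻¹:

```python
import numpy as np

rng = np.random.default_rng(3)
M, J = 3, 20                      # orthonormal dims, intermediate (segment) dims
omega = 2 * np.pi

A = np.array([[0.0, omega, 0.0],
              [-omega, 0.0, 0.0],
              [0.0, -omega, 0.0]])

# projection operator Omega with b = Omega x; random entries illustrate the algebra
Omega = rng.normal(size=(J, M))

# b' = Omega A Omega^{-1} b, with the Moore-Penrose pseudoinverse for Omega^{-1}
A_b = Omega @ A @ np.linalg.pinv(Omega)
```

The consistency condition A_b Ω = Ω A holds whenever Ω has full column rank, so the intermediate-level dynamics exactly reproduce the orthogonal-level dynamics on the represented subspace.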
Neuron Figure 9: The connectivity matrix between segments in the lamprey model. Connectivity is asymmetric and mostly local, in agreement with the known anatomy. The darker the image, the stronger the weights. Zero weights are the large gray-hatched areas.
3.4 Chaotic Attractor. The final class of attractors considered is also dynamic but, unlike the cyclic attractors, not periodic. Instead, any nearby trajectories in chaotic (or strange) attractors diverge exponentially over time. Nevertheless, they are attractors in that there is a bounded subspace of the state space toward which trajectories, regardless of initial conditions, tend over time. In the context of neurobiological systems, there have been some suggestions that chaos or chaotic attractors can be useful for describing certain neural systems (Matsugu, Duffin, & Poon, 1998; Kelso & Fuchs, 1995; Skarda & Freeman, 1987). For example, Skarda and Freeman (1987) suggest that the olfactory bulb, before odor recognition, rests in a chaotic state. The fact that the state is chaotic rather than merely noisy permits more rapid convergence to limit cycles that aid in the recognition of odors. These kinds of information processing effects themselves are well documented. For instance, a number of practical control problems can be more efficiently solved if a system can exploit chaotic attractors effectively (Bradley, 1995). However, the existence of chaos in neural systems is subject to much debate (Lai, Harrison, Frei, & Osorio, 2003; Biswal & Dasgupta, 2002).
As a result, we consider chaotic attractors here largely for the purposes of completeness, that is, to show that this approach is general enough to capture such phenomena, should they exist. For this example, we have chosen to use the familiar Lorenz attractor, described by:
⎡ẋ1⎤   ⎡−a    a     0⎤ ⎡x1⎤
⎢ẋ2⎥ = ⎢ b   −1   −x1⎥ ⎢x2⎥.  (3.18)
⎣ẋ3⎦   ⎣x2    0    −c⎦ ⎣x3⎦
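A direct (non-spiking) Euler simulation of equation 3.18 serves as a reference for what the network is meant to implement:

```python
import numpy as np

a, b, c = 10.0, 28.0, 8.0 / 3.0

def lorenz_rhs(x):
    """l(x): the state-dependent matrix of equation 3.18, written out."""
    return np.array([a * (x[1] - x[0]),
                     b * x[0] - x[1] - x[0] * x[2],
                     x[0] * x[1] - c * x[2]])

dt = 1e-3
x = np.array([1.0, 1.0, 1.0])
traj = []
for _ in range(20000):                 # 20 s of simulated time
    x = x + dt * lorenz_rhs(x)         # Euler step (RK4 would be more accurate)
    traj.append(x.copy())
traj = np.array(traj)
```

The trajectory remains bounded while repeatedly switching between the two lobes of the butterfly, the signature of the Lorenz attractor seen in Figure 10C.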
If a = 10, b = 28, and c = 8/3, this system of equations gives the well-known butterfly chaotic attractor. It is clear from this set of equations that the system to be considered is nonlinear. So, unlike the previous examples, we need to compute nonlinear functions of the state variables, meaning this is not an LTI system. As discussed in more detail in section 4.1, there are various possible architectures for computing the necessary transformations. Here, we compute the necessary cross-terms by extracting them directly from the population representing the vector space x. Specifically, we can find decoding vectors for x1x3 (i.e., φ^{x1x3}) and x1x2 (i.e., φ^{x1x2}) using the method discussed in the appendix where, for example, f(x) = x1x3. These decoding vectors can be used to provide an expression for the recurrent updating of the population’s firing rate,

a_i(x) = G_i[α_i φ̃_i l(x) + J_i^bias],  (3.19)
where the vector function l(x) is defined by the Lorenz equations in equation 3.18. Substituting the appropriate neural-level characterizations of this transformation into equation 3.19 gives

a_i(x) = G_i[Σ_j (−ω_ij^{ax1} + ω_ij^{ax2} + ω_ij^{bx1} − ω_ij^{x2} − ω_ij^{x1x3} + ω_ij^{x1x2} − ω_ij^{cx3}) a_j(x) + J_i^bias],

where ω_ij^{ax1} = α_i φ̃_{i,1} a φ_j^{x1}, ω_ij^{ax2} = α_i φ̃_{i,1} a φ_j^{x2}, ω_ij^{bx1} = α_i φ̃_{i,2} b φ_j^{x1}, ω_ij^{x2} = α_i φ̃_{i,2} φ_j^{x2}, ω_ij^{x1x3} = α_i φ̃_{i,2} φ_j^{x1x3}, ω_ij^{x1x2} = α_i φ̃_{i,3} φ_j^{x1x2}, and ω_ij^{cx3} = α_i φ̃_{i,3} c φ_j^{x3}. As usual, the connection weights are found by combining the encoding and decoding vectors as appropriate. Note that despite implementing a higher-level nonlinear system, there are no multiplications between neural activities in the neural-level description. This demonstrates that the neural nonlinearities alone result in the nonlinear behavior of the network. That is, no additional nonlinearities (e.g., dendritic nonlinearities) are needed to give rise to this behavior.
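The idea of decoding a cross-term such as x1x3 directly from the population can be sketched with regularized least squares. The rectified-linear rates below are a stand-in for the LIF tuning curves, and the radius R is an assumed range for the represented state:

```python
import numpy as np

rng = np.random.default_rng(4)
N = 300                                   # neurons
R = 30.0                                  # assumed radius of the represented box

# random 3D unit encoders; rectified-linear rates stand in for LIF curves
enc = rng.normal(size=(N, 3))
enc /= np.linalg.norm(enc, axis=1, keepdims=True)
gain = rng.uniform(0.5, 2.0, N)
bias = rng.uniform(-0.5, 0.5, N)

def rates(X):
    return np.maximum(0.0, gain * (X @ enc.T) / R + bias)

# sample the state space and solve regularized least squares for phi^{x1 x3}
X = rng.uniform(-R, R, size=(5000, 3))
A_mat = rates(X)
target = X[:, 0] * X[:, 2]
reg = 1e-3 * len(X)
dec = np.linalg.solve(A_mat.T @ A_mat + reg * np.eye(N), A_mat.T @ target)

est = rates(X) @ dec
```

Even though each neuron's rate is a function of a single linear projection of x, a weighted sum across the population recovers the product x1x3, which is what lets the recurrent weights implement the Lorenz cross-terms without neural-level multiplication.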
Figure 10: The Lorenz attractor implemented in a spiking network of 2000 LIF neurons under 10% noise. (A) The decoded output from the three dimensions. (B) Spike trains from 20 sample neurons in the population for the first second of the run display irregular firing. (C) Typical Lorenz attractor-type motions (i.e., the butterfly shape) are verified by plotting the state space. For clarity, only the last 4.5 s are plotted, removing the start-up transients.
Running this simulation in a spiking network of 2000 LIF neurons under noise gives the results shown in Figure 10. Because this is a simulation of a chaotic network under noise, it is essential to demonstrate that a chaotic system is in fact being implemented, and the results are not just noisy spiking from neurons. Applying the noise titration method (Poon & Barahona, 2001) on the decoded spikes verifies the presence of chaos in the system (p-value < 10^−15 with a noise limit of 54%, where the noise limit indicates how much more noise could be added before the nonlinearity was no longer
detectable). Notably, the noise titration method is much better at detecting chaos in time series than most other methods, even in highly noisy contexts. As a result, we can be confident that the simulated system is implementing a chaotic attractor as expected. This can also be qualitatively verified by plotting the state space of the decoded network activity, which clearly preserves the butterfly look of the Lorenz attractor (see Figure 10C). 4 Controlling Attractor Networks To this point we have demonstrated how three main classes of attractor networks can be embedded into neurobiologically plausible systems and have indicated, in each case, which specific systems might be well modeled by these various kinds of attractors. However, in each case, we have not demonstrated how a neural system could use this kind of structure effectively. Merely having an attractor network in a system is not itself necessarily useful unless the computational properties of the attractor can be taken advantage of. Taking advantage of an attractor can be done by moving the network either into or out of an attractor, moving between various attractor basins, or destroying and creating attractors within the network’s state space. Performing these actions means controlling the attractor network in some way. Some of these behaviors can be effected by simply changing the input to the network. But more generally, we must be able to control the parameters defining the attractor properties. In what follows, we focus on this second, more powerful kind of control. Specifically, we revisit examples from each of the three classes of attractor networks and show how control can be integrated into these models. For the neural integrator we show how it can be turned into a more general circuit that acts as a controllable filter. 
For the ring attractor, we demonstrate how to build a nonlinear control model that moves the current head direction estimate given a vestibular control signal and does not rely on multiplicative interactions at the neural level. In the case of the cyclic attractor, we construct a control system that permits variations in the speed of the orbit. And finally, in the case of the chaotic attractor, we demonstrate how to build a system that can be moved between chaotic, cyclic, and point attractor regimes.

4.1 The Neural Integrator as a Controllable Filter.

As described in section 3.1, a line attractor is implemented in the neural integrator in virtue of the dynamics matrix A being set to 1. While the particular output value of the attractor depends on the input, the dynamics of the attractor are controlled by A. Hence, it is natural to inquire as to what happens as A varies over time. Since A is unity feedback, it is fairly obvious what the answer to this question is: as A exceeds 1, the resulting positive feedback will cause the circuit to saturate; as A becomes less than one, the circuit begins to act as a low-pass filter, with the cutoff frequency determined by the precise
Figure 11: Two possible network architectures for implementing the controllable filter. The B' variables modify the inputs to the populations representing their subscripted variable. The A' variables modify the relevant recurrent connections. The architecture in A is considered in Eliasmith and Anderson (2003); the more efficient architecture in B is considered here.
value of A. Thus, we can build a tunable filter by using the same circuit and allowing direct control over A. To do so, we can introduce another population of neurons, d_l, that encodes the value of A(t). Because A is no longer static, the product Ax must be constantly recomputed. This means that our network must support multiplication at the higher level. The two most obvious architectures for building this computation into the network are shown in Figure 11. Both architectures are implementations of the same high-level dynamics equation,

x(t) = h(t) ∗ (A(t)x(t) + τu(t)),     (4.1)
which is no longer LTI, as it is clearly a time-varying system. Notably, while both architectures demand multiplication at the higher level, this does not mean that there needs to be multiplication between activities at the neural level. This is because, as mentioned in section 2.2 and demonstrated in section 3.4, nonlinear functions can be determined using only linear decoding weights. As described in Eliasmith and Anderson (2003), the first architecture can be implemented by constructing an intermediate representation of the vector c = [A, x] from which the product is extracted using linear decoding. The result is then used as the recurrent input to the a_i population representing x. This circuit is successful, but performance is improved by adopting the second architecture. In the second architecture, the representation in the a_i population is taken to be a 2D representation of x in which the first element is the integrated input and the second element is A. The product is extracted directly from
this representation using linear decoding and then used as feedback. This has the advantage over the first architecture of not introducing extra delays and noise. Specifically, let x = [x1, x2] (where x1 = x and x2 = A in equation 4.1). A more accurate description of the higher-level dynamics equation for this system is

ẋ = h(t) ∗ (Ax + Bu)

[ ẋ1 ]           ( [ x2  0 ] [ x1 ]   [ τ  0 ] [ u ] )
[ ẋ2 ] = h(t) ∗ ( [ 0   0 ] [ x2 ] + [ 0  τ ] [ A ] ),     (4.2)
which makes the nonlinear nature of this implementation explicit. Notably, here the desired A is provided as input from a preceding population, as is the signal to be integrated, u. To implement this system, we need to compute the transformation

p̂ = Σ_i a_i(t) φ_i^p,

where p(t) is the product of the elements of x. Substituting this transformation into equation 3.6 gives
a_j = G_j[ α_j h ∗ φ̃_j ( Σ_i a_i(t) φ_i^p + B Σ_k b_k φ_k^u ) + J_j^bias ]
    = G_j[ Σ_i ω_ij a_i(t) + Σ_k ω_kj b_k(t) + J_j^bias ],     (4.3)

where ω_ij = α_j φ̃_j φ_i^p, ω_kj = α_j φ̃_j B φ_k^u, and
a_i(t) = h ∗ G_i[ α_i x(t)φ̃_i + J_i^bias ].

The results of simulating this nonlinear control system are shown in Figure 12. This run demonstrates a number of features of the network. In the first tenth of a second, the control signal 1 − A is nonzero, helping to eliminate any drift in the network for zero input. The control signal then goes to zero, turning the network into a standard integrator over the next two-tenths of a second, when a step input is provided to the network. The control signal is then increased to .3, rapidly forcing the integrated signal to zero. The next step input is filtered by a low-pass filter, since the control signal is again nonzero. The third step input is also integrated, as the control signal is zero. Like the first input, this input is forced to zero by increasing the control signal, but this time the decay is much slower because the control
Figure 12: The results of simulating the second architecture for a controllable filter in a spiking network of 2000 neurons under 10% noise. (A) Input signals to the network, u(t) and 1 − A'(t). (B) High-level decoded response of the spiking neural network, x1(t) and x2(t). The network encodes both the integrated result and the control signal directly, to efficiently support the necessary nonlinearity. See the text for a description of the behavior.
signal is lower (.1). These behaviors show how the control signal can be used as a reset signal (by simply making it nonzero) or as a means of determining the properties of a tunable low-pass filter. The introduction of control into the system gave us a means of radically altering the attractive properties of the system. It is only while A = 1 that we have an approximate line attractor. For positive values less than one, the system no longer acts as a line attractor, but rather as a point attractor, whose basin properties (e.g., steepness) vary with the control signal. As can be seen in equation 4.3, there is no multiplication of neural activities. There is, of course, significant debate about whether, and to what extent, dendritic nonlinearities might be able to support multiplication of neural activity (see, e.g., Koch & Poggio, 1992; Mel, 1994, 1999; Salinas & Abbott, 1996; von der Heydt, Peterhans, & Dursteler, 1991). As a result, it is useful to demonstrate that it is possible to generate circuits without
multiplication of neural activities that support network-level nonlinearities. If dendritic nonlinearities are discovered in the relevant systems, these networks would become much simpler (essentially, we would not need to construct the intermediate c population).

4.2 Controlling Bump Movement Around the Ring Attractor.

Of the two models presented in section 3.2, the role of control is most evident for the head direction system. In order to be useful, this system must be able to update its current estimate of the heading of the animal given vestibular information about changes in the animal's heading. In terms of the ring attractor, this means that the system must be able to rotate the bump of activity around the ring in various directions and at various speeds given simple left-right angular velocity commands. (For a generalization of this model to the control of a two-dimensional bump that very effectively characterizes behavioral and single-cell data from path integration in subiculum, see Conklin & Eliasmith, 2005.) To design a system that behaves in this way, we can first describe the problem in the function space x(ν) for a velocity command δ/τ and then write this in terms of the coefficients x in the orthonormal Fourier space:

x(ν; t + τ) = x(ν + δ; t) = Σ_m x_m e^{im(ν+δ)}
                          = Σ_m x_m e^{imν} e^{imδ}.

Rotation of the bump can be effected by applying the matrix E_m = e^{imδ}, where δ determines the speed of rotation. Written for real-valued functions, E becomes

    [ 1      0          0       ···  ]
E = [ 0   cos(mδ)    sin(mδ)         ]
    [ 0  −sin(mδ)    cos(mδ)         ]
    [ ⋮                          ⋱  ]
To derive the dynamics matrix A for the state equation 2.5, it is important to note that E defines the new function at t + τ , not just the change, δx. As well, we would like to control the speed of rotation, so we can introduce a scalar on [−1, 1] that changes the velocity of
rotation, with δ defining the maximum velocity. Taking these into account gives A = C(t)(E − I), where C(t) is the left-right velocity command.5 As in section 4.1, we can include the time-varying scalar variable C(t) in the state vector x and perform the necessary multiplication by extracting a nonlinear function that is the product of that element with the rest of the state vector. Doing so again means there is no need to multiply neural activities. Constructing a ring attractor without multiplication is a problem that was only recently solved by Goodridge and Touretzky (2000). That solution, however, is specific to a one-dimensional single-bump attractor, does not use spiking neurons, and does not include noise. As well, the solution posits single left-right units that together project to every cell in the population, necessitates the setting of normalization factors, and demands numerical experiments to determine the appropriate values of a number of the parameters. In sum, the solution is somewhat nonbiological and very specific to the problem being addressed. In contrast, the solution we have presented here is subject to none of these concerns: it is both biologically plausible and very general. The behavior of the fully spiking version of this model is shown in Figure 13. In this case, the control parameters simply move the system through the attractor's phase space rather than altering the phase space itself, as in the previous example. However, to accomplish the same movement using the input u(t), we would have to provide the appropriate high-dimensional vector input (i.e., the new bump position minus the old bump position). Using the velocity commands in this nonlinear control system, we need only provide a scalar input to appropriately update the system's state variables. In other words, the introduction of this kind of control greatly simplifies updating the system's current position in phase space.
5 There is more subtlety to this equation than one might think. For values of C(t) ≠ 1, the system does not behave as one might expect. For negative values, two bumps are created: one negative bump in the direction opposite the desired direction of motion and the other at the current bump location. This results in the current bump being effectively pushed away from the negative bump. For values less than one, a proportionally scaled bump is created in the location it would occupy if C(t) = 1, and a proportionally scaled bump is subtracted from the current position, resulting in proportionally scaled movement. There are two reasons this works as expected. The first is that the movements are very small, so the resulting bumps in all cases are approximately gaussian (though subtly bimodal). The second is that the attractor dynamics built into the network clean up any nongaussianity of the resulting states. The result is a network that displays bumps moving in either direction proportional to C(t), as desired.
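The action of the rotation matrix E above can be checked directly in coefficient space. The sketch below is a purely mathematical (non-spiking) illustration; the mode count, grid resolution, and bump shape are arbitrary choices for the example, not values taken from the model:

```python
import numpy as np

M = 8                                        # Fourier modes retained (illustrative)
nu = np.linspace(-np.pi, np.pi, 256, endpoint=False)

def coeffs(f):
    """Real Fourier coefficients [x_0, a_1, b_1, ..., a_M, b_M] of f(nu)."""
    c = [f.mean()]
    for m in range(1, M + 1):
        c.append(2 * (f * np.cos(m * nu)).mean())
        c.append(2 * (f * np.sin(m * nu)).mean())
    return np.array(c)

def synth(c):
    """Rebuild f(nu) from its coefficient vector."""
    f = np.full_like(nu, c[0])
    for m in range(1, M + 1):
        f += c[2*m - 1] * np.cos(m * nu) + c[2*m] * np.sin(m * nu)
    return f

def E(delta):
    """Block-diagonal rotation: identity on the DC term, and a 2x2
    [[cos, sin], [-sin, cos]] block of angle m*delta on each mode m."""
    out = np.zeros((2 * M + 1, 2 * M + 1))
    out[0, 0] = 1.0
    for m in range(1, M + 1):
        md = m * delta
        out[2*m - 1:2*m + 1, 2*m - 1:2*m + 1] = [[np.cos(md), np.sin(md)],
                                                 [-np.sin(md), np.cos(md)]]
    return out

bump = np.exp(5 * (np.cos(nu) - 1))          # idealized activity bump peaked at nu = 0
c = coeffs(bump)
rotated = synth(E(0.5) @ c)                  # coefficients rotated by delta = 0.5
```

Applying E(δ) to the coefficient vector reproduces x(ν + δ), so the bump's peak moves from ν = 0 to ν = −δ, and repeated application rotates it around the ring at a rate set by δ.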
Figure 13: The controlled ring attractor in a spiking network of 1000 neurons with 10% noise. The left-right velocity control signal is shown on the left, and the corresponding behavior of the bump (over ν ∈ [−π, π], for 2 s) is shown on the right.
4.3 Controlling the Speed of the Cyclic Attractor.

As mentioned in section 3.3, one advantage of our synthesis of top-down and bottom-up data is that it permits the inclusion of strong top-down constraints on model building. In the context of lamprey locomotion, introducing control over swimming speed and guaranteeing stable oscillation in CPG-based models were problems that had to be tackled separately, took much extra work, and resulted in solutions specific to this kind of network (Marder et al., 1997). In contrast, stability, control, and other top-down constraints can be included in the cyclic attractor model directly. In this example, we consider control over swimming speed. Given the two previous examples, we know that this kind of control can be characterized as the introduction of nonlinear or time-varying parameters into our state equations. For instance, we can make the frequency term ω in equation 3.17 a function of time. For simplicity, we will consider the standard oscillator, although it is closely related to the swimming model as discussed earlier (see Kuo & Eliasmith, 2004, for an anatomically and physiologically plausible model of zebrafish swimming with speed control). To change the speed of an oscillator, we need to implement

     [   0      ω(t) ] [ x1 ]
ẋ =  [ −ω(t)     0   ] [ x2 ]  + Bu(t).
Figure 14: The controlled oscillator in a spiking network of 800 neurons with 10% noise. The decoded signals x1, x2, and ω are shown over 3 s. The control signal varies ω, causing the oscillator to double its speed and then slow down to the original speed.
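The high-level behavior in Figure 14 can be sketched with a direct (non-spiking) simulation of the time-varying oscillator equation above. The frequency schedule here is illustrative; the integrator applies the exact per-step rotation exp(A·dt) rather than an Euler step, so the orbit radius is preserved:

```python
import numpy as np

def controlled_oscillator(omega, dt, x0=(1.0, 0.0)):
    """Integrate x' = [[0, w(t)], [-w(t), 0]] x by applying the exact
    rotation exp(A dt) at each step, which conserves ||x||."""
    x = np.array(x0)
    out = []
    for w in omega:
        cth, sth = np.cos(w * dt), np.sin(w * dt)
        x = np.array([cth * x[0] + sth * x[1],
                      -sth * x[0] + cth * x[1]])
        out.append(x)
    return np.array(out)

dt = 1e-3
omega = np.full(3000, 2 * np.pi)   # 1 Hz baseline for 3 s
omega[1000:2000] = 4 * np.pi       # double the speed for the middle second
xs = controlled_oscillator(omega, dt)
```

The accumulated phase over the run is 2π + 4π + 2π = 8π, so the state returns to its starting point, and the orbit radius stays at 1 throughout, mirroring the speed change in Figure 14.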
For this example, we can construct a nonlinear model, but unlike equation 4.2, we do not have to increase the dimension of the state space. Instead, we can increase the dimension of the input space, so that the third dimension carries the time-varying frequency signal, giving

     [   0       u3(t) ] [ x1 ]
ẋ =  [ −u3(t)     0    ] [ x2 ]  + Bu(t).     (4.4)
As before, this can be translated into a neural dynamics matrix using equation 2.7 and implemented in a neural population using methods analogous to those in section 4.1. In fact, the architecture used for the neural implementation is the same despite the alternate way of expressing the control system in equation 4.4. That is, in order to perform the necessary multiplication, the dimensionality of the space encoded by the population is increased by one. In this case, the extra dimension is assigned to the input vector rather than the state vector. It should not be surprising, then, that we can successfully implement a controlled cyclic attractor (see Figure 14). In this example, control is used to vary a property of the attractor: the period of the orbit. Because the previous two examples are static attractors,
this kind of control does not apply to them. However, it should be clear that this example adds nothing new theoretically. Nevertheless, it helps to demonstrate that the methods introduced earlier apply broadly. Notably, the introduction of this kind of control into a neural model of swimming results in connectivity that matches the known anatomy (Kuo & Eliasmith, 2004).

4.4 Moving Between Chaotic, Cyclic, and Point Attractors.

Both Skarda and Freeman (1987) and Kelso and Fuchs (1995) have suggested that being in a chaotic attractor may help to improve the speed of response of a neural system to various perturbations. In other words, they suggest that if chaotic dynamics are to be useful to a neural system, it must be possible to move into and out of the chaotic regime. Conveniently, the bifurcations and attractor structures in the Lorenz equations, 3.18, are well characterized, making them ideal for introducing the kind of control needed to enter and exit the chaotic attractor. For instance, changing the b parameter over the range [1, 300] causes the system to exhibit point, chaotic, and cyclic attractors. Constructing a neural control system with an architecture analogous to that of the controlled integrator discussed earlier would allow us to move the system between these kinds of states. However, an implementational difficulty arises. As suggested by Figure 10, the mean of the x3 variable is roughly equal to b. Thus, for widely varying values of b, the neural system will have to represent a large range of values, necessitating a very wide dynamic range with a good signal-to-noise ratio. This can be achieved with enough neurons, but it is more efficient to rewrite the Lorenz equations to preserve the dynamics but eliminate the scaling problem. To do so, we can simply subtract and add b as appropriate to remove the scaling effect. This gives

[ ẋ1 ]   [ a(x2 − x1)            ]
[ ẋ2 ] = [ bx1 − x2 − x1(x3 + b) ]
[ ẋ3 ]   [ x1x2 − c(x3 + b) − b  ]

         [ −a   a    0  ] [ x1 ]   [    0     ]
       = [  0  −1  −x1  ] [ x2 ] + [    0     ] b.
         [ x2   0   −c  ] [ x3 ]   [ −(c + 1) ]

Given this characterization of the Lorenz system, it is evident that, conveniently, the introduction of the control signal b no longer requires multiplication, making the problem simpler than the previous control examples. Implementing these equations in a spiking neural population can be done as in section 3.4. The results of simulating this network under noise are shown in Figure 15. After the start-up transient, the network displays chaotic behavior, as with
Figure 15: The controlled chaotic attractor in a spiking network of 1000 neurons with 10% noise. The decoded variables x, y, and z are plotted along with the control signal. The system clearly moves from the chaotic regime to an oscillatory one, and finally to a point attractor. Noise titration verifies this result (see the text for discussion).
no control (see Figure 10). However, in this case, it is the value of the control signal b(t) that forces the system to be in the chaotic regime. After 6 seconds, the control signal changes, moving the system to a stable limit cycle. At 8 seconds, the control signal changes again, moving the system to a fixed-point attractor. To verify that these different regimes are as described, data from each of the regimes were titrated as before. The noise titration during the chaotic regime verified the presence of the nonlinearity (p-value < 10^−15; noise limit 39%). During the limit cycle and the point attractor regimes, noise titration did not detect any nonlinearity, as expected. Interestingly, despite the highly nonlinear nature of the system itself, the kind of control that might be useful for information processing turns out to be quite simple. Unlike the previous examples, this one demonstrates how nonmultiplicative control can be very powerful. It serves to move a nonlinear system through very different attractor regimes, some linear and some not. However, unlike previous examples, it is unclear how useful this kind of network is for understanding neural systems better. Nevertheless, it serves to illustrate the generality of the NEF for understanding a wide variety of attractors and control systems in the context of constraints specific to neural systems.
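The effect of the rewriting on the represented range can be checked with a direct (non-spiking) sketch of the rewritten Lorenz equations above. The constants, initial condition, integrator, and run lengths here are illustrative choices, not taken from the article:

```python
import numpy as np

a, c = 10.0, 8.0 / 3.0   # illustrative Lorenz constants

def shifted_lorenz(x, b):
    """Component form of the rewritten Lorenz system with control input b.
    The b*x[0] terms cancel algebraically, so in the matrix form b enters
    only through the constant -(c + 1)*b entry (no extra multiplication)."""
    return np.array([a * (x[1] - x[0]),
                     b * x[0] - x[1] - x[0] * (x[2] + b),
                     x[0] * x[1] - c * (x[2] + b) - b])

def run(b, x0=(1.0, 1.0, 1.0), dt=2e-4, n=50000):
    x, out = np.array(x0), []
    for _ in range(n):
        x = x + dt * shifted_lorenz(x, b)   # simple Euler step
        out.append(x)
    return np.array(out)

chaotic = run(b=28.0)   # chaotic-range parameter: bounded, irregular orbit
settled = run(b=2.0)    # small b: the system falls into a point attractor
```

In the chaotic-range run, the mean of x3 stays near zero rather than near b, which is exactly the scaling benefit of the rewritten system; for small b, the trajectory settles onto a point attractor, mirroring the regime changes shown in Figure 15.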
5 Conclusion

We have presented several examples of biologically plausible attractor networks that cover a wide variety of attractors. These examples employ a variety of representations, including scalars, vectors, and functions. They also exemplify a variety of control systems, including linear, time-varying, and nonlinear. Some of the examples on their own are not particularly novel. What is novel is demonstrating how they relate to more complex (and novel) models. In other words, it is this wide diversity of examples that shows that the NEF can help us systematize the functionality of neural systems. That is, despite differences in tuning curves, kinds of representation, neural response properties, intrinsic dynamics, and so on, it is possible to classify various networks as variations on themes of integration, filtering, and oscillation, among others, all of which can be derived from simple, general principles. Furthermore, characterizing high-level dynamics, and how lower-level properties affect those dynamics, can provide a window into the purpose of neural subsystems. Compiling this knowledge should aid in more quickly understanding the likely functions of larger, more complex, and less familiar systems. This kind of systematization can be useful in a number of ways. For one, it suggests that perhaps the kinds of networks described here can serve as functional parts of larger networks—networks that we can construct using these same methods. One example of this is presented in Eliasmith et al. (2002), where an integrator is one of nine subnetworks used to estimate the true translational velocity of an animal given the responses of semicircular canals and otoliths to a variety of acceleration profiles. So while we have used the NEF here to construct a specific class of networks, the same methods are more broadly applicable.
A second benefit of systematization is that it supports the transfer of knowledge and analyses regarding well-understood neural systems to lesser understood ones. For instance, understanding a useful control structure for the ring attractor suggests a useful control structure for path integration, a lesser-studied and more complex neural system (Conklin & Eliasmith, 2005). More generally, if we can see the close relation between two high-level descriptions of different neural systems (e.g., working memory and the neural integrator, or path integration and the head direction system), what we have learned about one may often be translated into implications for the other. This can greatly speed up the development of novel models and focus our attention on the important characteristics of new systems (be they similarities to, or differences from, other known systems). Finally, by being able to provide the high-level characterizations of neural systems on which such systematization depends, we can carefully introduce new complexities into existing models. For instance, the recent surge of interest in the observed dynamics of working memory (Romo et al., 1999; Brody et al., 2003; Miller, Brody, & Romo, 2003) can be captured by simple
extensions of the models described earlier (Singh & Eliasmith, 2004). Again, this can greatly aid the construction of novel models—models that may be able to address more complicated phenomena than otherwise possible. (Eliasmith, 2004, presents a neuron-level model of a well-studied deductive inference task—the Wason card selection task—with 14 subsystems.) So while this discussion has focused on characterizing a class of networks that is clearly important for understanding neural systems, the methods underlying this approach have much broader application. They can help us begin to better understand the general control and routing of information through the brain in a way responsible to neural constraints. With continuing improvement in experimental techniques for examining large-scale networks, theoretical tools for understanding networks on the same scale become essential. The examples provided here hint that the NEF may be a useful attempt at beginning to develop such tools.

Appendix: Least-Squares Optimal Linear Decoders

Consider determining the optimal linear decoders φ_i in equation 2.3 under noise (see also Salinas & Abbott, 1994, and Eliasmith & Anderson, 2003). To include noise, we introduce the noise term η_i, which is drawn from an independent, identically distributed, zero-mean gaussian distribution. The noise is added to the neuron activity a_i, resulting in a decoding of

x̂ = Σ_{i=1}^N (a_i(x) + η_i) φ_i.     (A.1)
To find the least-squares-optimal φ_i, we construct and minimize the mean square error, averaging over the expected noise and the vector x:

E = (1/2) ⟨[ x − Σ_{i=1}^N (a_i(x) + η_i) φ_i ]²⟩_{x,η}
  = (1/2) ⟨[ x − Σ_{i=1}^N a_i(x) φ_i − Σ_{i=1}^N η_i φ_i ]²⟩_{x,η},     (A.2)
where ⟨·⟩_x indicates integration over the range of x. Because the noise is independent for each neuron, the noise terms average out except when i = j, so the average of the η_i η_j products is equal to the variance σ² of the noise on the neurons. Thus, the error with noise becomes

E = (1/2) ⟨[ x − Σ_{i=1}^N a_i(x) φ_i ]²⟩_x + (σ²/2) Σ_{i=1}^N φ_i².     (A.3)
Taking the derivative of the error gives
δE/δφ_i = −(1/2) · 2 ⟨( x − Σ_{j=1}^N a_j(x) φ_j ) a_i(x)⟩_x + σ² Σ_j φ_j δ_ij
        = −⟨a_i(x) x⟩_x + Σ_{j=1}^N ( ⟨a_i(x) a_j(x)⟩_x φ_j + σ² φ_j δ_ij ).     (A.4)
Setting the derivative to zero gives

⟨a_i(x) x⟩_x = Σ_{j=1}^N ( ⟨a_i(x) a_j(x)⟩_x + σ² δ_ij ) φ_j     (A.5)

or, in matrix form, Υ = Γφ. The decoding vectors φ_i are given by φ = Γ⁻¹Υ, where

Γ_ij = ⟨a_i(x) a_j(x)⟩_x + σ² δ_ij
Υ_i = ⟨x a_i(x)⟩_x.
Notice that the matrix Γ will be nonsingular because of the noise term on the diagonal. This same procedure can be followed to find the optimal linear decoders φ_i^{f(x)} for some linear or nonlinear function f(x). The error to be minimized then becomes

E = (1/2) ⟨[ f(x) − Σ_{i=1}^N (a_i(x) + η_i) φ_i^{f(x)} ]²⟩_{x,η}.

The minimization is analogous, with only a differing result for Υ:

Υ_i = ⟨f(x) a_i(x)⟩_x.
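The appendix procedure translates directly into a few lines of linear algebra. The sketch below uses hypothetical rectified-linear tuning curves and arbitrary parameter ranges (gains, biases, noise level) purely for illustration; it is not the article's neuron model:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 50                                  # neurons
x = np.linspace(-1, 1, 201)             # sampled range of the represented value

# Hypothetical rectified-linear tuning curves a_i(x) = max(0, alpha_i * e_i * x + J_i)
alpha = rng.uniform(20, 60, N)                      # gains
e = np.where(np.arange(N) % 2 == 0, 1.0, -1.0)      # preferred directions
J = rng.uniform(-10, 10, N)                         # bias currents
A = np.maximum(0.0, alpha[:, None] * e[:, None] * x[None, :] + J[:, None])

sigma = 0.1 * A.max()                   # noise standard deviation (10% of peak rate)

# Gamma_ij = <a_i a_j>_x + sigma^2 delta_ij,   Upsilon_i = <x a_i(x)>_x
Gamma = A @ A.T / x.size + sigma**2 * np.eye(N)
Upsilon = A @ x / x.size
phi = np.linalg.solve(Gamma, Upsilon)   # phi = Gamma^{-1} Upsilon

x_hat = A.T @ phi                       # noise-free decode of x
rmse = np.sqrt(np.mean((x_hat - x) ** 2))

# Decoding a function f(x) = x^2 reuses Gamma; only Upsilon changes
phi_f = np.linalg.solve(Gamma, A @ x**2 / x.size)
rmse_f = np.sqrt(np.mean((A.T @ phi_f - x**2) ** 2))
```

The σ² term on the diagonal keeps Γ nonsingular and well conditioned, as noted above, and the same Γ serves for any decoded function f(x), with only Υ recomputed.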
Acknowledgments

Special thanks to Charles H. Anderson, Valentin Zhigulin, and John Conklin for helpful discussions on this project. John Conklin coded the simulations for controlling the ring attractor. The code for detecting chaos was graciously provided by Chi-Sang Poon. This work is supported by grants
from the Natural Sciences and Engineering Research Council of Canada, the Canadian Foundation for Innovation, the Ontario Innovation Trust, and the McDonnell Project in Philosophy and the Neurosciences.

References

Amit, D. J. (1989). Modeling brain function: The world of attractor neural networks. Cambridge: Cambridge University Press.
Aksay, E., Gamkrelidze, G., Seung, H. S., Baker, R., & Tank, D. (2001). In vivo intracellular recording and perturbation of persistent activity in a neural integrator. Nature Neuroscience, 4, 184–193.
Biswal, B., & Dasgupta, C. (2002). Neural network model for apparent deterministic chaos in spontaneously bursting hippocampal slices. Physical Review Letters, 88, 088102.
Bradley, E. (1995). Autonomous exploration and control of chaotic systems. Cybernetics and Systems, 26, 299–319.
Brody, C. D., Romo, R., & Kepecs, A. (2003). Basic mechanisms for graded persistent activity: Discrete attractors, continuous attractors, and dynamic representations. Current Opinion in Neurobiology, 13, 204–211.
Conklin, J., & Eliasmith, C. (2005). A controlled attractor network model of path integration in the rat. Journal of Computational Neuroscience, 18, 183–203.
el Manira, A., Tegner, J., & Grillner, S. (1994). Calcium-dependent potassium channels play a critical role for burst termination in the locomotor network in lamprey. Journal of Neurophysiology, 72, 1852–1861.
Eliasmith, C., & Anderson, C. H. (2000). Rethinking central pattern generators: A general approach. Neurocomputing, 32, 735–740.
Eliasmith, C., & Anderson, C. H. (2003). Neural engineering: Computation, representation, and dynamics in neurobiological systems. Cambridge, MA: MIT Press.
Eliasmith, C. (2004). Learning context sensitive logical inference in a neurobiological simulation. In S. Levy & R. Gayler (Eds.), Compositional connectionism in cognitive science (pp. 17–20). Menlo Park, CA: AAAI Press.
Eliasmith, C., Westover, M. B., & Anderson, C. H. (2002). A general framework for neurobiological modeling: An application to the vestibular system. Neurocomputing, 46, 1071–1076.
Fukushima, K., Kaneko, C. R. S., & Fuchs, A. F. (1992). The neuronal substrate of integration in the oculomotor system. Progress in Neurobiology, 39, 609–639.
Fuster, J. M. (2001). The prefrontal cortex—an update: Time is of the essence. Neuron, 30, 319–333.
Goodridge, J. P., & Touretzky, D. S. (2000). Modeling attractor deformation in the rodent head-direction system. Journal of Neurophysiology, 83, 3402–3410.
Grillner, S., Wallén, P., Brodin, I., & Lansner, A. (1991). The neuronal network generating locomotor behavior in lamprey: Circuitry, transmitters, membrane properties, and simulation. Annual Review of Neuroscience, 14, 169–199.
Hansel, D., & Sompolinsky, H. (1998). Modeling feature selectivity in local cortical circuits. In C. Koch & I. Segev (Eds.), Methods in neuronal modeling. Cambridge, MA: MIT Press.
Hebb, D. O. (1949). The organization of behavior. New York: Wiley.
Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, 79, 2554–2558.
Kelso, J. A. S., & Fuchs, A. (1995). Self-organizing dynamics of the human brain: Critical instabilities and Silnikov chaos. Chaos, 5, 64–69.
Koch, C., & Poggio, T. (1992). Multiplying with synapses and neurons. In T. McKenna, J. Davis, & S. F. Zornetzer (Eds.), Single neuron computation. Orlando, FL: Academic Press.
Kopell, N., & Ermentrout, G. B. (1988). Coupled oscillators and the design of central pattern generators. Mathematical Biosciences, 90, 87–109.
Kuo, D., & Eliasmith, C. (2004). Understanding interactions between networks controlling distinct behaviors: Escape and swimming in larval zebrafish. Neurocomputing, 58–60, 541–547.
Lai, Y.-C., Harrison, M. A., Frei, M. G., & Osorio, I. (2003). Inability of Lyapunov exponents to predict epileptic seizures. Physical Review Letters, 91, 068012.
Laing, C. R., & Chow, C. C. (2001). Stationary bumps in networks of spiking neurons. Neural Computation, 13, 1473–1494.
Marder, E., Kopell, N., & Sigvardt, K. (1997). How computation aids in understanding biological networks. In P. Stein, S. Grillner, A. Selverston, & D. Stuart (Eds.), Neurons, networks, and motor behavior. Cambridge, MA: MIT Press.
Matsugu, M., Duffin, J., & Poon, C. (1998). Entrainment, instability, quasi-periodicity, and chaos in a compound neural oscillator. Journal of Computational Neuroscience, 5, 35–51.
Mel, B. (1994). Information processing in dendritic trees. Neural Computation, 6, 1031–1085.
Mel, B. (1999). Why have dendrites? A computational perspective. In G. Stuart, N. Spruston, & M. Häusser (Eds.), Dendrites. New York: Oxford University Press.
Miller, P., Brody, C., & Romo, R. (2003). A recurrent network model of somatosensory parametric working memory in the prefrontal cortex. Cerebral Cortex, 13, 1208–1218.
Platt, M. L., & Glimcher, G. W. (1998). Response fields of intraparietal neurons quantified with multiple saccadic targets. Experimental Brain Research, 121, 65–75.
Poon, C.-S., & Barahona, M. (2001). Titration of chaos with added noise. Proceedings of the National Academy of Sciences, 98, 7107–7112.
Pouget, A., Zhang, K., Deneve, S., & Latham, P. E. (1998). Statistically efficient estimation using population coding. Neural Computation, 10, 373–401.
Rainer, G., & Miller, E. K. (2002). Timecourse of object-related neural activity in the primate prefrontal cortex during a short-term memory task. European Journal of Neuroscience, 15, 1244–1254.
Redish, A. D. (1999). Beyond the cognitive map. Cambridge, MA: MIT Press.
Rieke, F., Warland, D., de Ruyter van Steveninck, R. R., & Bialek, W. (1997). Spikes: Exploring the neural code. Cambridge, MA: MIT Press.
Romo, R., Brody, C. D., Hernández, A., & Lemus, L. (1999). Neuronal correlates of parametric working memory in the prefrontal cortex. Nature, 399, 470–473.
Salinas, E., & Abbott, L. F. (1994). Vector reconstruction from firing rates. Journal of Computational Neuroscience, 1, 89–107.
Salinas, E., & Abbott, L. F. (1996). A model of multiplicative neural responses in parietal cortex. Proceedings of the National Academy of Sciences USA, 93, 11956–11961.
Selverston, A. I. (1980). Are central pattern generators understandable? Behavioral and Brain Sciences, 3, 535–571.
Sereno, A. B., & Maunsell, J. H. R. (1998). Shape selectivity in primate lateral intraparietal cortex. Nature, 395, 500–503.
Seung, H. S. (1996). How the brain keeps the eyes still. Proceedings of the National Academy of Sciences, USA, 93, 13339–13344.
Seung, H. S., Lee, D., Reis, B., & Tank, D. (2000). Stability of the memory of eye position in a recurrent network of conductance-based model neurons. Neuron, 26, 259–271.
Shadlen, M. N., & Newsome, W. T. (2001). Neural basis of a perceptual decision in the parietal cortex (area LIP) of the rhesus monkey. Journal of Neurophysiology, 86, 1916–1936.
Singh, R., & Eliasmith, C. (2004, March 24–28). A dynamic model of working memory in the PFC during a somatosensory discrimination task. Paper presented at Computational and Systems Neuroscience 2004, Cold Spring Harbor, NY.
Skarda, C. A., & Freeman, W. J. (1987). How brains make chaos in order to make sense of the world. Behavioral and Brain Sciences, 10, 161–195.
Snyder, L. H., Batista, A. P., & Andersen, R. A. (1997). Coding of intention in the posterior parietal cortex. Nature, 386, 167–170.
Touretzky, D. S., & Redish, A. D. (1996). Theory of rodent navigation based on interacting representations of space. Hippocampus, 6, 247–270.
von der Heydt, R., Peterhans, E., & Dursteler, M. (1991). Grating cells in monkey visual cortex: Coding texture? In B. Blum (Ed.), Channels in the visual nervous system: Neurophysiology, psychophysics, and models. London: Freund.
Wang, X. J. (2002). Probabilistic decision making by slow reverberation in cortical circuits. Neuron, 36, 955–968.
Zhang, K. (1996). Representation of spatial orientation by the intrinsic dynamics of the head-direction cell ensemble: A theory. Journal of Neuroscience, 16, 2112–2126.
Received June 3, 2004; accepted November 15, 2004.
LETTER
Communicated by Bard Ermentrout
Synchronized Firings in the Networks of Class 1 Excitable Neurons with Excitatory and Inhibitory Connections and Their Dependences on the Forms of Interactions

Takashi Kanamaru
[email protected]
Masatoshi Sekine
[email protected]
Department of Electrical and Electronic Engineering, Faculty of Technology, Tokyo University of Agriculture and Technology, Tokyo 184-8588, Japan
Synchronized firings in the networks of class 1 excitable neurons with excitatory and inhibitory connections are investigated, and their dependences on the forms of interactions are analyzed. As the forms of interactions, we treat the double exponential coupling and the interactions derived from it: pulse coupling, exponential coupling, and alpha coupling. It is found that the bifurcation structure of the networks depends mainly on the decay time of the synaptic interaction and the effect of the rise time is smaller than that of the decay time.
Neural Computation 17, 1315–1338 (2005)
© 2005 Massachusetts Institute of Technology

1 Introduction

Recently, oscillations and synchronization in neural systems have been attracting considerable attention. Particularly in the visual cortex and the hippocampus, synchronized oscillations with typical frequencies are often observed in the average behavior of the neuronal ensemble, and it has been proposed that they are related to the binding of information in the visual cortex and the regulation of synaptic plasticity in the hippocampus (for a review, see Gray, 1994). To understand the mechanism of such synchronized oscillations, networks of excitatory or inhibitory neurons have been investigated by numerous authors (Abbott & van Vreeswijk, 1993; Hansel, Mato, & Meunier, 1995; Kuramoto, 1991; Mirollo & Strogatz, 1990; Sato & Shiino, 2002; Tsodyks, Mitkov, & Sompolinsky, 1993; van Vreeswijk, 1996; van Vreeswijk, Abbott, & Ermentrout, 1994). Typically, perfect synchronization is observed in networks of pulse-coupled self-oscillating excitatory neurons (Kuramoto, 1991; Mirollo & Strogatz, 1990), but it is not always stable for networks with slow couplings, and partial synchronization, antiphase synchronization, or an asynchronous state appears depending on parameters such as the characteristic timescale of the synaptic interaction (Abbott & van Vreeswijk, 1993; Hansel et al., 1995; Sato & Shiino, 2002; Tsodyks et al., 1993; van Vreeswijk, 1996; van Vreeswijk et al., 1994). The frequencies of these synchronized firings are determined mainly by the frequency of a single neuron, and they might be much larger than physiologically observed ones, such as the 40 Hz gamma oscillation. Recently, dynamics more complex than those of the excitatory network have been found in networks of excitatory and inhibitory neurons (Börgers & Kopell, 2003; Brunel, 2000; Golomb & Ermentrout, 2001; Hansel & Mato, 2003; Kanamaru & Sekine, 2003, 2004; van Vreeswijk & Sompolinsky, 1996). Similar to the excitatory network, synchronized firings are observed in the network of self-oscillating neurons (Börgers & Kopell, 2003) or in the network of self-oscillating and excitable neurons (Hansel & Mato, 2003). Moreover, synchronized firings are observed even in networks of only excitable neurons with excitatory and inhibitory connections in a noisy environment (Brunel, 2000; Kanamaru & Sekine, 2003, 2004), where excitable neurons in the absence of connections fire randomly with the help of noise, and when an appropriate strength of connections is introduced, the synchronized firings appear. In our previous studies (Kanamaru & Sekine, 2003, 2004), we investigated a noisy network of class 1 neurons with excitatory and inhibitory connections by means of bifurcation analyses and found various synchronized firings, including chaotic ones. We found that the frequencies of such synchronized firings depend on both the noise intensity and the coupling strength. In that model, the characteristic timescale of the interaction was assumed to be of the same order as that of each neuron, so we could not examine the effect of the timescale of synaptic interactions systematically.
In this letter, we investigate the synchronized firings in the networks of class 1 excitable neurons with excitatory and inhibitory connections under a noisy environment and examine their dependence on the forms of interactions. As the forms of interactions, we treat the double exponential coupling and the interactions derived from it in some limiting cases: pulse coupling, exponential coupling, and alpha coupling. With these couplings, we investigate the dependence of the bifurcation structure on the rise time and the decay time of synaptic interactions. In section 2, we provide a definition of our model and introduce its Fokker-Planck equations. We also introduce four forms of interactions: double exponential coupling, pulse coupling, exponential coupling, and alpha coupling. In section 3, we analyze the network with the pulse coupling by solving the Fokker-Planck equations, and we obtain a bifurcation set numerically. We observe that the synchronized periodic firings appear mainly by going through the Hopf bifurcation or the saddle node on limit cycle bifurcation. In section 4, we analyze the network with the exponential coupling. Besides the synchronized periodic firings, we observe synchronized chaotic firings and anomalous high-frequency synchronization. We also investigate the effect of the decay time of the synaptic interaction. In section 5, we analyze the networks with alpha coupling or
double exponential coupling. We find that the dependence of the bifurcation structure on the rise time of the synaptic interaction is weaker than that on the decay time. Conclusions and discussions are given in the final section.

2 Model

Let us consider the coupled active rotators composed of excitatory neurons \theta_E^{(i)} (i = 1, 2, \dots, N_E) and inhibitory neurons \theta_I^{(i)} (i = 1, 2, \dots, N_I) (Kanamaru & Sekine, 2003, 2004), written as

\tau_E \dot{\theta}_E^{(i)} = 1 - a \sin\theta_E^{(i)} + \xi_E^{(i)}(t) + I_{EE}(t) - I_{EI}(t),   (2.1)

\tau_I \dot{\theta}_I^{(i)} = 1 - a \sin\theta_I^{(i)} + \xi_I^{(i)}(t) + I_{IE}(t) - I_{II}(t).   (2.2)

Here, a is a system parameter, \tau_E and \tau_I are the time constants of the neurons, I_{XY}(t) (X, Y = E or I) is the synaptic input from ensemble Y to ensemble X, and \xi_X^{(i)}(t) is gaussian white noise satisfying

\langle \xi_X^{(i)}(t) \, \xi_Y^{(j)}(t') \rangle = D \, \delta_{ij} \, \delta_{XY} \, \delta(t - t'),   (2.3)
where D is the noise intensity and \delta_{ij} is Kronecker's delta. For a > 1, an active rotator shows typical properties of an excitable system: it has a stable equilibrium \theta_0 \equiv \arcsin(1/a), and -\sin(\theta^{(i)}(t)) + 1/a shows a pulselike waveform when an appropriate amount of disturbance is injected (Kurrer & Schulten, 1995; Sakaguchi, Shinomoto, & Kuramoto, 1988; Shinomoto & Kuramoto, 1986; Tanabe, Shimokawa, Sato, & Pakdaman, 1999). Note that a single active rotator can be transformed into the canonical model \dot{\theta} = (1 - \cos\theta) + (1 + \cos\theta) r for class 1 neurons (Ermentrout, 1996; Izhikevich, 1999). Thus, our synaptically coupled active rotators might reflect the dynamics of networks of class 1 neurons such as the Connor model or the Morris-Lecar model (Ermentrout, 1996). Moreover, the active rotator has the property that its Fokker-Planck equations can be numerically integrated with a smaller number of terms than that of the leaky integrate-and-fire model. Thus, we consider it an effective tool for analyzing the dynamics of pulse neural networks. As the interaction I_{XY}(t) from ensemble Y to ensemble X (X, Y = E or I), we consider the difference of two exponential functions (Abbott & van Vreeswijk, 1993; Hansel et al., 1995; Gerstner & Kistler, 2002), written as

I_{XY}(t) = \frac{g_{XY}}{N_Y} \sum_{j=1}^{N_Y} \sum_k \frac{1}{\kappa_{1Y} - \kappa_{2Y}} \left[ \exp\!\left( -\frac{t - t_k^{(j)}}{\kappa_{1Y}} \right) - \exp\!\left( -\frac{t - t_k^{(j)}}{\kappa_{2Y}} \right) \right],   (2.4)
where t_k^{(j)} is the kth firing time of the jth neuron, and \kappa_{1Y} and \kappa_{2Y} (\kappa_{1Y} > \kappa_{2Y} > 0) denote the decay time and the rise time of the synaptic interaction, respectively. Note that the second sum is taken over k satisfying t > t_k^{(j)}, and the firing time is defined as the time when \theta_Y^{(j)} passes over the value 3\pi/2, the point located on the opposite side of the stable equilibrium point \theta_0 = \arcsin(1/a) \sim \pi/2. This interaction is called the double exponential coupling in the following. In the three limits \kappa_{1Y}, \kappa_{2Y} \to 0; \kappa_{2Y} \to 0 (\kappa_{1Y} \equiv \kappa_Y); and \kappa_{1Y} \to \kappa_{2Y} \equiv \kappa_Y, I_{XY}(t) is rewritten as

I_{XY}(t) = \frac{g_{XY}}{N_Y} \sum_{j=1}^{N_Y} \sum_k \delta\!\left( t - t_k^{(j)} \right),   (2.5)

I_{XY}(t) = \frac{g_{XY}}{N_Y} \sum_{j=1}^{N_Y} \sum_k \frac{1}{\kappa_Y} \exp\!\left( -\frac{t - t_k^{(j)}}{\kappa_Y} \right),   (2.6)

I_{XY}(t) = \frac{g_{XY}}{N_Y} \sum_{j=1}^{N_Y} \sum_k \frac{t - t_k^{(j)}}{\kappa_Y^2} \exp\!\left( -\frac{t - t_k^{(j)}}{\kappa_Y} \right),   (2.7)

and we call them pulse coupling, exponential coupling, and alpha coupling, respectively. In the following, synchronization phenomena in the network with each coupling are analyzed. To reduce the number of parameters, we set g_{EE} = g_{II} \equiv g_{int}, g_{EI} = g_{IE} \equiv g_{ext}, a = 1.05, and \tau_E = \tau_I = 1.0. In previous studies (Kanamaru & Sekine, 2003, 2004), we considered a network with the waveform coupling written as

I_{XY}(t) = \frac{g_{XY}}{N_Y} \sum_{j=1}^{N_Y} \left( -\sin\theta_Y^{(j)} + 1/a \right),   (2.8)
where the waveform of the pulse is injected into the next neuron directly, and we found various synchronized firings, including synchronized chaotic firings and weakly synchronized periodic firings. It is notable that chaos is observed in the noisy network of active rotators, although chaos does not appear in a single active rotator because of the general properties of one-dimensional differential equations. In those studies, the waveform coupling was used to facilitate the numerical analyses, but for comparison with physiologically observed synchronization phenomena, the double exponential coupling and the couplings derived from it seem more appropriate. For the analysis, let us introduce the Fokker-Planck equations (Gerstner & Kistler, 2002; Kuramoto, 1984),

\frac{\partial n_E}{\partial t} = -\frac{1}{\tau_E} \frac{\partial}{\partial \theta_E} (A_E n_E) + \frac{D}{2 \tau_E^2} \frac{\partial^2 n_E}{\partial \theta_E^2},   (2.9)
\frac{\partial n_I}{\partial t} = -\frac{1}{\tau_I} \frac{\partial}{\partial \theta_I} (A_I n_I) + \frac{D}{2 \tau_I^2} \frac{\partial^2 n_I}{\partial \theta_I^2},   (2.10)
A_E(\theta_E, t) = 1 - a \sin\theta_E + I_{EE}(t) - I_{EI}(t),   (2.11)

A_I(\theta_I, t) = 1 - a \sin\theta_I + I_{IE}(t) - I_{II}(t),   (2.12)
for the normalized number densities of the excitatory and inhibitory neurons,

n_E(\theta_E, t) \equiv \frac{1}{N_E} \sum_i \delta\!\left( \theta_E - \theta_E^{(i)} \right),   (2.13)

n_I(\theta_I, t) \equiv \frac{1}{N_I} \sum_i \delta\!\left( \theta_I - \theta_I^{(i)} \right),   (2.14)

in the limit N_E, N_I \to \infty. Note that asynchronous firings and synchronized firings of the network correspond to a stationary solution and a time-varying solution of the Fokker-Planck equations, respectively. The probability fluxes for the excitatory and inhibitory ensembles are defined as

J_E(\theta_E, t) = \frac{1}{\tau_E} \left( A_E n_E - \frac{D}{2 \tau_E} \frac{\partial n_E}{\partial \theta_E} \right),   (2.15)

J_I(\theta_I, t) = \frac{1}{\tau_I} \left( A_I n_I - \frac{D}{2 \tau_I} \frac{\partial n_I}{\partial \theta_I} \right),   (2.16)

respectively. Note that the probability flux at \theta = 3\pi/2 can be interpreted as the instantaneous firing rate at time t for each ensemble.

3 Pulse Coupling

In this section, a network with the pulse coupling given by equations 2.1, 2.2, and 2.5 is considered. The coupling term I_{XY}(t) in equation 2.5 is approximated as

I_{XY}(t) = g_{XY} J_Y(t) + \sigma(t),   (3.1)
where J Y (t) ≡ J Y (3π/2, t) is the firing rate and σ (t) is a fluctuation term. The probability flux J Y (t) at θ = 3π/2 is obtained by solving equations 2.15 and 2.16 for θ = 3π/2. This flux J Y (t) can be calculated when an inequality
\left( 1 - \frac{g_{EE}}{\tau_E} n_E\!\left( \frac{3\pi}{2} \right) \right) \left( 1 + \frac{g_{II}}{\tau_I} n_I\!\left( \frac{3\pi}{2} \right) \right) + \frac{g_{EI} g_{IE}}{\tau_E \tau_I} n_E\!\left( \frac{3\pi}{2} \right) n_I\!\left( \frac{3\pi}{2} \right) \neq 0   (3.2)
is satisfied. A sufficient condition for inequality 3.2 is

1 - \frac{g_{EE}}{\tau_E} n_E\!\left( \frac{3\pi}{2} \right) > 0,   (3.3)
because the other terms in equation 3.2 are positive. In all our numerical solutions, condition 3.3 was verified to be satisfied. In the limit of N_Y \to \infty, the fluctuation term \sigma(t) converges to zero. With this approximation, a numerically obtained bifurcation set for g_{int} = 3.5 in the (D, g_{ext}) plane is shown in Figure 1. Typically, synchronized firings exist in the area between the Hopf bifurcation line and the saddle node on limit cycle bifurcation line for moderate values of D. In Figure 1, flows in the plane of the probability fluxes J_E and J_I are also shown (their explanations are given later in this section). The Hopf and the saddle node bifurcation lines are obtained as follows. First, equations 2.9 and 2.10 are transformed into a set of ordinary differential equations \dot{x} = f(x) for the spatial Fourier coefficients of n_E and n_I, as shown in the appendix. Next, a stationary solution x_0 is numerically obtained with the Newton method (Press, Flannery, Teukolsky, & Vetterling, 1988), and the eigenvalues of the Jacobian matrix Df(x_0), numerically obtained by using the QR algorithm (Press et al., 1988), are examined to find the bifurcation lines. For the numerical calculations, each Fourier series is truncated at the first 40 or 60 terms. The homoclinic and the double limit cycle bifurcation lines are obtained by observing the long-time behavior of the solutions of equations 2.9 and 2.10. This bifurcation set is similar to that of the network with the waveform coupling (Kanamaru & Sekine, 2003), except that the chaotic firings found in the network with the waveform coupling do not exist in this network. To understand the bifurcation set, schematic flows of the solution in the (J_E, J_I) plane are also drawn on the bifurcation set in Figure 1.
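The Fourier-mode reduction described above can be sketched in a few lines. The following is a minimal illustration, not the letter's implementation: it treats a single ensemble with the coupling terms set to zero, relaxes the truncated mode equations by forward Euler instead of the Newton method, and uses our own choices of truncation order and step size.

```python
import numpy as np

# Truncated spatial-Fourier-mode integration of the Fokker-Planck equation
# (2.9) for one uncoupled noisy active rotator ensemble (I(t) = 0 -- our
# simplification).  n(theta, t) = sum_k c_k e^{i k theta}, with c_0 = 1/(2 pi)
# fixed by normalization and c_{-k} = conj(c_k).
a, tau, D = 1.05, 1.0, 0.05
K = 40                       # truncation order, as in the text
dt, T = 5e-3, 400.0

c = np.zeros(K + 1, dtype=complex)   # c[k] for k = 0..K
c[0] = 1.0 / (2.0 * np.pi)
k = np.arange(1, K + 1)

def rhs(c):
    """d c_k / dt from (2.9) with A = 1 - a sin(theta)."""
    cm = c[0:K]                          # c_{k-1}
    cp = np.append(c[2:K + 1], 0.0)      # c_{k+1}, truncated: c_{K+1} = 0
    dc = np.zeros_like(c)
    dc[1:] = (-1j * k / tau) * c[1:] \
             + (a * k / (2.0 * tau)) * (cm - cp) \
             - (D * k**2 / (2.0 * tau**2)) * c[1:]
    return dc

# crude forward-Euler relaxation to the stationary solution (the letter
# instead solves rhs(c) = 0 with the Newton method)
for _ in range(int(T / dt)):
    c = c + dt * rhs(c)

theta = np.linspace(0.0, 2.0 * np.pi, 400)
modes = np.exp(1j * np.outer(k, theta))              # e^{i k theta}
n = np.real(c[0]) + 2.0 * np.real(c[1:] @ modes)     # density n(theta)
dn = 2.0 * np.real((1j * k * c[1:]) @ modes)         # dn / dtheta
J = ((1.0 - a * np.sin(theta)) * n - D / (2.0 * tau) * dn) / tau  # flux (2.15)
```

For a stationary solution in one dimension the flux J is independent of theta, and its value at theta = 3*pi/2 is the asynchronous firing rate of the uncoupled ensemble.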
Note that a stationary solution and a time-periodic solution of the Fokker-Planck equations are projected as an equilibrium point and a limit cycle onto the (JE , JI ) plane, respectively, and they correspond to the asynchronous and the synchronized firings of the network, respectively. Typically, for small D and moderate gext , there exists a stable equilibrium point with small probability fluxes. This equilibrium point corresponds to the firings where all neurons fluctuate around their resting potentials, and when this point disappears by the saddle node on limit cycle bifurcation, the synchronized firings appear. For large D, there exists a stable equilibrium point with large probability fluxes, and it corresponds to the firings where neurons fire with high frequencies without correlations. The synchronized firings also appear after the Hopf bifurcation of the equilibrium point. This equilibrium point approaches the origin of the (JE , JI ) plane with the increase of gext , and its probability fluxes become small. Moreover, in some region in the bifurcation set, the synchronized firings also appear by the double limit cycle bifurcation or the homoclinic bifurcation. (For more information about each
Figure 1: A bifurcation set in the (D, gext) plane for the pulse-coupled network with gint = 3.5. The solid, dotted, and dash-dotted lines denote the Hopf, saddle node, and global bifurcations, respectively. Schematic flows of the solution in the (JE, JI) plane are also drawn on the bifurcation set. The filled and open circles in the trajectories in the (JE, JI) plane denote the stable and unstable equilibrium points, respectively. The solid and dashed closed curves denote the stable and unstable limit cycles, respectively. The meanings of the abbreviations are as follows: SN, saddle node; SNL, saddle node on limit cycle; DLC, double limit cycle; HB, homoclinic bifurcation; SH, subcritical Hopf; GH, generalized Hopf.
bifurcation, see Guckenheimer & Holmes, 1983, and Hoppensteadt & Izhikevich, 1997.) The raster plots of the typical synchronized firings for the finite system with NE = NI = 1000 are shown in Figure 2. Each figure shows the firing times of the neurons. As shown in Figure 2A, the synchronized firings near the saddle node on limit cycle bifurcation have a long period, and their degree of synchronization is strong. This is because the system stays a long
Figure 2: Raster plots of the typical synchronized firings for the finite system with NE = NI = 1000. The parameters are set at (A) gext /gint = 0.4 and D = 0.015 and (B) gext /gint = 0.4 and D = 0.08 with gint = 3.5. The neurons are aligned so that the excitatory neurons are in the range 0 ≤ i < 1000, and the inhibitory neurons are in the range 1000 ≤ i < 2000.
time in the area where the original saddle and node existed. As shown in Figure 2B, the synchronized firings near the Hopf bifurcation have a short period, and their degree of synchronization is weak. This is because the limit cycle that corresponds to this weakly synchronized firing with a high firing rate is created around the stable equilibrium point, which denotes the asynchronous firings.
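A finite-system simulation of the kind behind Figure 2 can be sketched with Euler-Maruyama for equations 2.1 and 2.2 and the pulse coupling 2.5. The sketch below is ours, not the authors' code: the ensembles are far smaller than the N_E = N_I = 1000 of the letter, and the step size and spike bookkeeping are our own choices; the parameters follow Figure 2B.

```python
import numpy as np

# Euler-Maruyama simulation of the pulse-coupled network (2.1, 2.2, 2.5).
# Each presynaptic spike kicks theta by g_XY / N_Y; kicks from the previous
# step are applied, which is a simple causal discretization of delta pulses.
rng = np.random.default_rng(0)
N = 200                                  # neurons per ensemble (reduced)
a, tau, D = 1.05, 1.0, 0.08
g_int = 3.5
g_ext = 0.4 * g_int                      # Figure 2B: g_ext / g_int = 0.4
dt, T = 0.01, 100.0
thr = 3.0 * np.pi / 2.0                  # firing threshold point
theta_E = np.full(N, np.arcsin(1.0 / a))  # start at the stable equilibrium
theta_I = np.full(N, np.arcsin(1.0 / a))
spikes = []                              # (time, index): E in [0, N), I in [N, 2N)
sE = sI = 0                              # spike counts of the previous step

for step in range(int(T / dt)):
    t = step * dt
    kick_E = (g_int * sE - g_ext * sI) / N   # I_EE - I_EI, integrated pulses
    kick_I = (g_ext * sE - g_int * sI) / N   # I_IE - I_II
    noise_E = np.sqrt(D * dt) * rng.standard_normal(N)
    noise_I = np.sqrt(D * dt) * rng.standard_normal(N)
    new_E = theta_E + (dt * (1.0 - a * np.sin(theta_E)) + noise_E + kick_E) / tau
    new_I = theta_I + (dt * (1.0 - a * np.sin(theta_I)) + noise_I + kick_I) / tau
    fired_E = (theta_E < thr) & (new_E >= thr)
    fired_I = (theta_I < thr) & (new_I >= thr)
    sE, sI = int(fired_E.sum()), int(fired_I.sum())
    spikes += [(t, i) for i in np.where(fired_E)[0]]
    spikes += [(t, N + i) for i in np.where(fired_I)[0]]
    theta_E = np.where(new_E >= 2.0 * np.pi, new_E - 2.0 * np.pi, new_E)
    theta_I = np.where(new_I >= 2.0 * np.pi, new_I - 2.0 * np.pi, new_I)
```

Plotting the (time, index) pairs in `spikes` produces a raster in the same format as Figure 2, with excitatory neurons below and inhibitory neurons above.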
4 Exponential Coupling

In this section, a network with the exponential coupling is analyzed. The parameters are fixed at \kappa_E = \kappa_I = 1 and g_{int} = 3.5. For a large number of neurons, equation 2.6 is approximated by an Ornstein-Uhlenbeck-type process (Gardiner, 1985), written as

\dot{I}_{XY}(t) = -\frac{1}{\kappa_Y} \left( I_{XY}(t) - g_{XY} J_Y(t) \right) + \sigma(t),   (4.1)
where σ(t) is a fluctuation term and σ(t) converges to zero in the limit of N_Y → ∞. By integrating this differential equation together with the Fokker-Planck equations, the exponentially coupled network can be analyzed numerically. A numerically obtained bifurcation set in the (D, g_ext) plane is shown in Figure 3. In this bifurcation set, there exists a crisis line where a chaotic solution disappears (as explained later in this section). In Figure 3, schematic flows of the solution in the (J_E, J_I) plane are also drawn on the bifurcation set. The bifurcation structure roughly resembles that of the pulse-coupled network, but in the exponentially coupled network, there also exist period-doubling bifurcations and chaotic solutions. The flows in the (J_E, J_I) plane, the time series of J_E, and the raster plots for the finite system with N_E = N_I = 1000 are shown in Figures 4A, 4B, and 4C, respectively, and synchronized chaotic firings are observed. Let us consider the Poincaré section of the trajectory at the line J_E = 0.15 with dJ_E/dt > 0 in the (J_E, J_I) plane. The bifurcation diagram of the attractors at the Poincaré section against D for g_ext/g_int = 0.64 and g_int = 3.5 is shown in Figure 5A, and chaotic attractors are observed. To confirm that the irregular behaviors in Figure 5A are actually chaotic, the largest Lyapunov exponent is calculated by the standard technique (Ott, 1993): computing the expansion rate of two nearby trajectories, each of which follows the set of ordinary differential equations \dot{x} = f(x) for the spatial Fourier coefficients of equations 2.9 and 2.10. The corresponding Lyapunov exponent is shown in Figure 5B. It is observed that the Lyapunov exponent takes positive values when the chaotic solutions exist and is zero when periodic solutions are stable. In the following, periodic solutions with period n in the Poincaré section are called periodic solutions with cycle n.
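The two-trajectory Lyapunov estimate just mentioned can be sketched generically. Because reproducing the letter's full Fourier-mode equations is beyond this note, the sketch below applies a Benettin-style renormalized two-trajectory estimate to the Lorenz system as a stand-in chaotic flow; the function names, parameters, and step sizes are all our own choices.

```python
import numpy as np

# Largest Lyapunov exponent from the expansion rate of two nearby
# trajectories, with periodic renormalization of the separation.
def lorenz(v, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    x, y, z = v
    return np.array([sigma * (y - x), x * (rho - z) - y, x * y - beta * z])

def rk4_step(f, v, dt):
    k1 = f(v)
    k2 = f(v + 0.5 * dt * k1)
    k3 = f(v + 0.5 * dt * k2)
    k4 = f(v + dt * k3)
    return v + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

def largest_lyapunov(f, v0, dt=0.01, n_steps=30000, d0=1e-8, renorm=10):
    v = v0.copy()
    w = v0 + d0 * np.array([1.0, 0.0, 0.0])
    for _ in range(2000):                 # discard the transient
        v, w = rk4_step(f, v, dt), rk4_step(f, w, dt)
        d = w - v
        w = v + d0 * d / np.linalg.norm(d)
    log_sum = 0.0
    for i in range(n_steps):
        v, w = rk4_step(f, v, dt), rk4_step(f, w, dt)
        if (i + 1) % renorm == 0:         # renormalize the separation
            d = w - v
            dist = np.linalg.norm(d)
            log_sum += np.log(dist / d0)
            w = v + d0 * d / dist
    return log_sum / (n_steps * dt)

lam = largest_lyapunov(lorenz, np.array([1.0, 1.0, 1.0]))
```

As in Figure 5B, a clearly positive estimate indicates chaos, while a value near zero indicates a stable periodic solution.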
The areas where the periodic solutions with cycle 2 or 4 or the chaotic solutions exist are roughly sketched in Figure 6. The periodic solutions with large cycles and the windows in the chaotic regions are neglected because their areas are very narrow. In the bifurcation set, there exists a crisis line on which the chaotic attractors disappear; where a periodic solution rather than a chaotic attractor disappears, the line corresponds to a homoclinic bifurcation. For small g_ext, a high-frequency synchronization, in which the excitatory neurons continue to fire with a period of about their pulse width, is observed. The flows of the probability flux and the raster plots for such a synchronization
Figure 3: A numerically obtained bifurcation set of the exponentially coupled network. The parameters are set at κE = κI = 1 and gint = 3.5. The solid, dotted, and dash-dotted lines denote the Hopf, saddle node, and global bifurcations, respectively. Schematic flows of the solution in the (JE, JI) plane are also drawn on the bifurcation set. The filled and open circles in the trajectories in the (JE, JI) plane denote the stable and unstable equilibrium points, respectively. The solid closed curves denote the stable limit cycle. The meanings of the abbreviations are as follows: SN, saddle node; HB, homoclinic bifurcation.
are shown in Figure 7. As shown in Figure 7B, the frequencies of the excitatory neurons are very high, and their patterns of synchronization are hardly seen. This high-frequency synchronization does not seem to correspond to the physiological observations because the periods of the physiologically observed periodic firings are much longer than the typical pulse width of a neuron. Thus, we call it the anomalous high-frequency synchronization. The anomalous high-frequency synchronization is realized because the probability flux J_E of the
Figure 4: The chaotic dynamics observed in the exponentially coupled network for gext /gint = 0.64, gint = 3.5, and D = 0.0125. (A) A flow in the (JE , JI ) plane. (B) A time series of JE . (C) The raster plot of the firings in the finite system with NE = NI = 1000.
Figure 5: (A) The positions of the attractors on the Poincaré section J_E = 0.15 against D for g_ext/g_int = 0.64. (B) The corresponding Lyapunov exponent.
excitatory neurons always takes large values. A condition for the existence of the anomalous high-frequency synchronization is obtained as follows. Generally, if the product of the time average \overline{J_X(t)} of the probability flux and the pulse width is larger than 1, the neurons in ensemble X continue to fire. With our parameters, the pulse width is about 5; thus, the excitatory neurons continue to fire if the inequality \overline{J_X(t)} > 0.2 is satisfied. In the area labeled H in Figure 6, this inequality is satisfied; thus, the anomalous high-frequency synchronization is observed.
Figure 6: The areas where the periodic solutions with cycle 2 or 4 or the chaotic solutions exist are roughly sketched, and they are labeled 2, 4, and C, respectively. In the area labeled H, there exists the anomalous high-frequency synchronization.
Before closing this section, let us consider the dependence of the exponentially coupled network on the synaptic time constants κ E and κ I . The bifurcation sets for κ E = κ I = 0.1, κ E = κ I = 0.5, and κ E = κ I = 3.0 are shown in Figure 8. The boundaries of the areas where the periodic solution with cycle 2 or the chaotic solution exists are roughly sketched. The periodic solutions with larger cycles also exist for the parameters of Figure 8C, but we neglect them because those areas are very narrow. Moreover, the area where the anomalous high-frequency synchronization exists is also shown. As shown in Figure 8A, for small κ E and κ I , the structure of the bifurcation set is almost identical with that of the pulse-coupled network. This is because equation 4.1 reduces to IXY (t) = gXY J Y (t) in the limit of κ E , κ I → 0, and it is equivalent to the interaction term of the pulse coupling in equation 3.1. As shown in Figures 8B and 8C, when κ E and κ I are increased, the two homoclinic bifurcation lines merge, and the periodic solutions with n cycles and the chaotic solutions appear. Moreover, it is also observed that the area where the synchronized firings exist becomes narrower along the D-axis by the increase of κ E and κ I . In other words, the synchronized firings are more easily obtained for short synaptic decay time κ E and κ I . Note that the change of κ E and κ I does not affect the positions of equilibrium points because the equilibrium of 4.1 is independent of κY . Thus, if gext , gint , and D are fixed, the firing rate or probability flux of the equilibrium point is kept constant with the change of κY . However, the stability of equilibrium states depends on κ E and κ I ; thus, the position of the Hopf bifurcation line changes.
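The reduction of the exponential coupling to the one-variable filter of equation 4.1, and its κ → 0 limit to the pulse-coupling relation 3.1, can be checked numerically. The sketch below uses a prescribed smooth test rate J(t) and our own illustrative constants, not a flux from the full model.

```python
import numpy as np

# (a) The noise-free part of (4.1), dI/dt = -(I - g J(t)) / kappa, should
#     equal the convolution of J with the kernel e(t; kappa) = e^{-t/kappa}/kappa.
# (b) As kappa -> 0 the filter reduces to I = g J(t), i.e. the pulse-coupling
#     relation (3.1) without the fluctuation term.
g, kappa = 3.5, 1.0
dt, T = 1e-3, 20.0
t = np.arange(0.0, T, dt)
J = 0.2 + 0.1 * np.sin(t)            # prescribed test firing rate

I_ode = np.zeros_like(t)             # forward-Euler integration of the ODE
for i in range(len(t) - 1):
    I_ode[i + 1] = I_ode[i] + dt * (-(I_ode[i] - g * J[i]) / kappa)

# direct convolution I(T) = g * int_0^T e(T - s; kappa) J(s) ds
kern = np.exp(-(t[-1] - t) / kappa) / kappa
I_conv = g * np.trapz(kern * J, t)

kappa_small = 0.01                   # fast-synapse limit
I_fast = np.zeros_like(t)
for i in range(len(t) - 1):
    I_fast[i + 1] = I_fast[i] + dt * (-(I_fast[i] - g * J[i]) / kappa_small)
```

With these choices, `I_ode[-1]` matches the convolution `I_conv`, and `I_fast[-1]` tracks `g * J` closely, which is why the κ → 0 bifurcation set in Figure 8A nearly coincides with that of the pulse-coupled network.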
Figure 7: The anomalous high-frequency synchronization. (A) Flows of the probability flux and (B) the raster plots for the finite system with NE = NI = 1000 for D = 0.014, gext/gint = 0.3, and gint = 3.5.
5 Alpha Coupling and Double Exponential Coupling

In the limit of N_Y \to \infty, the coupling term I_{XY}(t) for the network with alpha coupling is approximated as (Gardiner, 1985)

I_{XY}(t) = g_{XY} \int_{-\infty}^{t} dt' \, \alpha(t - t'; \kappa_Y) \, J_Y(t'),   (5.1)

\alpha(t; \kappa) \equiv \frac{t}{\kappa^2} \exp\!\left( -\frac{t}{\kappa} \right),   (5.2)
Figure 8: The bifurcation sets of the exponentially coupled network for (A) κE = κI = 0.1, (B) κE = κI = 0.5, and (C) κE = κI = 3.0. The internal coupling strength gint is fixed at gint = 3.5. The boundaries of the areas where the periodic solution with cycle 2 or the chaotic solution exist are rough sketches. The areas where the anomalous high-frequency synchronization exists are also shown.
and it satisfies the differential equations written as

\dot{I}_{XY}(t) = -\frac{1}{\kappa_Y} \left( I_{XY} - I_{XY}^{(0)} \right),   (5.3)

\dot{I}_{XY}^{(0)}(t) = -\frac{1}{\kappa_Y} \left( I_{XY}^{(0)} - g_{XY} J_Y \right),   (5.4)

where

I_{XY}^{(0)}(t) = g_{XY} \int_{-\infty}^{t} dt' \, e(t - t'; \kappa_Y) \, J_Y(t'),   (5.5)

e(t; \kappa) \equiv \frac{1}{\kappa} \exp\!\left( -\frac{t}{\kappa} \right).   (5.6)
By integrating the Fokker-Planck equations with equations 5.3 and 5.4, the behavior of the network with alpha coupling can be analyzed. For the network with double exponential coupling, the coupling term I_{XY}(t) can be approximated as

I_{XY}(t) = \frac{1}{\kappa_{1Y} - \kappa_{2Y}} \left( \kappa_{1Y} I_{XY}^{(1)} - \kappa_{2Y} I_{XY}^{(2)} \right),   (5.7)

I_{XY}^{(1)}(t) = g_{XY} \int_{-\infty}^{t} dt' \, e(t - t'; \kappa_{1Y}) \, J_Y(t'),   (5.8)

I_{XY}^{(2)}(t) = g_{XY} \int_{-\infty}^{t} dt' \, e(t - t'; \kappa_{2Y}) \, J_Y(t'),   (5.9)

in the limit of N_Y \to \infty, and I_{XY}^{(1)}(t) and I_{XY}^{(2)}(t) satisfy the differential equations

\dot{I}_{XY}^{(1)}(t) = -\frac{1}{\kappa_{1Y}} \left( I_{XY}^{(1)} - g_{XY} J_Y \right),   (5.10)

\dot{I}_{XY}^{(2)}(t) = -\frac{1}{\kappa_{2Y}} \left( I_{XY}^{(2)} - g_{XY} J_Y \right).   (5.11)
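These cascade equations can be validated against closed-form step responses: for a step firing rate J(t) = J0, the alpha cascade 5.3 and 5.4 gives I(t) = g J0 (1 − (1 + t/κ) e^{−t/κ}), and the combination 5.7 of the two stages 5.10 and 5.11 gives g J0 (1 − (κ1 e^{−t/κ1} − κ2 e^{−t/κ2})/(κ1 − κ2)). The sketch below uses our own illustrative constants, not parameters from the letter.

```python
import numpy as np

# Alpha coupling (5.3-5.4) as two cascaded first-order filters, driven by a
# step firing rate J(t) = J0 for t >= 0, checked against the closed form.
g, kappa, J0 = 3.5, 1.0, 0.2
dt, T = 1e-3, 5.0
n = int(T / dt)
I0 = 0.0     # I_XY^(0): exponential stage, equation 5.4
I1 = 0.0     # I_XY:     alpha stage, equation 5.3
for _ in range(n):
    I0 += dt * (-(I0 - g * J0) / kappa)
    I1 += dt * (-(I1 - I0) / kappa)
I_exact = g * J0 * (1.0 - (1.0 + T / kappa) * np.exp(-T / kappa))

# Double exponential coupling (5.7, 5.10, 5.11): two stages with different
# time constants, combined as (k1 * A - k2 * B) / (k1 - k2).
k1, k2 = 3.0, 1.0
A = B = 0.0
for _ in range(n):
    A += dt * (-(A - g * J0) / k1)
    B += dt * (-(B - g * J0) / k2)
I_de = (k1 * A - k2 * B) / (k1 - k2)
I_de_exact = g * J0 * (1.0 - (k1 * np.exp(-T / k1) - k2 * np.exp(-T / k2)) / (k1 - k2))
```

Replacing the spike-train convolutions 5.1 and 5.8-5.9 with these ODEs is what makes the Fokker-Planck analysis of sections 4 and 5 tractable: only a few auxiliary variables per coupling are integrated instead of a full history of firing times.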
By integrating the Fokker-Planck equations with equations 5.7, 5.10, and 5.11, the behavior of the network with double exponential coupling can be analyzed. Following the above procedures, bifurcation sets of the network with alpha coupling or double exponential coupling are shown in Figure 9. Figure 9A shows the result for alpha coupling with κ E = κ I = 1, and Figures 9B and 9C show the results for double exponential coupling with κ1E = κ1I = 3 and κ2E = κ2I = 1, and κ1E = κ1I = 1 and κ2E = κ2I = 0.5, respectively. The internal coupling strength gint is fixed at gint = 3.5. In Figure 9D, the corresponding parameter values are plotted in the (κ1Y , κ2Y ) plane. By comparing Figures 9A and 9B, the effect of the decay time of the synaptic interaction can be summarized. Similar to the results for the exponential coupling in Figure 8, with the increase of the decay time, the area where the synchronized firings exist becomes narrower along the D-axis. By comparing Figures 9A and 9C, the effect of the rise time can be summarized. Unlike the decay time, it is observed that the change of the rise time does not have a large effect on the overall bifurcation structure of the network. We perform more detailed analyses on the dependence of the bifurcation structure on the synaptic time constants. We focus only on the large D and consider the Hopf bifurcation observed when varying the synaptic time constants κ1 ≡ κ1E = κ1I and κ2 ≡ κ2E = κ2I for fixed D, gext and gint . As stated in the previous section, the change of synaptic time constants affects the stability of the equilibrium points, but it does not affect the firing
Figure 9: The bifurcation sets of the network with the alpha coupling or the double exponential coupling. (A) Alpha coupling, κE = κI = 1. (B) Double exponential coupling, κ1E = κ1I = 3 and κ2E = κ2I = 1. (C) Double exponential coupling, κ1E = κ1I = 1 and κ2E = κ2I = 0.5. The internal coupling strength gint is fixed at gint = 3.5. (D) The chosen parameters are plotted in the (κ1Y, κ2Y) plane. The filled circles denote the parameters investigated in this figure, and the open circles denote the parameters treated in the previous sections.
rate of each ensemble. The Hopf bifurcation lines observed when varying the synaptic time constants are shown in Figure 10. Typically, as shown in Figure 10A, the Hopf bifurcation takes place by decreasing the synaptic decay time κ1 , and the synchronized firings appear. This is because the area for the synchronized firings widens along the D-axis with the decrease of κ1 , as shown in Figure 8. On the other hand, as shown in Figure 10A, its dependence on the synaptic rise time κ2 is not uniform. The Hopf bifurcation takes place with the change of κ2 only when the synaptic decay time κ1 is appropriately chosen. It is also observed that the synchronized firings appear even with the increase of κ2 .
Figure 10: The Hopf bifurcation observed when varying the synaptic time constants for (A) gext/gint = 0.7 and D = 0.04, gext/gint = 0.6 and D = 0.05, and (B) gext/gint = 0.25 and D = 0.013. The internal coupling strength gint is fixed at gint = 3.5. For A, the synchronized periodic firings exist for short decay time κ1, and for B, the synchronized periodic firings exist for long decay time κ1. The synchronized state is stable in the area labeled P, and the asynchronous state is stable in the area labeled S.
Moreover, as shown in Figure 10B, there exist parameter values where long synaptic time constants cause the synchronized firings. This is because the area for the synchronized firings slightly widens along the gext axis with the increase of κ1 , as shown in Figure 8. However, these synchronized firings are anomalous high-frequency synchronization; thus, this phenomenon might not have a physiological correspondence.
6 Conclusions and Discussions

We have analyzed the dependence of the synchronized firings in the networks of class 1 excitable neurons with excitatory and inhibitory connections on the forms of interactions. As the forms of interactions, we treated the double exponential coupling and the interactions derived from it in some limiting cases—pulse coupling, exponential coupling, and alpha coupling—and investigated the dependence of the bifurcation structure on the rise time and the decay time of the interactions. By investigating the dependence of the solutions on the external connection strength g_ext and the noise intensity D, various synchronized firings are observed, such as synchronized periodic firings, synchronized chaotic firings, and the anomalous high-frequency synchronization. The decay time κ1 of the synaptic potential affects the bifurcation structure of the synchronized firings in the (D, g_ext) plane. With the decrease of κ1, the area showing the synchronized firings widens along the D-axis; in other words, the synchronized firings are more easily obtained for a short synaptic decay time. It is also found that a relatively large value of κ1 is required to observe the synchronized chaotic firings. The dependence of the overall bifurcation structure on the synaptic rise time κ2 is weaker than that on κ1. In the analysis of synchronization in neural systems, the average firing rate of the ensemble is often fixed by regulating the constant input to the network (e.g., see Hansel & Mato, 2003). With such a procedure, it is possible to separate the effects of the firing rate and of the other parameters on the bifurcation structure. In our model, the firing rate corresponds to the probability flux, and such a fixing of the firing rate is not performed in our analysis—namely, the value of the firing rate varies depending on the parameters g_ext, g_int, and D. Thus, our bifurcation sets reflect the effects of both the firing rate and the other parameters.
However, when gext, gint, and D are fixed, the firing rate of the equilibrium point takes a constant value; thus, the effect of the firing rate is eliminated in the bifurcation set in the (κ1, κ2) plane (see Figure 10).

Let us consider the effect of the timescale of the synaptic interaction on the synchronized firings. In networks of self-oscillating excitatory neurons with alpha couplings, the perfectly synchronized state is known to be unstable (van Vreeswijk, 1996; van Vreeswijk et al., 1994). In such a network, the asynchronous state is stable for long synaptic timescales, and the partially synchronized state is stabilized for short synaptic timescales (van Vreeswijk, 1996). The synchronization observed in our noisy network would correspond to their partial synchronization, and, similar to their results, a short synaptic timescale facilitates synchronization in our network (see Figure 10A). Notably, the synchronous state is stabilized even for long synaptic timescales in some parameter ranges (see Figure 10B). This effect can be understood by considering the overall bifurcation structure. However, this synchronous state corresponds to the anomalous high-frequency
synchronization; thus, this phenomenon might not have a physiological correspondence.

As for the rise time of the synaptic interaction, a pair of self-oscillating excitatory leaky integrate-and-fire neurons with exponential couplings is known to show perfect synchronization, whereas the network with alpha couplings shows only partial synchronization or antiphase synchronization (van Vreeswijk et al., 1994). Moreover, for a pair of self-oscillating excitatory neurons with double exponential couplings, a short rise time is known to widen the parameter range in which partial synchronization is observed (Hansel et al., 1995). These results suggest that a short rise time facilitates synchronization in small networks. Our results, however, show that the rise time of the synaptic interaction has a smaller effect on the overall bifurcation structure than the decay time. This might be because our network contains a very large number of neurons, so the effect of a single pulse is scaled as ∼ 1/NX (X = E or I) and is negligible. In such a network, the bifurcation structure might be determined by the characteristic timescale of the synaptic input IXY(t). In our configuration, the decay time is longer than the rise time (κ1 ≥ κ2); thus, the decay time is dominant in IXY(t).

Let us consider the roles of inhibition. In our network, synchronized firings are not observed without inhibitory neurons (see the bifurcation set at gext = 0), which might be because our network is composed of excitable neurons. Although the period of the firings in networks of self-oscillating excitatory neurons is typically determined by the period of a single neuron, it can take various values depending on the parameters in a network of excitable neurons with excitatory and inhibitory connections. Typically, the period is long near the saddle node on limit cycle bifurcation and the homoclinic bifurcation, and it is short near the Hopf bifurcation.
Note that the period of the firings near the Hopf bifurcation can take large values if the activities of the excitatory and inhibitory ensembles are balanced and weakly synchronized periodic firings are realized (Kanamaru & Sekine, 2004).

In the analysis of the pulse-coupled network, its bifurcation structure was found to be similar to that of the network with the waveform coupling written as equation 2.8 (Kanamaru & Sekine, 2003). The width of the pulse injected into the next neuron under the waveform coupling is as large as ∼ 5, whereas the width of the interaction in the pulse-coupled network is infinitesimal; this similarity therefore seems odd. The apparent contradiction might be explained as follows. In the network with the waveform coupling or the pulse coupling, each neuron has a characteristic timescale determined by (τE, a) or (τI, a). The coupling introduces no additional characteristic timescale to the network, because the interaction under the waveform coupling has the same characteristic timescale as the neuron, and the interaction under the pulse coupling has no characteristic timescale. With other couplings, such as the exponential coupling, a new characteristic timescale of the synapse is introduced to the network; thus, its dynamics becomes more complex.
Hoppensteadt and Izhikevich considered weakly connected networks of class 1 neurons close to the saddle node bifurcation point and derived a canonical model described by phase variables connected with pulse coupling (Hoppensteadt & Izhikevich, 1997; Izhikevich, 1999). Because of the closeness to the bifurcation point, the characteristic timescale of the neuron is long, and the characteristic timescale of the coupling becomes relatively short; thus, the pulse coupling is justified in the canonical model. The behavior of this model is expected to be similar to that of our pulse-coupled active rotators. When the neuron is away from the bifurcation point, the approximation with the pulse coupling does not hold; thus, couplings such as the double exponential coupling might be required. In such a network, synchronized firings appear mainly through the Hopf bifurcation or the homoclinic bifurcation, and synchronized periodic firings with long periods and synchronized chaotic firings are typically observed over a wide range of parameters. This ubiquity of chaotic firings might suggest the importance of chaos in brain dynamics.

In this article, for simplicity, we treated only the case where the time constants of the excitatory neurons and the inhibitory neurons are identical. Kanamaru and Sekine (2004) treated a network with the waveform coupling having different time constants τE = 1 and τI = 2 and found that its dynamics is more complex than that of the network with τE = τI = 1. Moreover, the weakly synchronized periodic firings often observed in physiological experiments (Gray & Singer, 1989; Buzsáki, Horváth, Urioste, Hetke, & Wise, 1992; Fisahn, Pike, Buhl, & Paulsen, 1998) are also observed in the network with τE = 1 and τI = 2.
Physiological neurons have many characteristic timescales, such as those of the various ion channels and of the synaptic interactions, and excitatory neurons and inhibitory neurons are known to have different time constants. Thus, it would be important to find the dominant characteristic timescales in the physiological system and to incorporate them in the theoretical model.

Appendix: Numerical Integration of the Fokker-Planck Equation

In this section, we give a method for the numerical integration of the Fokker-Planck equations 2.9 and 2.10. The two densities given by equations 2.13 and 2.14 are 2π-periodic functions of θE and θI, respectively. Thus, they can be expanded as
$$
n_E(\theta_E, t) = \frac{1}{2\pi} + \sum_{k=1}^{\infty} \left[ a_k^E(t) \cos(k\theta_E) + b_k^E(t) \sin(k\theta_E) \right], \tag{A.1}
$$

$$
n_I(\theta_I, t) = \frac{1}{2\pi} + \sum_{k=1}^{\infty} \left[ a_k^I(t) \cos(k\theta_I) + b_k^I(t) \sin(k\theta_I) \right], \tag{A.2}
$$

and, by substituting them, equations 2.9 and 2.10 are transformed into a set of ordinary differential equations $\dot{x} = f(x)$, where $x = (a_1^E, b_1^E, a_1^I, b_1^I, a_2^E, b_2^E, a_2^I, b_2^I, \ldots)^t$:

$$
\frac{da_k^{(X)}}{dt} = -\frac{k}{\tau_X}(1 + I_X)\, b_k^{(X)} + \frac{ka}{2\tau_X}\left(a_{k-1}^{(X)} - a_{k+1}^{(X)}\right) - \frac{k^2 D}{2\tau_X^2}\, a_k^{(X)}, \tag{A.3}
$$

$$
\frac{db_k^{(X)}}{dt} = \frac{k}{\tau_X}(1 + I_X)\, a_k^{(X)} + \frac{ka}{2\tau_X}\left(b_{k-1}^{(X)} - b_{k+1}^{(X)}\right) - \frac{k^2 D}{2\tau_X^2}\, b_k^{(X)}, \tag{A.4}
$$

$$
I_E \equiv I_{EE} - I_{EI}, \tag{A.5}
$$

$$
I_I \equiv I_{IE} - I_{II}, \tag{A.6}
$$

$$
a_0^{(X)} \equiv \frac{1}{\pi}, \tag{A.7}
$$

$$
b_0^{(X)} \equiv 0, \tag{A.8}
$$

for $k \geq 1$ and $X = E$ or $I$. These ordinary differential equations are numerically integrated with the fourth-order Runge-Kutta algorithm.
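The scheme can be sketched in code for a single ensemble driven by a constant input I (in the paper, I_X is time dependent and couples the excitatory and inhibitory ensembles, so this is a simplification); the truncation order K and all parameter values are assumptions of this sketch:

```python
import numpy as np

def rhs(y, tau, a, D, I, K):
    """Right-hand side of equations A.3-A.4, truncated at mode k = K.

    y holds (a_1, ..., a_K, b_1, ..., b_K); a_0 = 1/pi and b_0 = 0 are
    fixed (eqs. A.7-A.8), and a_{K+1} = b_{K+1} = 0 closes the truncation.
    """
    ak = np.concatenate(([1.0 / np.pi], y[:K], [0.0]))
    bk = np.concatenate(([0.0], y[K:], [0.0]))
    k = np.arange(1, K + 1)
    da = (-(k / tau) * (1 + I) * bk[1:-1]
          + (k * a) / (2 * tau) * (ak[:-2] - ak[2:])
          - k**2 * D / (2 * tau**2) * ak[1:-1])
    db = ((k / tau) * (1 + I) * ak[1:-1]
          + (k * a) / (2 * tau) * (bk[:-2] - bk[2:])
          - k**2 * D / (2 * tau**2) * bk[1:-1])
    return np.concatenate((da, db))

def rk4_step(y, dt, *args):
    """One fourth-order Runge-Kutta step, as used in the appendix."""
    k1 = rhs(y, *args)
    k2 = rhs(y + 0.5 * dt * k1, *args)
    k3 = rhs(y + 0.5 * dt * k2, *args)
    k4 = rhs(y + dt * k3, *args)
    return y + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

def density(theta, y, K):
    """Reconstruct n(theta, t) from the Fourier coefficients (eq. A.1)."""
    k = np.arange(1, K + 1)[:, None]
    return (1.0 / (2.0 * np.pi)
            + np.sum(y[:K, None] * np.cos(k * theta)
                     + y[K:, None] * np.sin(k * theta), axis=0))
```

Because the constant term 1/(2π) is fixed, the reconstructed density stays normalized to 1 regardless of the evolved coefficients.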
Acknowledgments

T.K. is grateful to T. Horita for his careful reading of the manuscript. This research was partially supported by a Grant-in-Aid for Encouragement of Young Scientists (B) (No. 14780260) from the Ministry of Education, Culture, Sports, Science, and Technology, Japan.
References

Abbott, L. F., & van Vreeswijk, C. (1993). Asynchronous states in networks of pulse-coupled oscillators. Physical Review E, 48, 1483–1490.
Börgers, C., & Kopell, N. (2003). Synchronization in networks of excitatory and inhibitory neurons with sparse, random connectivity. Neural Computation, 15, 509–538.
Brunel, N. (2000). Dynamics of sparsely connected networks of excitatory and inhibitory spiking neurons. Journal of Computational Neuroscience, 8, 183–208.
Buzsáki, G., Horváth, Z., Urioste, R., Hetke, J., & Wise, K. (1992). High-frequency network oscillation in the hippocampus. Science, 256, 1025–1027.
Ermentrout, B. (1996). Type I membranes, phase resetting curves, and synchrony. Neural Computation, 8, 979–1001.
Fisahn, A., Pike, F. G., Buhl, E. H., & Paulsen, O. (1998). Cholinergic induction of network oscillations at 40 Hz in the hippocampus in vitro. Nature, 394, 186–189.
Gardiner, C. W. (1985). Handbook of stochastic methods. Berlin: Springer-Verlag.
Gerstner, W., & Kistler, W. (2002). Spiking neuron models. Cambridge: Cambridge University Press.
Golomb, D., & Ermentrout, G. B. (2001). Bistability in pulse propagation in networks of excitatory and inhibitory populations. Physical Review Letters, 86, 4179–4182.
Gray, C. M. (1994). Synchronous oscillations in neuronal systems: Mechanisms and functions. Journal of Computational Neuroscience, 1, 11–38.
Gray, C. M., & Singer, W. (1989). Stimulus-specific neuronal oscillations in orientation columns of cat visual cortex. Proceedings of the National Academy of Sciences of the USA, 86, 1698–1702.
Guckenheimer, J., & Holmes, P. (1983). Nonlinear oscillations, dynamical systems, and bifurcations of vector fields. New York: Springer.
Hansel, D., & Mato, G. (2003). Asynchronous states and the emergence of synchrony in large networks of interacting excitatory and inhibitory neurons. Neural Computation, 15, 1–56.
Hansel, D., Mato, G., & Meunier, C. (1995). Synchrony in excitatory neural networks. Neural Computation, 7, 307–337.
Hoppensteadt, F. C., & Izhikevich, E. M. (1997). Weakly connected neural networks. New York: Springer.
Izhikevich, E. M. (1999). Class 1 neural excitability, conventional synapses, weakly connected networks, and mathematical foundations of pulse-coupled models. IEEE Transactions on Neural Networks, 10, 499–507.
Kanamaru, T., & Sekine, M. (2003). Analysis of globally connected active rotators with excitatory and inhibitory connections using the Fokker-Planck equation. Physical Review E, 67, 031916.
Kanamaru, T., & Sekine, M. (2004). An analysis of globally connected active rotators with excitatory and inhibitory connections having different time constants using the nonlinear Fokker-Planck equations. IEEE Transactions on Neural Networks, 15, 1009–1017.
Kuramoto, Y. (1984). Chemical oscillations, waves, and turbulence. Berlin: Springer.
Kuramoto, Y. (1991). Collective synchronization of pulse-coupled oscillators and excitable units. Physica D, 50, 15–30.
Kurrer, C., & Schulten, K. (1995). Noise-induced synchronous neuronal oscillations. Physical Review E, 51, 6213–6218.
Mirollo, R. E., & Strogatz, S. H. (1990). Synchronization of pulse-coupled biological oscillators. SIAM Journal on Applied Mathematics, 50, 1645–1662.
Ott, E. (1993). Chaos in dynamical systems. Cambridge: Cambridge University Press.
Press, W. H., Flannery, B. P., Teukolsky, S. A., & Vetterling, W. T. (1988). Numerical recipes in C. Cambridge: Cambridge University Press.
Sakaguchi, H., Shinomoto, S., & Kuramoto, Y. (1988). Phase transitions and their bifurcation analysis in a large population of active rotators with mean-field coupling. Progress of Theoretical Physics, 79, 600–607.
Sato, Y. D., & Shiino, M. (2002). Spiking neuron models with excitatory or inhibitory synaptic couplings and synchronization phenomena. Physical Review E, 66, 041903.
Shinomoto, S., & Kuramoto, Y. (1986). Phase transitions in active rotator systems. Progress of Theoretical Physics, 75, 1105–1110.
Tanabe, S., Shimokawa, T., Sato, S., & Pakdaman, K. (1999). Response of coupled noisy excitable systems to weak stimulation. Physical Review E, 60, 2182–2185.
Tsodyks, M., Mitkov, I., & Sompolinsky, H. (1993). Pattern of synchrony in inhomogeneous networks of oscillators with pulse interactions. Physical Review Letters, 71, 1280–1283.
van Vreeswijk, C. (1996). Partial synchronization in populations of pulse-coupled oscillators. Physical Review E, 54, 5522–5537.
van Vreeswijk, C., Abbott, L. F., & Ermentrout, G. B. (1994). When inhibition not excitation synchronizes neural firing. Journal of Computational Neuroscience, 1, 313–321.
van Vreeswijk, C., & Sompolinsky, H. (1996). Chaos in neuronal networks with balanced excitatory and inhibitory activity. Science, 274, 1724–1726.
Received March 11, 2004; accepted November 3, 2004.
LETTER
Communicated by Suzanna Becker
A Hierarchy of Associations in Hippocampo-Cortical Systems: Cognitive Maps and Navigation Strategies

J. P. Banquet
[email protected]
INSERM U483 Neuroscience and Modelization, Université Pierre et Marie Curie, 75252 Paris, France
Ph. Gaussier
[email protected]
M. Quoy
[email protected]
A. Revel
[email protected]
CNRS U2235 ETIS-Neurocybernétique, Université de Cergy-Pontoise-ENSEA, 95014 Cergy-Pontoise, France
Y. Burnod
[email protected] INSERM U483 Neuroscience and Modelization, Universit´e Pierre et Marie Curie, 75252 Paris, France
In this letter we describe a hippocampo-cortical model of spatial processing and navigation based on a cascade of increasingly complex associative processes that are also relevant for other hippocampal functions such as episodic memory. Associative learning of different types and the related pattern encoding-recognition take place at three successive levels: (1) an object location level, which computes the landmarks from merged multimodal sensory inputs in the parahippocampal cortices; (2) a subject location level, which computes place fields by combination of local views and movement-related information in the entorhinal cortex; and (3) a spatiotemporal level, which computes place transitions from contiguous place fields in the CA3-CA1 region, which form building blocks for learning temporospatial sequences. At the cell population level, superficial entorhinal place cells encode spatial, context-independent maps as landscapes of activity; populations of transition cells in the CA3-CA1 region encode context-dependent maps as sequences of transitions, which form graphs in prefrontal-parietal cortices. The model was tested on a robot moving in a real environment; these tests produced results that could help to interpret biological data. Neural Computation 17, 1339–1384 (2005)
© 2005 Massachusetts Institute of Technology
Two different goal-oriented navigation strategies were displayed depending on the type of map used by the system. Thanks to its multilevel, multimodal integration and behavioral implementation, the model suggests functional interpretations for largely unaccounted-for structural differences between hippocampo-cortical systems. Further, spatiotemporal information, a common denominator shared by several brain structures, could serve as a cognitive processing frame and a functional link, for example, during spatial navigation and episodic memory, as suggested by the applications of the model to other domains, in particular temporal sequence learning and imitation.
1 Introduction

In recent years, improved understanding of hippocampo-cortical connectivity (Witter et al., 2000; Lavenex & Amaral, 2000; Amaral & Witter, 1989) and evidence from a variety of experimental approaches have indicated that each of the component fields of the hippocampal system (parahippocampal region, entorhinal cortex, hippocampus proper) may serve different yet complementary functions. Both anatomical and experimental results suggest the existence of at least three main processing levels for complex temporospatial information: a first level in the perirhinal and postrhinal cortex for pattern-location association, a second level in the entorhinal cortex for the integration of visuospatial and self-motion information into a coarse spatial code, and a third level for temporospatial and contextual integration in the trisynaptic loop, which forms a major input to the subiculum.

Moreover, two parallel streams (conveying respectively "what" and "where" information) have been delineated by tracers all the way through the parahippocampal and entorhinal systems. Local connections within and between these streams potentially lead to increased associativity and integration of the information that reaches the different rostrocaudal or mediolateral regions of the hippocampal system (Lavenex & Amaral, 2000). Yet, contrasting with these partial connections, a layered projection of the "what" and "where" streams leads to considerable convergence and a loss of anatomical topology at the level of the dentate gyrus (DG) and the CA3 fields (Witter et al., 2000; Lavenex & Amaral, 2000; Amaral & Witter, 1989). These two structures also receive important modulating signals from the septum of the basal forebrain cholinergic system and from the dopaminergic system. The functional meaning of these structural characteristics is poorly understood.
A biologically realistic and functionally integrated model should help to clarify the properties of the different subsystems and their contribution to global functions attributed to the hippocampus, such as spatial processing and navigation, or episodic memory. The hippocampus of the rat has been hypothesized to host a spatial representation of the animal’s environment (O’Keefe & Nadel, 1978). The main
evidence in support of this theory is the existence of hippocampal place cells (PCs): pyramidal neurons whose firing is strongly correlated with the location of a freely moving rat in its environment (O'Keefe & Dostrovsky, 1971). The activity of each cell is selective for the current location of the animal. This cell-specific region of intense discharge is named the firing field, by analogy with the receptive field of cortical neurons. The firing fields of PCs can be seen in all parts of the environment accessible to the rat, so that collectively, the active PCs and their specific firing profiles provide a potent signature of the environment and, plausibly, the components of a map. If the shape of the apparatus (Muller & Kubie, 1987; Lever, Willis, Cacucci, Burgess, & O'Keefe, 2002), the color of the objects within the apparatus (Bostock, Muller, & Kubie, 1991; Kentros, Hargreaves, Kandel, Shapiro, & Muller, 1998), or the orientation of the apparatus relative to the background (Cressant, Muller, & Poucet, 2002; Skaggs, Knierim, Kudrimoti, & McNaughton, 1995; Tanila, Sipila, Shapiro, & Eichenbaum, 1997) is changed, "remapping" takes place: some cells active in one apparatus become silent, and vice versa. The fields of the cells active in both apparatuses are unrelated. This phenomenon suggests that the hippocampus learns and holds distinct maps for distinct environments.

The hallmark of our hippocampal model was a dynamical spatiotemporal (and not just spatial) representation of the space and task environment through the computation and encoding of transitions in the CA field (transition cells), the inclusion of this hippocampal structure in a larger cortical-subcortical network, and the storage of the maps at the cortical level.
These characteristics provided a straightforward solution to the theoretical difficulty of switching from a spatial cognitive map to its motor implementation during goal-oriented navigation (Banquet, Gaussier, Quoy, Revel, & Burnod, 2004; Gaussier, Revel, Banquet, & Babeau, 2002; Banquet, Gaussier, Revel, Moga, & Burnod, 2001). Even though different implementations of neural fields (Amari, 1977) and chaotic attractors (Tsuda, 2001) were used (Quoy, Banquet, & Dauce, 2001; Dauce, Quoy, & Doyon, 2002), the model presented here differs from a classical attractor model in that no recurrent connections were implemented in CA3.

Our main goal in this letter is to delineate within a coherent frame the distinct complementary contributions, in spatial processing and navigation, of the parahippocampal region (perirhinal, PR; parahippocampal, PH; and entorhinal, EC, cortices) and of the hippocampus (HS) proper, based on anatomical and experimental observations, and to make testable predictions. Our model comprises successive levels of association of different types and of increasing complexity. These associative neural nets, functionally paired to pattern-encoding and recognition networks, provided increasingly multimodal and abstract representations of the inputs. The differences between the associations performed by the local recurrent cortical circuits of pyramidal cells and the extensive, global CA3 associations (Cohen & Eichenbaum, 1993) were attributed to the distinct structures of hippocampal and cortical networks. Object-location associations (here
landmarks), encountered in the parahippocampal cortex (Rolls & Treves, 1998), combined to form local views; in conjunction with idiothetic inputs (here, idiothetic information means all direct self-motion information, including optic flow, vestibular signals, corollary discharge, and somatosensory feedback), these views created position-dependent activity in the medial EC (Quirk, Muller, Kubie, & Ranck, 1992; Sharp, 1999) and HS. The wealth of experiments on spatial and navigation tasks in rodents and primates provided a test bench for model development and analysis.

The recording of at least two types of PCs (hippocampal and entorhinal-subicular) suggests the encoding by distinct neural populations of two types of maps for the same environment. Classically, PCs with well-delimited place fields have been recorded among CA3-CA1 pyramidal cells and DG granule cells (Jung & McNaughton, 1993). More recently, place cell–like activity has been recorded in the superficial (Quirk et al., 1992; Sharp, 1999) and deep layers (Frank, Brown, & Wilson, 2000; Mizumori, Ward, & Lavoie, 1992) of the medial entorhinal cortex (MEC), as well as in the subiculum (SUB) (Sharp & Green, 1994). The firing fields of these pyramidal neurons have no clear-cut boundaries but show a graded decay starting from spatially stable maxima. No remapping of these fields takes place when the geometry of the environment is changed; rather, there is a topological adaptation to the shape of the environment (Quirk et al., 1992; Sharp, 1999). The relatively weak and coarse place codes found in superficial EC are refined in the hippocampus proper to create a finely grained representation of position in DG, transformed into larger, overlapping place fields in CA3-CA1, and further embedded in the context of a trajectory in deep EC (Frank et al., 2000).
Our model reproduced the two types of place fields, entorhinal and hippocampal, starting from real views taken from the environment by the camera of a moving robot, and also provided mechanistic and functional interpretations and predictions. The hypothesis of a hierarchy of associativity allowed us to consider the spatial information precoded in EC as the source for both a refined DG spatial code and a CA3-CA1 temporospatial code. Recent results confirmed the main assumption of our model (Banquet et al., 1997; Revel, Gaussier, Leprêtre, & Banquet, 1998) of two distinct functions of EC-DG and CA3-CA1 for the processing of spatial and temporal-order information. After selective DG or CA1 lesions, a double dissociation in the separation of, respectively, small-grain spatial patterns and temporospatial patterns (Gilbert, Kesner, & Lee, 2001) supported this view.

Collectively, the corresponding two types of place cells encoded two types of coexisting hippocampo-cortical maps, associated with distinct navigation strategies during the robotic experiments. The first was a "universal," context-independent map (Sharp, 1999) computed by superficial entorhinal neural populations with weak position-dependent activity, based on "landscapes" of PC potentials proper to each location. The classical concept of a spatial map was extended here to the acquisition of a coarse yet specific location-action mapping, close to the concept of a cognitive map.
The second type was a context-dependent map computed by the CA3-CA1 association networks based on place field transitions encoded by transition cells. A transition cell or set (representing a population in the model) was a minimal representation of changing subsets of active CA3 neurons during navigation. The transition cells formed the building blocks of the neural representations of temporospatial sequences, graphs, and contextual maps putatively stored in parietal or prefrontal cortex. While the first type of map could be characterized as spatial and stable, this second type could be characterized as temporospatial and dynamic. Universal and contextual maps were both modulated by the "head direction system," thus achieving external coherence aligned to the external world, as well as internal coherence by the alignment of views from multiple directions.

Our model built on previous models of place cells (O'Keefe, 1991; Sharp, Blair, & Brown, 1996; Burgess, Recce, & O'Keefe, 1994; McNaughton, Knierim, & Wilson, 1994; Touretzky & Redish, 1996) and yet made several original contributions. First, the use of transitions to guide actions provided a straightforward bridge between spatial representation and navigation. Second, a theoretical analysis of the process of place field and map learning resulted in a single analytical equation (equation 2.10) that summarized the spatial properties of the network and was useful for understanding the relations between the landmarks and the geometrical properties of the place fields. Third, visual information was automatically extracted from the environment by a biologically inspired vision system combined with path integration to provide a mechanistic and functional integration and interpretation of the two types of place fields. Stable, invariant, but coarse spatial codes were combined with context- and task-dependent dynamic codes (transitions) to produce robust and flexible temporospatial representations.
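As a toy illustration of the transition idea described above (not the authors' implementation; the place labels and the plain-dictionary graph are assumptions of this sketch), a sequence of visited places can be recoded as transitions and accumulated into a graph that can support route planning:

```python
from collections import defaultdict

def encode_transitions(place_sequence):
    """Recode a sequence of visited places as (origin, destination) pairs,
    a minimal stand-in for the transition coding attributed to CA3-CA1."""
    return [(a, b) for a, b in zip(place_sequence, place_sequence[1:]) if a != b]

def build_graph(paths):
    """Accumulate transitions from several paths into an adjacency map,
    the kind of context-dependent graph the model stores cortically."""
    graph = defaultdict(set)
    for path in paths:
        for a, b in encode_transitions(path):
            graph[a].add(b)
    return graph

# Two paths toward a goal place "G" yield a small graph over which
# goal-oriented navigation reduces to ordinary graph search.
paths = [["A", "B", "C", "G"], ["A", "D", "C", "G"]]
graph = build_graph(paths)
```

Because transitions rather than places drive action selection, moving from map to motor command amounts to choosing the outgoing edge that leads toward the goal.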
At the population level, the map concept was extended to a mathematical mapping between the spaces of representations and actions, which could be shared by both spatial and cognitive maps. Finally, the most significant subsystems of the parahippocampal region and the hippocampus proper were functionally integrated in order to implement, beyond the simple simulation of a model, a robot control system during navigation experiments that were more recently conducted in parallel in rat and robot (Paz-Villagran, Save, & Poucet, 2003, 2004).¹

¹ Koala robot built by K-Team, equipped with a CCD camera mounted on a servomotor to take panoramic views of the environment; the visual field varied from 60 to 300 degrees with a maximal resolution of 256 × 1200 pixels. A magnetic compass simulated the vestibular system.

This letter emphasizes the anatomical and physiological support and the detailed mathematical formulation of the different model subsystems and the corresponding experimental predictions; it also proposes a functional significance for the different types of PCs and the corresponding maps by establishing a link between map types and navigation strategies. Nevertheless, the letter focuses on the input stages (PR, PH) and
the early stages (EC, DG) of hippocampal processing, which form a sound basis for the development of the whole system. The functions of CA3, CA1, and the subiculum are only sketched here. In spite of its apparently limited scope, further developments of the model proved its general relevance for hippocampal and brain processing, since the same architecture receiving different input modalities was successfully used for learning purely temporal or spatiotemporal sequences (Banquet, Gaussier, Revel, et al., 2001; Banquet, Gaussier, Quoy, Revel, & Burnod, 2002), as well as for learning by imitation (Gaussier, Moga, Banquet, & Quoy, 1998; Banquet, Gaussier, Revel, et al., 2001; Andry, Gaussier, Moga, Banquet, & Nadel, 2001), and could be adapted to any type of information in different formal spaces (e.g., word-list learning). This result is in agreement with the detection of spatiotemporal information in a large variety of brain structures more or less directly related to the hippocampal system. This information could help to monitor the specific processing performed by these structures and provide a functional link between them. We first outline the anatomical and physiological bases, the architecture, and the functioning of the model in the methodological section, before presenting the results and a discussion.

2 Methods

2.1 Anatomical and Physiological Basis of the Model. The parahippocampal region, the first level in the hierarchy of associativity of the hippocampo-cortical loop, receives convergent inputs from neocortical unimodal and polymodal association areas, and yet preserves some modal segregation (Lavenex & Amaral, 2000; Suzuki, Zola-Morgan, Squire, & Amaral, 1993; Witter et al., 2000). Selective lesions of PR and PH induced mild navigation deficits (Wiig & Bilkey, 1994, 1995; Liu & Bilkey, 1998), qualitatively different from hippocampal deficits, or no deficit at all (Kolb, Buhrmann, McDonald, & Sutherland, 1994; Glenn & Mumby, 1998; Bussey, Muir, & Aggleton, 1999).
Conversely, PR and PH removal disrupted the animal's ability to detect the changed position of a specific object in a familiar environment (Aggleton, Vann, Oswald, & Good, 2000). Accordingly, these lesions enduringly impaired DMS/DNMS (delayed match/nonmatch-to-sample) tasks based on object-location associations in the monkey (Suzuki et al., 1993; Zola-Morgan, Squire, Amaral, & Suzuki, 1989; Zola-Morgan, Squire, & Ramus, 1994) and equivalent navigation tasks in rats (Eichenbaum, Otto, & Cohen, 1994; Wiig & Bilkey, 1994). These tasks can be considered to depend on a simple stimulus-response strategy. PR lesions induced more severe visual recognition deficits than EC lesions, and their effect was doubly dissociated from that of HS lesions (Aggleton et al., 2000). PH and posterior EC lesions produced a more severe spatial deficit than lesions of the rostral PR and EC (Parkinson, Murray, & Miskin, 1988). PR-PH areas remain cortically oriented because stimulus-responsive
cells are more frequent there than in EC. In the model, these two structures were represented by two one-dimensional layers, representing pattern and direction, that combined into a landmark-encoding two-dimensional array.

A second wave of association and pattern encoding was hypothesized to take place in the EC superficial layers, which receive inputs from PR, PH, and other polysensory areas. EC deep layer V receives, via the subiculum, hippocampal backprojections that close the major hippocampal loop (see Figure 1) through a unidirectional internal projection to the superficial EC layers (Kohler, Eriksson, Davies, & Chan-Palay, 1986; Jones, 1993; Witter et al., 2000). EC deep layers also send external projections to the cortex, thus closing the hippocampo-cortical loop. Preferentially, layer II projects to DG and CA3, and layer III to the CA3-CA1 region. The direct EC projections on the CA3-CA1 region are at least as strong as the projections relayed through DG (Yeckel & Berger, 1990). An inhibitory barrier on EC layer II prevents any important traffic in the trisynaptic loop except for high-frequency (7 Hz) firing (Jones, 1993).

Like PR or PH lesions, selective EC lesions induce more severe deficits in DNMS than selective HS lesions (Eichenbaum et al., 1994). More important, extensive EC lesions reduce the fraction of hippocampal cells presenting location-specific firing and the stability of the place fields after maze rotation (Miller & Best, 1980), and they cause spatial deficits comparable to hippocampal deficits (Miller & Best, 1980; Olton, Walker, & Wolf, 1982; Goodlett, Nichols, Halloran, & West, 1989; Schenk & Morris, 1985), confirming the importance of EC spatial information in hippocampal spatial processing.
In an attempt to overcome the limitations of lesion studies, Vann (Vann, Brown, Erichsen, & Aggleton, 2000) found a highly significant increase in C-fos expression in all HS and SUB subfields, in proportion to the (radial maze) task demands on spatial capacities for self-location and navigation. The parahippocampal region showed a lower yet highly significant increase in the C-fos label, with the exception of PR, which reacted only to novel stimuli. Simple spatial rearrangement of familiar icons increased C-fos expression in PH and parts of HS. Finally, place cell–like activity has been recorded in the superficial (Quirk et al., 1992; Sharp, 1999) and deep layers (Frank et al., 2000; Mizumori et al., 1992) of MEC, and in SUB (Sharp & Green, 1994). Furthermore, prospective and retrospective coding and path equivalence (tendency to fire at same relative locations along different paths) in deep EC suggest a coding by these neurons of the similarities between different trajectories at the same relative location with respect to a starting point (rather than precisely coding locations per se), thus relating location and behavior (Frank et al., 2000), and suggesting a dominance of path integration–related information in deep EC layers. In the model, EC (superficial) cells generated place-specific activity by implementing an unsupervised pattern learning on PR-PH inputs. The role of the dentate gyrus (DG) in spatial processing is ambiguous. DG is essential for subtle (but not coarse) spatial pattern discrimination, and
J. Banquet, Ph. Gaussier, M. Quoy, A. Revel, and Y. Burnod
a double dissociation exists between DG lesions associated with deficits in fine spatial discrimination and CA1 lesions associated with deficits in temporospatial sequence learning (Gilbert et al., 2001). Selective destruction of the DG granule cells preserves the spatial selectivity of CA3 cells but induces a spatial learning deficit (McNaughton, Barnes, Meltzer, & Sutherland, 1989). Some coherence emerges from these results if two facts are emphasized: the presence of a weak spatial code in EC and the direct and indirect connections of EC to the downstream structures CA3, CA1, and SUB, which are capable of functioning independently (Yeckel & Berger, 1990). Accordingly, our model assumed that the EC weak spatial code is used for a refined spatial localization by DG (orthogonalization) and also for spatiotemporal sequence learning by CA3-CA1. This hypothesis predicts that selective bilateral EC lesions should impair both fine spatial discrimination by DG and temporal spatial sequence learning by CA1. At present, it is known that deficits in maze performance follow bilateral EC lesions but not bilateral DG lesions in rats (Jarrard, Okaichi, Steward, & Goldschmidt, 1984). Other relevant spatiotemporal characteristics of DG processing are implemented in the model:

1. The anatomical topography reflected by the LEC-MEC subdivision is lost at the DG-CA3 stage due to the laminated projection (Amaral, 1993) of superficial EC neurons on the distal DG-CA3 dendrites. The highly convergent EC projections on DG granules and their divergent widespread distribution on the DG field were believed to further foster intermodal integration.

2. The dominance of feedforward DG activation, in the absence of any significant direct recurrent connectivity between granule cells, was thought to be responsible for the sharp delimitation of DG place fields (Jung & McNaughton, 1993) and was implemented in the model by a full feedforward convergent EC-DG connectivity and a winner-take-all (WTA) long-range competition between active neurons (orthogonalization).

3. Excitatory interneurons (mossy cells), modeled by a local recurrent activation of granule cell assemblies, implemented a delay in DG cell activity that created a sliding window of activation, including past and present events, encoded as an event transition by CA3.

4. The convergence onto CA3 of the direct distal inputs from the perforant pathway and the indirect spatially restricted proximal DG projections onto CA3 (each granule cell contacts at most 15 CA3 pyramidal cells) enforced a pattern of activation on CA3.

Temporal processing and delay activity believed to take place in DG are also a part of HS function:
1. During single stimulus response, an initial monosynaptic activation of the pyramidal cells in CA3-CA1 through direct EC projections was followed by a weaker activation of the same cells through the DG-CA3 trisynaptic route (Yeckel & Berger, 1990). Thus, with spatially close place fields corresponding to temporally overlapping subsets of active PCs, coding for sequentially visited locations could also support the coding of place transitions at the level of neural populations (Banquet, Gaussier, Revel, et al., 2001).

2. A remarkably long time constant of the CA3 NMDA receptors (150 msec) and their capacity for short-term potentiation endow CA3-CA1 with a memory range adapted for learning transitions or short event sequences.

3. A familiarity-dependent, increasing place field overlap in the CA3-CA1 region (Mehta, Barnes, & McNaughton, 1997) could correspond to an earlier anticipation of upcoming fields, when the rat is at the border of the current field.

4. Some hippocampal cells discharge according to the stage of a task, independent of the animal’s location (Eichenbaum, Kuperstein, Fagan, & Nagode, 1987; Wiener, Paul, & Eichenbaum, 1989; Wiener & Korshunov, 1995).

5. Recent developments (Gilbert et al., 2001) in pattern separation paradigms (Chiba, Kesner, & Gibson, 1994; Gilbert, Kesner, & DeCoteau, 1998) confirm a double dissociation between a DG finely grained spatial pattern separation and a CA3-CA1 (spatial) temporal order pattern separation.

2.2 Network Model. The network architecture includes two one-dimensional input layers. A PR “What” layer, receiving pattern codes from temporal areas TE, and a PH “Where” layer, receiving object direction and location codes from posterior parietal cortex (plus V4 in primates), are dedicated, respectively, to the recognition of novel items and their spatial arrangement. These input layers converge on a merging module PR-PH, coding landmark constellations.
Pattern selection-recognition in an EC module results in a weak place code that combines visual and movement-related information. A DG module performs a feedforward self-organizing, competitive separation of patterns (orthogonalization) and their transient storage in working memory. Current direct and delayed indirect inputs to CA3 allow the computation of transitions. These transitions are associated with their corresponding movement vector by convergence of place information and path integration on SUB. An analytical formulation of place coding and recognition, based on a comparison (match-mismatch) between current and memorized views of an environment, summarizes the performances of the different networks.
2.2.1 Network Input. This letter does not aim at a detailed presentation of the process of visual pattern learning (Gaussier, Joulain, Banquet, Leprêtre, & Revel, 2000). In the first visual processing stages, the identification of focal features at the center of subareas partitioning a scene resulted from gradient and curvature extraction, end-stop, and corner detection, among others. The gradient extraction was followed by a convolution with filters (e.g., difference of gaussians) for the detection of corners. A serial search resulted from the emergence of a new winner feature-coding neuron after the inhibition of the previous winner. Typically, the pattern and location of 20 to 30 areas were extracted from a panoramic scene. In mammals, and more so in primates, ocular saccades and pop-out attention play an important role during scene exploration. In our model, sequential snapshots of a scene identified separately “what” (a significant feature and its context) and “where” (azimuth) information, which was then recombined into landmarks. A localization-navigation paradigm (visually based in particular) involves a similarity measure between learned and current views. Such a match mechanism at the level of features allowed a more robust scene recognition than a global correlation (without feature extraction) because the recognition level depended only on the correct recognition of the selected features in their context and on their relative displacement compared to the learned image (see the analytical equation of the model, equation 2.10). A one-shot learning of the patterns took place within the connections between input pathways and the “What” layer, where the pattern was recognized or a new code recruited. The absence of identification of symbolic objects avoided the binding problem related to this process. A given configuration of landmarks (constellation) allowed the recognition of a place.
The whole process simulated a spotlight mechanism, whatever its nature (attention, saccade, head direction), performed by the rotation of the camera.

2.2.2 Model of Perirhinal-Parahippocampal Cortices: “What” and “Where” Input Association. In the model (see Figure 2), for a given landmark l, the effect of lateral diffusion on the activity Θ_j of neuron j on the “Where” PH layer was expressed as a nonnormalized gaussian activity profile:

Θ_j = exp( −((θ_k^l − (2π/N) j) mod 2π)^2 / (2σ^2) ),

where θ_k^l represents the azimuth of the lth landmark and (2π/N) j the preferred direction of neuron j. N represents the number of neurons (120) in the PH “Where” network. The influence on Θ_j of the activity related to the lth landmark decays exponentially as a function of the angular distance between neuron j’s preferred direction and the azimuth of the lth landmark. If this difference is nil (the direction of the lth landmark corresponds to the preferred direction of neuron j), Θ_j = 1. The activity level of each “Where” neuron represented an internal measure of the angular distance between the azimuth of the current head gaze direction and the preferred direction of this neuron.
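As a concrete sketch of the profile above (Python; the activity symbol, the wrapping of the angular difference into [−π, π], and the value of σ are assumptions, since they are garbled or unspecified in this copy):

```python
import numpy as np

N = 120        # number of neurons in the PH "Where" layer (from the text)
SIGMA = 0.3    # tuning width sigma; illustrative value, not given here

def where_activity(theta_l, j, n=N, sigma=SIGMA):
    """Nonnormalized gaussian tuning of "Where" neuron j to a landmark
    at azimuth theta_l; the preferred direction of neuron j is (2*pi/n)*j.
    The angular difference is wrapped into [-pi, pi] so that the lateral
    diffusion respects the circular topology of azimuths."""
    d = (theta_l - 2.0 * np.pi * j / n + np.pi) % (2.0 * np.pi) - np.pi
    return float(np.exp(-d ** 2 / (2.0 * sigma ** 2)))
```

When the landmark azimuth matches the neuron's preferred direction exactly, the activity is 1, as stated in the text.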
The lateral diffusion of activation to neighbor neurons implied that a neuron did not need to be precisely tuned to the direction of a given landmark in order to become active. The neurons belonging to the jth neighborhood and projecting to the PR-PH cells of the kth column are defined by

N_jk = { j : |k (j_max/k_max) − j| < d_Nθ }.

|k (j_max/k_max) − j| < d_Nθ determined the neighborhood of the jth “Where” neuron that projected to neuron lk in the PR-PH network; j_max/k_max was the ratio between the number of neurons in the “Where” layer and the number of columns in the PR-PH network; d_Nθ determined the size of the neighborhood of “Where” cells that project to a single PR-PH cell. This encoding of object direction is consistent with a polar coordinate system. Ultimately, object direction was referred to the body axis orientation, which itself referred to an external reference. This external reference allowed landmark information to be aligned with the environment and also independent of the orientation of the agent. In vivo, the head direction system, scattered in different brain structures and integrated into the hippocampal system in the subiculum (Sharp, Blair, Etkin, & Tzanetos, 1995) or in a HS-SUB-EC loop (Redish & Touretzky, 1997), is believed to perform this function. The activity of pattern-encoding PR and direction-encoding PH converged on the PR-PH two-dimensional array that merged the “What” and “Where” streams to code landmarks by a product (pi, AND operator). PR-PH is a “necessary” zone of convergence for “What” and “Where” information. This convergence has been proven by the recording of neurons in different structures (PH, EC, CA3) that respond specifically to one object in a given location (Rolls & Treves, 1998). Therefore, several possible structures or neuron populations could correspond to the PR-PH network. It could be PH, since strong connections exist between PR and PH, or even a subpopulation of neurons in EC superficial layers that includes both stellate and pyramidal cells. AND operations in biological networks can be performed by the staged merging of excitatory synapses on dendritic trees (Shepherd, 1993). All the cells of a column of the PR-PH matrix received inputs from the same neighborhood in the “Where” layer. These neighborhoods partially overlapped.
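A minimal sketch of this neighborhood rule, assuming integer neuron indices; the function name and the example sizes are illustrative, not from the text:

```python
def where_neighborhood(k, j_max, k_max, d_theta):
    """Indices of "Where" neurons projecting to column k of the PR-PH
    array: those within d_theta of the column's center k * j_max / k_max,
    where j_max / k_max is the ratio of "Where" neurons to PR-PH columns."""
    center = k * j_max / k_max
    return [j for j in range(j_max) if abs(center - j) < d_theta]
```

With 120 "Where" neurons and, say, 30 PR-PH columns, adjacent columns receive partially overlapping neighborhoods, as stated above.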
In summary, four characteristics of the network deserve to be emphasized:

1. Although full feedforward connectivity between the “Where” and PR-PH networks led to accurate performance, PR-PH units received input from only a fraction of the “Where” units in order to increase the capacity of the network.

2. Only maximally active inputs were learned by the PR-PH neurons.
3. Due to the input codes, the level of activation of product neurons reflected the angular distance of the corresponding landmark to the current head gaze direction.

4. Assuming that the visual system cannot recognize several patterns in parallel, we use an automatic spotlight system to explore the visual scene sequentially according to a saliency map. This sequential exploration makes “What” and “Where” information temporally correlated and bound. The time-sliced sensory sweep performed by the visual system is corrected by the PR-PH working memory, which bridges the temporal gap introduced by the sequential exploration (EC delay neurons). A similar mechanism has been demonstrated for visual saccades in posterior parietal cortex.

The discrete equation of the PR-PH neuron activity X_kl^prph is

X_kl^prph(t + dt) = [ X_kl^prph(t) + I_kl − X_kl^prph(t) Σ_m In_m W_m,kl^(In−prph) ]^+,   (2.1)

where [x]^+ = x if x > 0, and 0 otherwise.
The excitatory component of equation 2.1 includes I_kl, a global input to neuron kl detailed below, and X_kl^prph(t), a memory term allowing the buildup of a landmark constellation and fluctuating between 0 and 1. The inhibitory term in equation 2.1 induces a reset of the representation of a learned landmark constellation. In_m represents the activity of the mth inhibitory interneuron, triggered by a sensorimotor reset signal at T, 2T, 3T, . . . , nT, where T is a constant period for a visual panoramic exploration of the scenery; W_m,kl^(In−prph) represents fixed weights between the inhibitory interneuron m and a PR-PH pyramidal cell kl. I_kl, the global input to neuron kl of the PR-PH matrix, is computed as a product:

I_kl = max_{i∈N_li} ( L_i W_i,kl^(pr−prph) ) · max_{j∈N_lj} ( Θ_j W_j,kl^(ph−prph) ).   (2.2)
W_i,kl^(pr−prph) (W_j,kl^(ph−prph)) are the connection weights between any ith landmark (jth azimuth) input and the kl PR-PH neuron; L_i and Θ_j represent the “What” and “Where” network inputs, respectively. The synaptic weights between input unit j and the PR-PH neurons learn in one trial, in the absence of inhibitory reset and only for maximal input lines:

W_j,kl^(ph−prph) = L_i · Θ_j · f(I − In).   (2.3)
Here i = arg max_{p∈N_li} L_p and j = arg max_{q∈N_kj} Θ_q; In is an inhibitory reset activity that prevents learning in case of reset; I is

I = [ (max_{i∈N_li} L_i) + (max_{j∈N_kj} Θ_j) ] / 2.   (2.4)
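Equations 2.2 and 2.4 can be sketched as follows, assuming scalar activities in [0, 1]; the learning threshold of 0.99 comes from the definition of f given in the text, and all function and variable names are illustrative:

```python
def prph_input(L, theta, W_pr, W_ph, nbr_what, nbr_where):
    """Global input to one PR-PH neuron (equation 2.2): the product (AND)
    of the best weighted "What" and "Where" contributions within the
    neuron's input neighborhoods."""
    what = max(L[i] * W_pr[i] for i in nbr_what)
    where = max(theta[j] * W_ph[j] for j in nbr_where)
    return what * where

def learning_gate(L, theta, nbr_what, nbr_where, inhibited=False):
    """One-shot learning modulation f(I - In), with I the mean of the two
    maximal inputs (equation 2.4); learning fires only when I - In > 0.99,
    that is, when both inputs are near saturation and no reset is active."""
    I = (max(L[i] for i in nbr_what) + max(theta[j] for j in nbr_where)) / 2.0
    In = 1.0 if inhibited else 0.0
    return 1.0 if (I - In) > 0.99 else 0.0
```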
f(x) = 1 if x > 0.99 and 0 otherwise; this thresholded Heaviside function corresponds to a learning modulation common to all active neurons. The max operator in equations 2.2 through 2.4 expressed a competition between “Where” neurons belonging to the same neighborhood of inputs to PR-PH neurons. Thus, the optimally tuned “Where” neuron could take control of PR-PH neuron activation and learn the corresponding pattern-azimuth conjunction. In summary, the PR-PH network has two functions: to bind the “What” and “Where” information in order to create a landmark and to bridge the temporal gap between successive landmarks (working memory) in order to create a landmark constellation or view that is directly learned or recognized as a place by EC.

2.2.3 Entorhinal Cortex and Place Coding. In the second wave of integration and association—between sensory (visual) inputs and path integration—the emergence of place cell–like activity in EC is accounted for in the model by a summation (OR operator) that complements the AND operator of the PR-PH network to globally perform a sigma-pi. The activity X_j^ec of an EC pyramidal neuron j coding for places is given by

X_j^ec = f_{D_j} ( Σ_{kl∈N_kl} W_kl,j^(prph−ec) X_kl^prph ),   (2.5)
where f_{D_j}(x) represents an output function that performs a learning-dependent tuning of the EC neuron response, such that the response, which is weak and mildly specific before learning, becomes larger for specific inputs after learning:

f_{D_j}(x) = D_j · r · exp( (−Vig + 0.01)(1 − x/r)^2 / (σ_j (D_j − 1.01)^2) ).   (2.6)
The three parameters (D, r, Vig) modulated the height, width, and slope of the gaussian function: (1) D, a neuron tuning factor, increased with learning; (2) Vig, a vigilance parameter, was the inverse of the activity level resulting from the comparison between memorized and new input patterns; and (3) r, a scaling factor, allowed the output integration on EC to work at constant energy in spite of the fluctuations in input levels.
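A sketch of the tuning function of equation 2.6 (Python; the argument defaults are illustrative, and the sign convention follows the reconstructed exponent, under which Vig > 0.01 gives a decaying gaussian):

```python
import math

def f_D(x, D, r=1.0, vig=1.0, sigma_j=1.0):
    """Learning-dependent output tuning of an EC neuron (equation 2.6):
    a gaussian in x/r with peak height D * r at x = r; D is the tuning
    factor that increases with learning, vig the vigilance parameter,
    and r a scaling factor."""
    expo = (-vig + 0.01) * (1.0 - x / r) ** 2 / (sigma_j * (D - 1.01) ** 2)
    return D * r * math.exp(expo)
```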
A local competition was implemented:

X_j^ec = X_j^ec if X_j^ec = max_{i:|i−j|<d} X_i^ec, and X_j^ec = 0 otherwise.

E_1 = {(ω_1, H_1) : f(ω_1, H_1|D) > 0}, . . . , E_m = {(ω_m, H_m) : f(ω_m, H_m|D) > 0}. For convenience in programming, the models are ordered by the number of hidden units in this article. The order of the models has no effect on the convergence of the algorithm. If CMC is run for this problem with the choice ψ(·) = P(D|ω, H)P(ω|H), theorem 1 implies that ĝ^(s,k)(E_i) will converge to the evidence of the model H_i (up to a multiplicative factor) in probability as δ_s → 0 and K_s → ∞, that is,

ĝ^(s,k)(E_i) → g(E_i) = c ∫_{X_i} P(D|ω, H)P(ω|H) dω,   for i = 1, . . . , m.
Thus, if g(E_j) > 0, ĝ^(s,k)(E_i)/ĝ^(s,k)(E_j) will converge in probability to g(E_i)/g(E_j), the ratio of the evidence of the models H_i and H_j. Following is an implementation of CMC for evidence evaluation for Bayesian MLPs. Let Q = (q_ij) be a stochastic matrix, where q_ij denotes the proposal probability of the transition from model H_i to H_j. In this article, Q is a triangular matrix with q_{i,i−1} = q_{i,i} = q_{i,i+1} = 1/3 for 1 < i < m, q_{1,1} = q_{m,m} = 2/3, and q_{1,2} = q_{m,m−1} = 1/3. Partition the sample space according to the index of the models, initialize ĝ^(0,0)(E_1) = · · · = ĝ^(0,0)(E_m) = 1, and iterate between the following steps:

1. Draw a random number u from the uniform distribution Unif(0,1).

2. If u < q_{i,i−1}, try a “death” move, which is to delete a hidden unit from the current model, that is, to jump from model H_i to model H_{i−1}. If the move is accepted, set i ← i − 1.

3. Otherwise, if u < q_{i,i−1} + q_{i,i}, try a “within-subregion” move, which is to update the parameters of the current model, that is, to draw a new sample from the posterior distribution f(ω_i|H_i, D) of model H_i.

4. Otherwise, try a “birth” move, which is to add a new hidden unit to the current model, that is, to jump from model H_i to model H_{i+1}. If the move is accepted, set i ← i + 1.

The acceptance of the “birth,” “death,” and “within-subregion” moves is in all cases guided by equation 2.2. For the within-subregion move, equation 2.6 reduces to the conventional Metropolis-Hastings rule; the only difference is that the ĝ^(s,k)’s need to be updated here. For the sample space partition used in this article, the within-subregion move is also a within-model move, that is, an attempt to update ω^(s,k), the parameter vector of the current model, to a new vector in the parameter space of the current model. In this article, ω^(s,k) is updated in blocks as follows. The weights on the connections fed to the same hidden or output unit are grouped into one block.
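The proposal scan above can be sketched as follows (Python; the acceptance steps guided by equation 2.2 are omitted, and the boundary handling folds the impossible jump into the within-subregion probability, matching q_{1,1} = q_{m,m} = 2/3):

```python
import random

def propose_move(i, m, u=None):
    """Choose a CMC move type for the current model index i (1-based,
    among m models) using the triangular proposal matrix Q of the text:
    death with probability q_{i,i-1}, within-subregion with q_{i,i},
    birth otherwise."""
    q_death = 1.0 / 3 if i > 1 else 0.0
    q_within = 1.0 / 3 if 1 < i < m else 2.0 / 3
    if u is None:
        u = random.random()   # step 1: u ~ Unif(0, 1)
    if u < q_death:
        return "death"        # step 2: try H_i -> H_{i-1}
    if u < q_death + q_within:
        return "within"       # step 3: update parameters of H_i
    return "birth"            # step 4: try H_i -> H_{i+1}
```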
Evidence Evaluation for Bayesian Neural Networks
This grouping method usually works well for networks with a small or moderate number of input and hidden units. For networks with a large number of input and hidden units, the groups may need to be broken down further. Suppose that the connection weights have been grouped into blocks. One block is then selected at random, and the weights in the selected block are updated with a spherical proposal distribution: a direction is first generated uniformly, and then the radius is drawn from N(0, ς_m^2), where ς_m^2 is calibrated such that the acceptance rate of the CMC moves is about 0.2 to 0.4. This proposal is symmetric in the sense that the transition probability ratio T(ω* → ω^(s,k))/T(ω^(s,k) → ω*) = 1, where ω* denotes the proposed parameter vector. It is easy to see that the within-subregion move is just a Metropolis-Hastings move in the above implementation. We note that the within-subregion move can be any MCMC move performed in a fixed-dimensional space, say, the Gibbs sampler (Geman & Geman, 1984) or hybrid Monte Carlo (Neal, 1996). In the birth move, a new hidden unit is proposed to be added to the current model, and the weights on the new connections are drawn from N(µ_b, ς_b^2), where µ_b and ς_b^2 are the sample mean and sample variance of the weights of the current model, respectively. The resulting transition probability ratio is

T(ω* → ω^(s,k)) / T(ω^(s,k) → ω*) = (q_{i+1,i} / q_{i,i+1}) · (1/(i + 1)) · 1 / ∏_{j=1}^{ξ} φ((ω_j − µ_b)/ς_b),

where ξ is the number of newly added connections, ω_1, . . . , ω_ξ denote the weights drawn for the new connections, and φ(z) is the density function of the standard normal random variable z. In the death move, a hidden unit is selected at random, with the proposal to delete it from the model. Let µ_d and ς_d^2 denote the sample mean and sample variance of the connection weights of the current model, except for those on the connections fed to or exiting from the selected hidden unit.
The resulting transition probability ratio is

T(ω* → ω^(s,k)) / T(ω^(s,k) → ω*) = (q_{i−1,i} / q_{i,i−1}) · i · ∏_{j=1}^{ξ} φ((ω_j − µ_d)/ς_d),
where ω1 , . . . , ωξ denote the connection weights on the connections fed to or exiting from the selected hidden unit. In the above implementation of CMC, each of the candidate models is essentially trained by the Metropolis-Hastings algorithm, which is used in the within-subregion moves to draw samples from the posterior f (ωi |Hi , D). Hence, the performance of CMC will depend on the effectiveness of the Metropolis-Hastings algorithm for the problems under consideration. In the current implementation, CMC may not work well for a set of large
F. Liang
MLPs, due to the inability of the Metropolis-Hastings algorithm to sample from the posterior distributions of large MLPs. This difficulty can be alleviated by the following techniques. The first is to replace the Metropolis-Hastings algorithm by an advanced MCMC algorithm that performs in a fixed-dimensional space, say, hybrid Monte Carlo (Neal, 1996) or the Langevin algorithm (Grenander & Miller, 1994; Phillips & Smith, 1996). The second is to integrate out as many parameters as possible from the posterior f(ω, H|D). The theoretical basis is the Rao-Blackwell theorem (Casella & Berger, 2002; Liu, 2001), which suggests that an analytical integration is helpful for reducing the variance of Monte Carlo computation. In the context of Bayesian neural networks, this is also suggested experimentally by Denison, Holmes, Mallick, and Smith (2002). In this letter, for regression MLPs, we integrate out some of the parameters, including the variance of the random errors and all the weights on the connections from the hidden units to the output unit; for classification MLPs, we cannot do so. Hence, sampling from the posterior of a classification MLP is usually more difficult than sampling from the posterior of a regression MLP, provided that the other conditions, say, the network structure and the amount of data, are similar. The third is to repartition the sample space according to the index of the models and the energy function jointly. This reduces the dependence of CMC on the algorithm used in the within-subregion moves. This technique will be discussed at the end of this article.
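The birth and death proposal ratios above can be sketched as follows (Python; q is a proposal matrix indexed by model number, and the density φ is applied to the standardized weights exactly as written in the text; all names are illustrative):

```python
import math

def phi(z):
    """Standard normal density."""
    return math.exp(-z * z / 2.0) / math.sqrt(2.0 * math.pi)

def birth_ratio(q, i, new_weights, mu_b, s_b):
    """T(w* -> w^(s,k)) / T(w^(s,k) -> w*) for a birth move H_i -> H_{i+1}:
    the reverse move deletes one of the i + 1 units; the forward move
    draws the xi new weights from N(mu_b, s_b^2)."""
    prod = 1.0
    for w in new_weights:
        prod *= phi((w - mu_b) / s_b)
    return (q[i + 1][i] / q[i][i + 1]) / (i + 1) / prod

def death_ratio(q, i, old_weights, mu_d, s_d):
    """T(w* -> w^(s,k)) / T(w^(s,k) -> w*) for a death move H_i -> H_{i-1}."""
    prod = 1.0
    for w in old_weights:
        prod *= phi((w - mu_d) / s_d)
    return (q[i - 1][i] / q[i][i - 1]) * i * prod
```

With matched weights and moments, a birth from H_i followed by the corresponding death from H_{i+1} multiplies to 1, a basic reversibility check.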
3 Evidence Evaluation for Regression MLPs

We first consider the evidence evaluation for regression MLPs. Let D = {(x_t1, . . . , x_tP; y_t) : t = 1, . . . , n} denote the training data, where x_1, . . . , x_P are explanatory variables and y is the response variable. Let the data be modeled by a one-hidden-layer MLP with P input units and H hidden units. The model can be written as

y_t = β_0 + Σ_{i=1}^{H} β_i ϕ( γ_i0 + Σ_{j=1}^{P} x_tj γ_ij ) + ε_t,   t = 1, . . . , n,   (3.1)
where ε_t ∼ N(0, σ^2), and the β_i’s and γ_ij’s denote the connection weights from the hidden units to the output unit and from the input units to the hidden units, respectively. Here, the bias unit is treated as a special input unit with a constant input, say, 1. ϕ(·) is the activation function. It is set to the hyperbolic tangent function in this section. The hyperbolic tangent function is equivalent to the sigmoid function in modeling, but it often leads to a faster convergence in training than the sigmoid function.
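The model of equation 3.1 can be simulated directly; the sketch below (Python with NumPy) uses the parameter values of example 1 in the next subsection, with an arbitrary random seed:

```python
import numpy as np

def mlp_mean(X, beta, gamma):
    """Mean function of the regression MLP in equation 3.1:
    beta[0] + sum_i beta[i] * tanh(gamma[i, 0] + x . gamma[i, 1:]).
    X is n x P, beta has length H + 1, and gamma is H x (P + 1) with
    the hidden-unit biases in column 0."""
    out = np.full(X.shape[0], beta[0], dtype=float)
    for i in range(gamma.shape[0]):
        out += beta[i + 1] * np.tanh(gamma[i, 0] + X @ gamma[i, 1:])
    return out

# Simulated data with the example 1 parameters (n = 30, sigma = 0.2)
rng = np.random.default_rng(0)
X = rng.uniform(-2.0, 2.0, size=(30, 2))
gamma = np.array([[-0.5, 1.0, 3.0], [0.5, 2.0, 2.0]])
beta = np.array([-1.0, 3.0, -1.5])
y = mlp_mean(X, beta, gamma) + rng.normal(0.0, 0.2, size=30)
```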
To work with the evidence framework, we specify the following priors for the model. First, we assume that H ∼ Unif{H1 , . . . , Hm }, that is, P(Hi ) = 1/m for i = 1, . . . , m. Independent of the prior P(H), we assume
σ 2 ∼ Inv-Gamma(ν, η), βi |σ 2 ∼ N(0, τβ σ 2 ),
i = 0, . . . , K
γi j|σ ∼ N(0, τγ σ ),
i = 1, . . . , K ,
2
2
(3.2) j = 0, . . . , P,
where ν, η, τ_β, and τ_γ are hyperparameters to be specified by the user. We note that a more sophisticated prior specification for the connection weights of MLPs is the automatic relevance determination (ARD) prior (MacKay, 1995; Neal, 1996), which performs a soft feature selection for MLPs. However, the prior, equation 3.2, has its own advantages. First, it leads to a closed-form expression for the marginal posterior of γ = {γ_ij : i = 1, . . . , K, j = 0, . . . , P} by integrating out β = (β_0, . . . , β_K) and σ^2 explicitly. Second, the prior, equation 3.2, is essentially noninformative if we restrict ν ≤ 1. If ν ≤ 1, it is easy to calculate that E(σ^2) = ∞, Var(σ^2) = ∞, Var(β_i) = E(Var(β_i|σ^2)) + Var(E(β_i|σ^2)) = τ_β E(σ^2) = ∞, and Var(γ_ij) = ∞. A vague and proper prior is usually preferred in Bayesian model selection problems, as the evidence of the models can be sensitive to the prior. (See Kass & Raftery, 1995, for more discussion on the choice of priors.) With the above prior specification, we have the following posterior distribution:

f(β, γ, σ^2, H|D) ∝ (1/(2πσ^2))^(n/2) exp{ −(1/(2σ^2)) Σ_{t=1}^{n} [ y_t − β_0 − Σ_{i=1}^{H} β_i ϕ(γ_i0 + Σ_{j=1}^{P} x_tj γ_ij) ]^2 }
   × (1/(2πτ_β σ^2))^((H+1)/2) exp{ −(1/(2τ_β σ^2)) Σ_{i=0}^{H} β_i^2 }
   × (1/(2πτ_γ σ^2))^((P+1)H/2) exp{ −(1/(2τ_γ σ^2)) Σ_{i=1}^{H} Σ_{j=0}^{P} γ_ij^2 }
   × ( η^ν e^(−η/σ^2) / (Γ(ν)(σ^2)^(ν+1)) ) × (1/m).   (3.3)
Integrating out β and σ^2 and taking the logarithm, we have

−log f(γ, H|D) = Const + (H(P+1)/2) log(2π) + ((H+1)/2) log(τ_β) + (H(P+1)/2) log(τ_γ) + (1/2) log|B|
   − log Γ( n/2 + H(P+1)/2 + ν )
   + ( n/2 + H(P+1)/2 + ν ) log( η + (1/2) y'y − (1/2) y'Z B^{-1} Z'y + (1/(2τ_γ)) Σ_{i=1}^{H} Σ_{j=0}^{P} γ_ij^2 ),   (3.4)
where “Const” is the constant term independent of γ and H, y = (y_1, . . . , y_n)', Z = (z_1, . . . , z_n)' is a matrix with the tth row z_t = [1, ϕ(γ_10 + Σ_{j=1}^{P} x_tj γ_1j), . . . , ϕ(γ_H0 + Σ_{j=1}^{P} x_tj γ_Hj)], and B = (1/τ_β) I_{H+1} + Z'Z. The I_{H+1} denotes the identity matrix of order H + 1. We note that the posterior distribution is invariably a nonlinear and multimodal function. For example, the posterior is invariant with respect to an arbitrary relabeling of the hidden units and to simultaneous changes of the sign of β_i and the γ_ij’s with j = 0, . . . , P. Müller and Insua (1998) suggested avoiding the multimodality by imposing a relabeling constraint on the parameter space, say, 0 < γ_10 < . . . < γ_H0. In this article, we impose no constraint on the parameter space. In theory, this is straightforward and complete for computing the model evidence. Here we mention one point: the model evidence is invariant with respect to the relabeling constraint. This can be seen from equation 1.2, which depends only on the values of the integrand, and these are invariant with respect to the relabeling constraint.

3.1 Example 1: Simulated Data. We simulated y_1, . . . , y_n from equation 3.1 with P = 2, H = 2, γ_1 = (γ_10, γ_11, γ_12) = (−0.5, 1, 3), γ_2 = (γ_20, γ_21, γ_22) = (0.5, 2, 2), β = (β_0, β_1, β_2) = (−1, 3, −1.5), n = 30, σ = 0.2, and x_i,j ∼ Unif(−2, 2) for i = 1, . . . , n and j = 1, 2. For this example, we consider the models with one to four hidden units, which are denoted by H_1, . . . , H_4, respectively. In the simulation, we set the hyperparameters ν = 1, η = 1, and τ_β = τ_γ = 100, and set the CMC parameters ρ = 0, δ_1 = 0.01, δ_{s+1} = √(δ_s + 1) − 1, K_1 = 10^5, and K_{s+1} = 1.2K_s. The CPU time cost by a single run is about 43 seconds. Figure 3 shows the path of the model transition at the last stage of the simulation, with δ = 1.56 × 10^{-4}, where one point was plotted every 100 iterations. CMC leads to a free
Figure 3: Transition path of the models in a CMC run for example 1. The samples are collected every 100 iterations in the simulation. Table 3: Sensitivity Analysis for Hyperparameters for Example 1. Model Frequency (%) Setting (1,100) H1 H2 H3 H4
24.61 (0.26) 25.18 (0.13) 25.13 (0.08) 25.07 (0.15)
Evidence (1,10)
Evidence (1,100)
Evidence (1,1000)
Evidence (0.1,100)
0.15 (0.01) 70.31 (0.43) 24.82 (0.34) 4.72 (0.10)
0.08 (0.01) 84.99 (0.39) 13.77 (0.34) 1.16 (0.05)
4.14 (0.99) 81.03 (1.87) 13.90 (1.01) 0.92 (0.09)
0.00 (0.00) 82.11 (0.36) 16.14 (0.32) 1.75 (0.05)
Notes: The choices of hyperparameters are shown in the second row, where (a , b) represents ν = η = a and τβ = τγ = b. The Frequency (%) and Evidence columns show the relative sampling frequencies and the estimates of the evidence of the respective models. The numbers in parentheses are the standard deviations of the preceding estimates.
The model transition paths in the other stages are similar. Later we repeated the run 10 times. In these runs, the constraint Σ_{i=1}^{m} ĝ(E_i) = 100 was imposed on the ĝ(E_i)’s to eliminate the unknown constant c of equation 2.3. Note that the same constraint was imposed on the ĝ(E_i)’s in all simulations of this article. The estimates of the evidence (up to a multiplicative factor) and the relative sampling frequencies of the models were reported in the columns “Evidence (1,100)” and “Frequency (%) (1,100)” of Table 3, respectively. It shows that the true model H_2 has been identified by CMC. To assess the sensitivity of the evidence to the hyperparameters, we tried the choices ν = η = 1 and τ_β = τ_γ = 10, ν = η = 1 and τ_β = τ_γ = 100, and ν = η = 0.1 and τ_β = τ_γ = 100. For each choice, CMC was run 10 times. The
Table 4: Sensitivity Analysis for the Sample Size for Example 1.

Sample Size    10             20             25             30             40             50
Error Rate     8              1              0              0              0              0
H1             56.49 (7.78)   14.35 (7.60)   0.57 (0.39)    0.05 (0.05)    0.00 (0.00)    0.00 (0.00)
H2             36.20 (6.18)   71.35 (6.19)   83.94 (1.34)   84.82 (0.85)   86.40 (0.74)   87.99 (0.42)
H3             6.57 (1.45)    13.02 (1.52)   14.21 (1.12)   13.95 (0.75)   12.66 (0.66)   11.25 (0.38)
H4             0.73 (0.18)    1.28 (0.20)    1.27 (0.20)    1.17 (0.11)    0.95 (0.08)    0.76 (0.04)

Notes: The second row shows the error rate (out of 10) with which the true model could not be identified by CMC. The other rows show the estimates of the evidence of the respective models and the standard deviations (in parentheses) of the estimates.
computational results are also summarized in Table 3. Since CMC samples from each of the models equally, we omitted the relative sampling frequencies and report only the estimates of the evidence of the models. The results show that the evidence framework is rather robust to the variation of the hyperparameters. The true model can be identified with all choices of hyperparameters shown in Table 3. Table 3 also shows that the evidence framework tends to select a more complex model as τ_β and τ_γ decrease. This is consistent with our intuition: a more complex model is needed to compensate for the limited effect of each connection weight. Relatively, the evidence framework is more robust to the variation of ν and η. Next, we evaluated the sensitivity of the evidence framework to the amount of data. We varied the sample size n over 10, 20, 25, 30, 40, and 50. For each value of n, 10 independent data sets were generated. For each of the data sets, CMC was run once with ν = η = 1 and τ_β = τ_γ = 100. Table 4 shows the estimates of the evidence of the models, which are calculated by averaging over the 10 data sets. It shows that the evidence framework responds very well to the amount of data. Note that the total number of parameters of the true model is 10. With as few as 25 observations, the method was already able to identify the true model for all data sets. As expected, the method tends to select a simpler model if the data information is not sufficient and tends to select the true model as the data information increases. For comparison, the RJMCMC and gaussian approximation methods were also applied to the data sets of n = 50. RJMCMC was implemented as CMC except that the birth, death, and Metropolis moves are guided by the Metropolis-Hastings rule instead of equation 2.2. For each of the 10 data sets, RJMCMC was run with 1.7 × 10^6 iterations, which costs about the same CPU time as a run of CMC.
These runs produce the following estimates for the evidence of the respective models: H1 : 0.00(0.00); H2 : 86.53(0.62); H3 : 12.51(0.53); and H4 : 0.96(0.09), where the numbers in the parentheses are the standard deviations of the preceding estimates. These estimates are quite consistent with those obtained by CMC, although the accuracy is slightly worse.
Evidence Evaluation for Bayesian Neural Networks
1401
In the gaussian approximation method, the evidence, equation 1.3, is calculated for an MLP model as follows (Bishop, 1995). Let

$$ E_D = \frac{1}{2}\sum_{t=1}^{n}\Big[y_t - \beta_0 - \sum_{i=1}^{H}\beta_i\,\varphi\Big(\gamma_{i0} + \sum_{j=1}^{P} x_{tj}\gamma_{ij}\Big)\Big]^2, \qquad E_w = \frac{1}{2}\Big[\sum_{i=0}^{H}\beta_i^2 + \sum_{i=1}^{H}\sum_{j=0}^{P}\gamma_{ij}^2\Big]; $$

let $(\beta^{MP}, \gamma^{MP}) = \arg\min_{\beta,\gamma}\, \zeta E_D + \lambda E_w$ be the minimizer of $\zeta E_D + \lambda E_w$; let $B = \zeta\,\nabla\nabla E_D$ be the Hessian matrix of the training error function $E_D$, $d$ the order of $B$, $\nu_1, \ldots, \nu_d$ the eigenvalues of $B$, $\nu = \sum_{i=1}^{d} \nu_i/(\nu_i + \lambda)$ the effective number of parameters, and $A = B + \lambda I$. The log evidence of an MLP model with $H$ hidden units is approximately

$$ \log f(D|H) = -\lambda^{MP} E_w^{MP} - \zeta^{MP} E_D^{MP} - \frac{1}{2}\log\det(A) + \frac{d}{2}\log\lambda^{MP} + \frac{n}{2}\log\zeta^{MP} + \log H! + 2\log H + \frac{1}{2}\log\frac{2}{\nu} + \frac{1}{2}\log\frac{2}{n-\nu}, \tag{3.5} $$

where $E_w^{MP}$ and $E_D^{MP}$ are the values of $E_w$ and $E_D$ evaluated at $(\beta^{MP}, \gamma^{MP})$, and $\lambda^{MP}$ and $\zeta^{MP}$ are the choices of $\lambda$ and $\zeta$ that maximize the log evidence. The estimates produced by equation 3.5 for the evidence of the respective models are H1: 0.00 (0.00); H2: 86.25 (2.57); H3: 10.23 (1.50); and H4: 3.52 (1.98), where the numbers in parentheses are the standard deviations of the preceding estimates. These estimates are not very accurate in comparison with those produced by CMC and RJMCMC. This is consistent with the findings of other authors (Bishop, 1995): the estimates produced by the gaussian approximation method are often not very accurate.
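As a concrete illustration, the approximation of equation 3.5 reduces to a few lines of code once the mode and the Hessian are available. The sketch below is our own, not the paper's implementation; the argument names mirror the symbols defined above, and the effective number of parameters $\nu$ is assumed to be computed by the caller from the eigenvalues of $B$:

```python
import numpy as np

def log_evidence_laplace(E_D, E_w, A, lam, zeta, n, d, H, nu):
    """Gaussian (Laplace) approximation to the log evidence of an MLP
    with H hidden units, in the spirit of equation 3.5.  All arguments
    are the quantities defined in the text, evaluated at the mode:
    E_D, E_w  -- data and weight error at (beta^MP, gamma^MP)
    A         -- Hessian B + lam * I (d x d matrix)
    lam, zeta -- hyperparameters at their MP values
    n, d      -- numbers of observations and of parameters
    nu        -- effective number of parameters (caller-supplied)
    """
    _, logdet_A = np.linalg.slogdet(A)   # stable log-determinant
    return (-lam * E_w - zeta * E_D
            - 0.5 * logdet_A
            + 0.5 * d * np.log(lam)
            + 0.5 * n * np.log(zeta)
            + sum(np.log(k) for k in range(1, H + 1))   # log H!
            + 2.0 * np.log(H)
            + 0.5 * np.log(2.0 / nu)
            + 0.5 * np.log(2.0 / (n - nu)))
```

The symmetry factor $\log H! + 2\log H$ accounts for the hidden-unit permutation and sign symmetries of the MLP posterior.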
3.2 Example 2: Ischemic Heart Disease. The ischemic heart disease data (Kutner, Nachtsheim, & Neter, 2004) were collected by a health insurance plan and provide information on 788 subscribers who made claims resulting from coronary heart disease. The response variable is the natural logarithm of the total cost of services provided, and the predictors include the number of interventions or procedures carried out, the number of tracked drugs prescribed, the number of other diseases that the subscriber had during the period, and the number of other complications that arose during heart disease treatment. Kutner et al. (2004) used the data set to illustrate MLPs as a nonlinear regression model. In this article, we use it to compare CMC, RJMCMC, and the gaussian approximation method for evidence evaluation for Bayesian MLPs. For this example, we consider the models $H_1, \ldots, H_7$, which have one to seven hidden units, respectively. The first 600 observations were used for model building. CMC was run five times with $\nu = \eta = 1$, $\tau_\beta = \tau_\gamma = 100$, $K_1 = 60000$, $K_{s+1} = 1.2 K_s$, $\delta_1 = 0.01$, $\delta_{s+1} = \delta_s/\sqrt{s+1}$, and $\delta_{\mathrm{end}} = 5 \times$
1402
F. Liang
Table 5: Model Selection for Example 2.

| Model | CMC              | RJMCMC           | Gaussian Approximation |
| H1    | 0.0047 (0.0044)  | 0.0008 (0.0008)  | 0.0061 |
| H2    | 92.5505 (0.7151) | 93.2279 (1.6667) | 96.2062 |
| H3    | 7.1970 (0.6787)  | 6.5628 (1.5878)  | 3.5778 |
| H4    | 0.2425 (0.0495)  | 0.2038 (0.0769)  | 0.2042 |
| H5    | 0.0053 (0.0017)  | 0.0041 (0.0024)  | 0.0046 |
| H6    | 0.0001 (0.0000)  | 0.0004 (0.0002)  | 0.0006 |
| H7    | 0.0000 (0.0000)  | 0.0001 (0.0000)  | 0.0006 |

Notes: Hi denotes the MLP model with i hidden units. Columns 2 and 3 give the estimates of the evidence, with the standard deviations of the estimates in parentheses.
Table 6: Training and Prediction Errors for Example 2.

| Model | Training Error  | Prediction Error |
| H1    | 0.3975 (0.0008) | 0.2794 (0.0014) |
| H2    | 0.3771 (0.0032) | 0.2744 (0.0033) |
| H3    | 0.3650 (0.0011) | 0.2890 (0.0032) |
| H4    | 0.3609 (0.0013) | 0.2904 (0.0022) |
| H5    | 0.3595 (0.0015) | 0.2866 (0.0049) |
| H6    | 0.3581 (0.0017) | 0.2778 (0.0038) |
| H7    | 0.3570 (0.0006) | 0.2800 (0.0023) |

Note: Columns 2 and 3 give the training and prediction errors, with their standard deviations in parentheses.
$10^{-5}$. The CPU time cost by each run is about 850 seconds. The computational results are summarized in Table 5. For comparison, RJMCMC and the gaussian approximation method were also applied to this example. RJMCMC was run five times independently; each run consists of 2e+6 iterations and costs about the same CPU time as a run of CMC. The gaussian approximation method evaluates the evidence of the models with equation 3.5. The results are shown in Table 5, which shows that the estimates produced by CMC, RJMCMC, and the gaussian approximation method are consistent: all identify model H2 as the most probable model. We also compared the generalization ability of models $H_1, \ldots, H_7$. Each of the models was simulated with the Metropolis moves for 1.1e+5 iterations, where the first 10,000 iterations were discarded as burn-in and the remaining iterations were used for prediction. Table 6 shows the training and prediction errors produced by the models in five runs. The model H2 produces the smallest prediction error. The larger-evidence models tend to produce smaller prediction errors, although the
Table 7: Model Selection for Example 3.

| Model    | H1           | H2           | H3           | H4          | H5          |
| Evidence | 27.53 (1.91) | 60.76 (1.59) | 10.64 (0.30) | 1.00 (0.03) | 0.07 (0.00) |
| Notes    | Best BIC     | Best AICc    |              |             |             |

Notes: Hi denotes the model with i hidden units. The estimates of the evidence and their standard deviations (in parentheses) are computed by averaging over 10 independent runs.
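The table entries of the form "estimate (std)" throughout this section are obtained by aggregating independent runs. A minimal sketch of that bookkeeping (our own helper, not part of CMC; reporting the raw run-to-run standard deviation is our assumption about how the spread is computed):

```python
import numpy as np

def summarize_runs(estimates):
    """Aggregate evidence estimates over independent runs.

    `estimates` has shape (n_runs, n_models); each row holds one run's
    normalized evidence estimates (in %) for the candidate models.
    Returns the per-model mean and the run-to-run standard deviation,
    matching the "estimate (std)" format used in the tables.
    """
    e = np.asarray(estimates, dtype=float)
    return e.mean(axis=0), e.std(axis=0, ddof=1)
```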
anticorrelation is weak. This example provides experimental evidence of the importance of evidence evaluation for Bayesian MLPs.

3.3 Example 3: Airline Data. The data set is Series G of Box and Jenkins (1970): the monthly totals (in thousands) of international airline passengers from January 1949 through December 1960, for a total of 144 observations. Faraway and Chatfield (1998) analyzed the data using MLPs with various numbers of hidden units and various input patterns. The "best" models that they identified with the BIC and AICc criteria (Sugiura, 1978; Hurvich & Tsai, 1989) are shown in Table 7. For comparison, CMC was applied to this example. As in Faraway and Chatfield (1998), the first 11 years of observations (132 observations) were used for model building, and $(y_{t-13}, y_{t-2}, y_{t-1})$ was used as the input pattern. In modeling time series data, the input pattern of an MLP can generally be chosen according to the partial autocorrelation graph of the data, as suggested by Chatfield (2001). CMC was run 10 times with $\nu = \eta = 1$, $\tau_\beta = \tau_\gamma = 100$, $K_1 = 2.4 \times 10^5$, $K_{s+1} = 1.1 K_s$, $\delta_1 = 10^{-4}$, $\delta_{s+1} = \delta_s/\sqrt{s+1}$, and $\delta_{\mathrm{end}} = 2 \times 10^{-5}$. Each run costs about 115 seconds of CPU time. The computational results in Table 7 coincide with those obtained with the information criteria (Faraway & Chatfield, 1998): the two largest-evidence models are the best AICc model and the best BIC model, respectively.

4 Evidence Evaluation for Classification MLPs

Let $D = \{(x_{t1}, \ldots, x_{tP}; y_t) : t = 1, \ldots, n\}$ denote the training data, where $y$ is the response variable taking values in a finite set $\{0, 1, \ldots, q-1\}$. Here, $q \ge 2$ denotes the number of classes of the data. Let the data be modeled by a one-hidden-layer MLP with $P$ input units, $H$ hidden units, and $Q$ output units, where $Q = 1$ if $q = 2$ and $Q = q$ otherwise. If $q = 2$, the model can be written as

$$ y_t = \begin{cases} 1 & \text{with probability } p_t,\\ 0 & \text{with probability } 1 - p_t, \end{cases} \tag{4.1} $$
where

$$ p_t = \tilde{\varphi}\Big(\beta_0 + \sum_{i=1}^{H}\beta_i\,\varphi\Big(\gamma_{i0} + \sum_{j=1}^{P} x_{tj}\gamma_{ij}\Big)\Big), \qquad t = 1, \ldots, n, \tag{4.2} $$
where $\beta = (\beta_0, \ldots, \beta_H)$ and $\gamma = \{\gamma_{ij} : i = 1, \ldots, H,\ j = 0, \ldots, P\}$ denote the connection weights as for the regression MLPs, and $\varphi(\cdot)$ and $\tilde{\varphi}(\cdot)$ denote the activation functions for the hidden units and the output unit, respectively. In this section, $\varphi(\cdot)$ and $\tilde{\varphi}(\cdot)$ are both the sigmoid function for all examples. If $q > 2$, we have $y_t = l - 1$ with probability $p_{tl}$ for $1 \le l \le q$, where

$$ p_{tl} = \frac{\exp(z_{tl})}{\sum_{j=1}^{q} \exp(z_{tj})}, \tag{4.3} $$

and

$$ z_{tl} = \beta_{l0} + \sum_{i=1}^{H}\beta_{li}\,\varphi\Big(\gamma_{i0} + \sum_{j=1}^{P} x_{tj}\gamma_{ij}\Big), $$
where $\beta_{li}$ denotes the weight on the connection from the $i$th hidden unit to the $l$th output unit, and $\gamma_{ij}$ denotes the weight on the connection from the $j$th input unit to the $i$th hidden unit as before. For the classification MLPs, we have the following mutually independent priors:

$$ H \sim \mathrm{Unif}\{H_1, \ldots, H_m\}, \qquad \beta_i \sim N\big(0, \sigma_\beta^2\big),\ i = 0, \ldots, H, \qquad \gamma_{ij} \sim N\big(0, \sigma_\gamma^2\big),\ i = 1, \ldots, H,\ j = 0, \ldots, P, \tag{4.4} $$
where $\sigma_\beta^2$ and $\sigma_\gamma^2$ are hyperparameters to be specified by the user. With prior 4.4, the negative log posterior (up to an additive constant) for the case $q = 2$ is

$$ -\log f(\beta, \gamma, H|D) = \mathrm{Const} + \frac{1 + H(P+2)}{2}\log(2\pi) + \frac{H+1}{2}\log\sigma_\beta^2 + \frac{H(P+1)}{2}\log\sigma_\gamma^2 + \frac{1}{2\sigma_\beta^2}\sum_{i=0}^{H}\beta_i^2 + \frac{1}{2\sigma_\gamma^2}\sum_{i=1}^{H}\sum_{j=0}^{P}\gamma_{ij}^2 - \sum_{t=1}^{n} y_t \log p_t - \sum_{t=1}^{n} (1 - y_t)\log(1 - p_t). \tag{4.5} $$
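For readers who want to experiment, the posterior of equation 4.5 can be coded directly. The sketch below is our own illustration, not the paper's implementation; it takes $\beta$ as a vector of length $H+1$ and $\gamma$ as an $H \times (P+1)$ matrix whose first column holds the biases, a layout that is our convention:

```python
import numpy as np

def neg_log_posterior(beta, gamma, X, y, sigma_b2, sigma_g2):
    """Negative log posterior (up to a constant), equation 4.5, for the
    binary classification MLP with sigmoid activations throughout.
    beta: (H+1,) output weights; gamma: (H, P+1) hidden weights with
    biases in column 0; X: (n, P) inputs; y: (n,) labels in {0, 1}."""
    H, Pp1 = gamma.shape
    P = Pp1 - 1
    sig = lambda a: 1.0 / (1.0 + np.exp(-a))
    # forward pass of equation 4.2
    p = sig(beta[0] + sig(gamma[:, 0] + X @ gamma[:, 1:].T) @ beta[1:])
    return (0.5 * (1 + H * (P + 2)) * np.log(2 * np.pi)
            + 0.5 * (H + 1) * np.log(sigma_b2)
            + 0.5 * H * (P + 1) * np.log(sigma_g2)
            + np.sum(beta ** 2) / (2 * sigma_b2)
            + np.sum(gamma ** 2) / (2 * sigma_g2)
            - np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)))
```

This is exactly the energy function that the within-model Metropolis moves of CMC would evaluate for each proposed weight configuration.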
Figure 4: One training data set of example 4, where 0 and 1 denote the respective observations of the two classes.
For the case $q > 2$, we have

$$ -\log f(\beta, \gamma, H|D) = \mathrm{Const} + \frac{1 + H(P+2)}{2}\log(2\pi) + \frac{H+1}{2}\log\sigma_\beta^2 + \frac{H(P+1)}{2}\log\sigma_\gamma^2 + \frac{1}{2\sigma_\beta^2}\sum_{i=0}^{H}\beta_i^2 + \frac{1}{2\sigma_\gamma^2}\sum_{i=1}^{H}\sum_{j=0}^{P}\gamma_{ij}^2 - \sum_{t=1}^{n}\sum_{l=1}^{q} y_{tl} \log p_{tl}, \tag{4.6} $$
where $y_{tl} = 1$ if $y_t = l - 1$ and 0 otherwise. The following examples illustrate the use of CMC for classification MLPs.

4.1 Example 4: Simulated Data. We simulated $y_1, \ldots, y_n$ from equations 4.1 and 4.2 with $P = 2$, $H = 2$, $\gamma_1 = (\gamma_{10}, \gamma_{11}, \gamma_{12}) = (22.12, -17.59, 17.15)$, $\gamma_2 = (\gamma_{20}, \gamma_{21}, \gamma_{22}) = (6.00, 4.49, -3.75)$, $\beta = (\beta_0, \beta_1, \beta_2) = (18.38, -10.66, -10.50)$, $n = 100$, and $x_{ij} \sim \mathrm{Unif}(-2, 2)$ for $i = 1, \ldots, n$ and $j = 1, 2$. Figure 4 shows one data set generated with this setting. To eliminate the effect of the randomness of the data generation process, we generated 10 data sets independently. For each data set, CMC was run once with $\nu = \eta = 1$, $\sigma_\beta^2 = \sigma_\gamma^2 = 100$, $\delta_1 = 0.1$, $\delta_{s+1} = \delta_s/\sqrt{s+1}$, $\delta_{\mathrm{end}} = 10^{-5}$, $K_1 = 10^4$, and $K_{s+1} = 1.5 K_s$. Each run costs about 13.5 minutes of CPU time. Here we took more iterations in each run to obtain accurate estimates of the evidence of the models. In fact, with only 10 seconds of CPU time (about 1% of the CPU time reported above), the correct order of the evidence of the
Table 8: Sensitivity Analysis for Hyperparameters for Example 4.

| Model | Frequency (%), (1,100) | Evidence, (1,20) | Evidence, (1,100) | Evidence, (1,500) | Evidence, (0.1,100) |
| H1    | 25.77 (1.05)           | 0.00 (0.00)      | 0.00 (0.00)       | 0.00 (0.00)       | 0.00 (0.00) |
| H2    | 25.00 (0.31)           | 54.39 (1.20)     | 62.66 (2.05)      | 62.37 (3.08)      | 63.12 (2.36) |
| H3    | 24.74 (0.38)           | 34.03 (0.60)     | 29.53 (1.33)      | 29.46 (2.24)      | 29.07 (1.48) |
| H4    | 24.49 (0.44)           | 11.58 (0.63)     | 7.81 (0.75)       | 8.17 (1.31)       | 7.81 (0.92) |

Notes: The choices of hyperparameters are shown in the column headings, where $(a, b)$ represents $\nu = \eta = a$ and $\sigma_\beta^2 = \sigma_\gamma^2 = b$. The Frequency (%) and Evidence columns show the relative sampling frequencies and the estimates of the evidence of the respective models; the numbers in parentheses are the standard deviations of the preceding estimates.
models has been identified by CMC. The computational results in Table 8 show that the true model is identified by CMC. The nearly equal relative sampling frequencies of the models imply that CMC works well for classification MLPs. To assess the sensitivity of the evidence framework to the hyperparameters, we tried the choices $\nu = \eta = 1$ and $\sigma_\beta^2 = \sigma_\gamma^2 = 20$; $\nu = \eta = 1$ and $\sigma_\beta^2 = \sigma_\gamma^2 = 500$; and $\nu = \eta = 0.1$ and $\sigma_\beta^2 = \sigma_\gamma^2 = 100$. For each choice, CMC was run once for each of the data sets. The computational results are summarized in Table 8. The true model can be identified by CMC with each of the above settings. From the data generation process, we know the setting $\sigma_\beta^2 = \sigma_\gamma^2 = 20$ is inappropriate for this example; even with this setting, the true model can still be identified by CMC. Of course, here the data contain enough information to correct the inappropriate prior.

4.2 Example 5: Ripley Data. This is a synthetic problem from Ripley (1994). It consists of two input features, two classes, and 250 training patterns. For this data set, we considered the models with two to six hidden units, denoted by $H_2, \ldots, H_6$, respectively. In the simulation, we set $\nu = \eta = 1$, $\sigma_\beta^2 = \sigma_\gamma^2 = 100$, $K_1 = 10^4$, $K_{s+1} = 1.5 K_s$, $\delta_1 = 0.1$, $\delta_{s+1} = \delta_s/\sqrt{s+1}$, and $\delta_{\mathrm{end}} = 10^{-5}$. The simulation was repeated 10 times independently. The computational results (see Table 9) show that the highest-evidence model is H3. This is different from the result of Penny and Roberts (1999), where H4 was the largest-evidence model. We note that Penny and Roberts adopted the ARD prior for the connection weights and estimated the model evidence with the gaussian approximation method. We also note that the evidence values reported in Penny and Roberts have large variance, and the evidence of H3 and H4 is not significantly different. Later, CMC was run with $\sigma_\beta^2 = \sigma_\gamma^2 = 20$; with this setting, the largest-evidence model is identified as H4. These results are also reported in Table 9.
Table 9: Model Selection for the Ripley Data and the Irises Data.

| Data     | H1          | H2           | H3           | H4           | H5           | H6          |
| Ripley^a | —           | 6.32 (1.43)  | 40.74 (1.79) | 34.66 (1.48) | 14.44 (1.06) | 3.84 (0.38) |
| Ripley^b | —           | 3.95 (0.88)  | 28.14 (0.98) | 37.15 (0.63) | 22.28 (0.66) | 8.49 (0.33) |
| Irises   | 0.02 (0.01) | 61.41 (0.48) | 29.11 (0.31) | 7.92 (0.15)  | 1.54 (0.04)  | — |

Notes: Hi denotes the model with i hidden units. The estimates of the evidence and their standard deviations (in parentheses) are computed by averaging over 10 independent runs.
a Results with $\sigma_\beta^2 = \sigma_\gamma^2 = 100$.
b Results with $\sigma_\beta^2 = \sigma_\gamma^2 = 20$.
4.3 Example 6: Irises Data. The irises data (Fisher, 1936) are classified into three categories: setosa, versicolor, and virginica. Each category has 50 samples, and each sample possesses four attributes: sepal length, sepal width, petal length, and petal width. A subset of 90 samples was chosen at random for training. For this data set, we considered the models with one to five hidden units. In the simulation, we set $\nu = 1$, $\eta = 1$, $\sigma_\beta^2 = \sigma_\gamma^2 = 100$, $K_1 = 5 \times 10^4$, $K_{s+1} = 1.5 K_s$, $\delta_1 = 0.01$, $\delta_{s+1} = \delta_s/\sqrt{s+1}$, and $\delta_{\mathrm{end}} = 10^{-6}$. The simulation was repeated 10 times independently. The computational results summarized in Table 9 show that the highest-evidence model is H2, whose total number of connection weights (including biases) is 13. This coincides with the result obtained with the method of structural learning with forgetting (Ishikawa, 2000).
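To make the classification model of equations 4.1 to 4.3 concrete, here is a small sketch of the forward pass used in the examples of this section. The code is our own illustration (not the paper's implementation); both activations are sigmoid as stated above, and the $(H, P+1)$ and $(Q, H+1)$ weight layouts with biases in column 0 are our convention:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def class_probs(x, beta, gamma):
    """Class probabilities of the classification MLP for one input x.
    gamma: (H, P+1) hidden-unit weights, column 0 the biases.
    beta:  (Q, H+1) output weights, column 0 the biases; Q = 1 gives the
    binary model of eq. 4.2, Q = q > 2 the softmax of eq. 4.3."""
    h = sigmoid(gamma[:, 0] + gamma[:, 1:] @ x)   # hidden activations
    z = beta[:, 0] + beta[:, 1:] @ h              # output activations
    if z.shape[0] == 1:                           # binary case, eq. 4.2
        p1 = sigmoid(z[0])
        return np.array([1.0 - p1, p1])
    e = np.exp(z - z.max())                       # stabilized softmax, eq. 4.3
    return e / e.sum()
```

Subtracting `z.max()` before exponentiating leaves the softmax of equation 4.3 unchanged while avoiding overflow for large activations.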
5 Discussion

In summary, we have provided a new method for evidence evaluation for Bayesian neural networks. The simulated examples show that the true model can be identified by the method given an appropriate prior setting and enough training data. We also assessed the sensitivity of the evidence framework to the hyperparameters; the numerical results show that the evidence framework is rather robust to the choice of hyperparameters. Since many authors, including MacKay (1992b), Thodberg (1995), and Penny and Roberts (1999), have reported on the empirical correlation between model evidence and generalization error, we skip this issue in this article. It is now widely believed that there is a weak (anti)correlation between evidence and generalization error (MacKay, 1992b); for more discussion of the issue, refer to Bishop (1995). In simulation from equation 1.4, if there are too many models under consideration, we may evaluate them part by part. For example, we can first work on the models $H_1, \ldots, H_{m_1}$ and then work on the models $H_{m_1}, H_{m_1+1}, \ldots, H_m$.
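A minimal sketch of this part-by-part strategy is given below. It is our own illustration, not part of CMC itself: the dictionary bookkeeping and model names are hypothetical, and rescaling the second group through the shared model is our reading of how the adjustment works:

```python
def stitch_evidence(group1, group2, overlap):
    """Combine evidence estimates from two overlapping CMC runs.

    group1, group2: dicts mapping model name -> (unnormalized) evidence
    estimate from each run; both must contain the overlap model.  The
    second group is rescaled so the two runs agree on the overlap model,
    then everything is renormalized to relative evidences summing to 1.
    """
    scale = group1[overlap] / group2[overlap]
    merged = dict(group1)
    for model, value in group2.items():
        if model not in merged:
            merged[model] = value * scale
    total = sum(merged.values())
    return {model: value / total for model, value in merged.items()}
```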
The ratios of the evidence of all the models can then be adjusted according to the overlap model $H_{m_1}$. This method is extremely useful when some models differ significantly from others in structure; in this case, we can group the models with similar structures together to accelerate the simulation process. Note that CMC can be used to simultaneously estimate the evidence only of multiple models whose parameter spaces have reasonable overlap, for example, a sequence of nested models. It cannot compare the evidence of a sequence of completely different models. In this article, we considered only the problem of selecting the number of hidden units for MLPs. Extensions to other structure selection problems are immediate, say, the selection of input patterns, the selection of noise models, or outlier detection. In this article, the model is trained (within-model moves) with the Metropolis-Hastings algorithm, and the model transition (between-model moves) is guided by the CMC rule. In the following, we present one extension of CMC in which both the within- and between-model moves are guided by the CMC rule. Suppose that the sample space $\mathcal{X}$ is partitioned according to the model and the energy function jointly,

$$ \mathcal{X} = \bigcup_{i=1}^{m}\bigcup_{j=1}^{v_i} E_{ij}, $$

where $\bigcup_{j=1}^{v_i} E_{ij} = \mathcal{X}_i$ is a partition of $\mathcal{X}_i$ made according to the energy function. Without loss of generality, we assume that $E_{i,1}, \ldots, E_{i,v_i}$ are arranged in ascending order of energy; that is, for any $w \in E_{i,j_1}$ and $w' \in E_{i,j_2}$, if $j_1 < j_2$ then $U_i(w) < U_i(w')$, where $U_i(\cdot)$ denotes the energy function of the model $H_i$. As $\delta_s \to 0$ and $K_s \to \infty$, Theorem 1 implies that

$$ R_{i_1,i_2} = \frac{\sum_{j=1}^{v_{i_1}} \hat{g}(E_{i_1,j})}{\sum_{j=1}^{v_{i_2}} \hat{g}(E_{i_2,j})} \tag{5.1} $$
forms a consistent estimator of the ratio of the evidence of the models $H_{i_1}$ and $H_{i_2}$. The CMC algorithm can easily overcome high energy barriers of the energy landscape, and the above algorithm takes advantage of this in both network training and structure selection. It should be useful for large or difficult networks, for which the conventional Metropolis-Hastings algorithm cannot sample efficiently from the posterior distribution.

Acknowledgments

I thank the editor, the associate editor, and two anonymous referees for their constructive comments and suggestions, which led to significant improvement of this article.
References

Berg, B. A., & Neuhaus, T. (1991). Multicanonical algorithms for 1st order phase transitions. Physics Letters B, 267, 249–253.
Bishop, C. M. (1995). Neural networks for pattern recognition. Oxford: Oxford University Press.
Box, G. E. P., & Jenkins, G. M. (1970). Time series analysis, forecasting and control. San Francisco: Holden-Day.
Casella, G., & Berger, R. L. (2002). Statistical inference (2nd ed.). Pacific Grove, CA: Duxbury/Thomson Learning.
Chatfield, C. (2001). Time-series forecasting. London: Chapman & Hall.
Denison, D., Holmes, C., Mallick, B., & Smith, A. F. M. (2002). Bayesian methods for nonlinear classification and regression. New York: Wiley.
Faraway, J., & Chatfield, C. (1998). Time series forecasting with neural networks: A comparative study using the airline data. Applied Statistics, 47, 231–250.
Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7, 179–188.
Gelman, A., & Meng, X. L. (1998). Simulating normalizing constants: From importance sampling to bridge sampling to path sampling. Statistical Science, 13, 163–185.
Geman, S., & Geman, D. (1984). Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6, 721–741.
Green, P. J. (1995). Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika, 82, 711–732.
Grenander, U., & Miller, M. (1994). Representations of knowledge in complex systems (with discussion). Journal of the Royal Statistical Society B, 56, 549–603.
Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57, 97–109.
Hurvich, C. M., & Tsai, C. L. (1989). Regression and time series model selection in small samples. Biometrika, 76, 297–307.
Ishikawa, M. (2000). Structural learning and rule discovery. In I. Cloete & J. M. Zurada (Eds.), Knowledge-based neurocomputing (pp. 153–206). Cambridge, MA: MIT Press.
Kass, R. E., & Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90, 773–795.
Kutner, M. H., Nachtsheim, C. J., & Neter, J. (2004). Applied linear regression models (4th ed.). New York: McGraw-Hill.
Liang, F. (2003). Contour Monte Carlo: Theoretical results, practical considerations, and applications (Tech. Rep.). College Station: Department of Statistics, Texas A&M University.
Liang, F. (2004). Annealing contour Monte Carlo for structure optimization in an off-lattice protein model. Journal of Chemical Physics, 120, 6756–6763.
Liu, J. S. (2001). Monte Carlo strategies in scientific computing. New York: Springer.
MacKay, D. J. C. (1992a). The evidence framework applied to classification problems. Neural Computation, 4, 720–736.
MacKay, D. J. C. (1992b). A practical Bayesian framework for back-propagation networks. Neural Computation, 4, 448–472.
MacKay, D. J. C. (1995). Bayesian non-linear modeling for the 1993 energy prediction competition. In G. Heidbreder (Ed.), Maximum entropy and Bayesian methods, Santa Barbara 1993. Dordrecht: Kluwer.
Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., & Teller, E. (1953). Equation of state calculations by fast computing machines. Journal of Chemical Physics, 21, 1087–1091.
Müller, P., & Insua, D. R. (1998). Issues in Bayesian analysis of neural network models. Neural Computation, 10, 749–770.
Neal, R. M. (1993). Probabilistic inference using Markov chain Monte Carlo methods (Tech. Rep. No. CRG-TR-93-1). Toronto: Department of Computer Science, University of Toronto.
Neal, R. M. (1996). Bayesian learning for neural networks. New York: Springer.
Penny, W. D., & Roberts, S. J. (1999). Bayesian neural networks for classification: How useful is the evidence framework? Neural Networks, 12, 877–892.
Perrone, M. P., & Cooper, L. N. (1993). When networks disagree: Ensemble methods for hybrid neural networks. In R. J. Mammone (Ed.), Artificial neural networks for speech and vision (pp. 126–142). London: Chapman & Hall.
Phillips, D. B., & Smith, A. F. M. (1996). Bayesian model comparison via jump diffusions. In W. R. Gilks, S. T. Richardson, & D. J. Spiegelhalter (Eds.), Markov chain Monte Carlo in practice (pp. 215–240). London: Chapman & Hall.
Ripley, B. D. (1994). Neural networks and related methods for classification. Journal of the Royal Statistical Society B, 56, 409–456.
Sugiura, N. (1978). Further analysis of the data by Akaike's information criterion and the finite corrections. Communications in Statistics—Theory and Methods, 7, 13–26.
Thodberg, H. H. (1995). A review of Bayesian neural networks with an application to near infrared spectroscopy. IEEE Transactions on Neural Networks, 7, 56–72.
Walker, A. M. (1969). On the asymptotic behavior of posterior distributions. Journal of the Royal Statistical Society B, 31, 80–88.
Wang, F., & Landau, D. P. (2001). Efficient, multiple-range random walk algorithm to calculate the density of states. Physical Review Letters, 86, 2050–2053.
Received January 7, 2004; accepted November 1, 2004.
LETTER
Communicated by Klaus Obermayer
An Ensemble of Cooperative Extended Kohonen Maps for Complex Robot Motion Tasks Kian Hsiang Low
[email protected] Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA 15213-3890, U.S.A.
Wee Kheng Leow
[email protected] Department of Computer Science, National University of Singapore, Singapore 117543, Singapore
Marcelo H. Ang, Jr.
[email protected] Department of Mechanical Engineering, National University of Singapore, Singapore 119260, Singapore
Self-organizing feature maps such as extended Kohonen maps (EKMs) have been very successful at learning sensorimotor control for mobile robot tasks. This letter presents a new ensemble approach, cooperative EKMs with indirect mapping, to achieve complex robot motion. An indirect-mapping EKM self-organizes to map from the sensory input space to the motor control space indirectly via a control parameter space. Quantitative evaluation reveals that indirect mapping can provide finer, smoother, and more efficient motion control than does direct mapping by operating in a continuous, rather than discrete, motor control space. It is also shown to outperform basis function neural networks. Furthermore, training its control parameters with recursive least squares enables faster convergence and better performance compared to gradient descent. The cooperation and competition of multiple self-organized EKMs allow a nonholonomic mobile robot to negotiate unforeseen, concave, closely spaced, and dynamic obstacles. Qualitative and quantitative comparisons with neural network ensembles employing weighted sum reveal that our method can achieve more sophisticated motion tasks even though the weighted-sum ensemble approach also operates in continuous motor control space.

1 Introduction

Goal-directed, collision-free motion in a complex, dynamic, and unpredictable environment is an important task for an autonomous mobile robot

Neural Computation 17, 1411–1445 (2005)
© 2005 Massachusetts Institute of Technology
1412
K. Low, W. Leow, and M. Ang, Jr.
operating alone or in a team. In particular, this task is widely employed in service and field robotics (Shastri, 1999), which includes sewer inspection (Hertzberg, Christaller, Kirchner, Licht, & Rome, 1998), cleaning and housekeeping (Fiorini, Kawamura, & Prassler, 2000), surveillance (Rybski, Stoeter, Gini, Hougen, & Papanikolopoulos, 2002), sensor network coverage (Howard, Matarić, & Sukhatme, 2002), search and rescue (Davids, 2002), and tour guides (Burgard et al., 1999). All these applications require the mobile robot to perform target- or goal-reaching movements while avoiding undesirable and potentially dangerous impact with obstacles or other robots on its team. The robot motion control problem can be stated succinctly as follows: Given an initial state described by the sensory input vector u(0) in the sensory input space U, determine a collision-free sequence of motor control vectors c(t), t = 0, . . . , T − 1, in the motor control space C that moves the robot toward a desired goal state described by u(T) ∈ U. Three general classes of algorithms have been investigated for learning sensorimotor control, which is required for this task: multivariate regression, reinforcement learning, and feature mapping. The first approach formulates the problem as a nonlinear multivariate regression problem and trains a multilayer perceptron (MLP) to perform continuous mapping from U to C (Pomerleau, 1991; Sharkey, 1998; Tani & Fukumura, 1994). It offers good generalization capability. However, prior to training the network, training samples have to be collected for every time step t = 0, . . . , T − 1 to define the quantitative error signals. This sample collection process can be very difficult and tedious, if not impossible, for a mobile robot.
The reinforcement learning approach (Kaelbling, Littman, & Moore, 1996; Sutton, 1998) circumvents the above difficulty by providing a qualitative success or failure feedback only at the end of executing the motor control sequence. It estimates how well each previously executed motor control vector c(t) contributes to the overall success or failure of achieving the desired goal and modifies the algorithm accordingly. The training process tends to converge slowly due to sparse reinforcements and imprecise estimates of each motor control vector's contribution. The third approach uses a self-organizing feature map (SOFM) (Kohonen, 2000) such as the extended Kohonen map (EKM) (Ritter & Schulten, 1986) that self-organizes to partition the continuous input (or output) space into localized regions. The generalization capability of the feature map arises from its self-organization during training such that each neuron is trained to map a localized sensory region to a desired motor control output. As compared to predefined, uniform partitioning of the feature space (Kuperstein, 1991; Schaal & Atkeson, 1998; Zalama, Gaudiano, & Coronado, 1995), self-organization may lead to better performance and learning efficiency because more neural resources are automatically allocated to frequently encountered sensory regions during learning (Martinetz, Ritter, & Schulten, 1990; Santamaría, Sutton, & Ram, 1998). This approach increases the resolution of the sensory representation in the frequently
Ensemble of Cooperative EKMs for Complex Robot Motion Tasks
1413
encountered regions. Such behavior is reminiscent of biological sensorimotor systems, where frequently practiced movements become more fluid and accurate. This article describes a new feature map approach to learning sensorimotor control: cooperative EKMs with indirect mapping. An indirect-mapping EKM (Low, Leow, & Ang, 2002) differs from existing direct-mapping methods (Cameron, Grossberg, & Guenther, 1998; Heikkonen & Koikkalainen, 1997; Rao & Fuentes, 1998; Ritter & Schulten, 1986; Smith, 2002; Touzet, 1997; Versino & Gambardella, 1995) in two ways:

1. Direct-mapping methods map a sensory input directly to a motor control command. In contrast, our indirect-mapping approach maps a sensory input indirectly to a motor control command through control parameters.

2. As a consequence, the indirect-mapping approach maps the continuous sensory input space to a continuous motor control space (see section 3.1 for detailed discussion). On the other hand, direct-mapping methods map the continuous sensory input space to discrete motor control commands.

The motor control space is often discretized into a set of commands to be used by reinforcement learning algorithms (Millán, Posenato, & Dedieu, 2002; Santamaría et al., 1998; Smith, 2002; Touzet, 1997), committee machines with voting schemes (Battiti & Colla, 1994; Hansen & Salamon, 1990; Kittler, Hatef, Duin, & Matas, 1998; Sharkey & Sharkey, 1997), and robot action selection mechanisms (Decugis & Ferber, 1998; Huntsberger & Rose, 1998; Maes, 1995; Rosenblatt, 1997). However, recent autonomous agent research in dynamical systems theory (Beer, 1995; Port & van Gelder, 1995) and reinforcement learning (Millán et al., 2002; Smart & Kaelbling, 2000) advocates operating in continuous motor control space, which enables our indirect-mapping method to provide finer, smoother, and more efficient motion control than does direct mapping (see section 4.1).
Such a high degree of smoothness, flexibility, and precision in motion control is essential for efficiently executing complex tasks and interacting with humans. It is well understood how a SOFM or EKM is used for learning sensorimotor control (Littmann & Ritter, 1996; Walter & Schulten, 1993). However, the nontrivial problem of combining multiple SOFMs or EKMs for sophisticated control (e.g., negotiation of unforeseen complex obstacles and cooperative multirobot tracking of moving targets; Low, Leow, & Ang, 2003) is not well studied. If solved poorly, the control outputs produced while performing a complex motion task may be unexpected or undesirable. For example, a widely used ensemble technique for motion control that combines neural network outputs via weighted sum (e.g., ensemble averaging and mixture of experts; Hashem, 1997; Haykin, 1999; Jacobs, 1995) causes the robot to be trapped easily by unforeseen, complex obstacles. This navigation issue is
1414
K. Low, W. Leow, and M. Ang, Jr.
central to the robotics community as it is often encountered during robot motion in a real-world environment (Kim & Khosla, 1992; Koren & Borenstein, 1991; Rimon & Koditschek, 1992). Note that such a problem will arise even when SOFMs or EKMs are utilized in the weighted-sum ensemble (see section 4.2). To solve this robot motion problem, we propose a new ensemble approach called cooperative EKMs. The cooperation and competition of multiple EKMs (Low, Leow, & Ang, 2003) that self-organize in the same manner can enable a nonholonomic mobile robot to negotiate unforeseen, concave, and closely spaced obstacles (see section 4.2). In contrast, a robot controlled by a weighted-sum ensemble (Low, Leow, & Ang, 2002) may fail in such tasks even though the networks also use continuous motor control space (see section 4.2). Before proceeding to the details of cooperative EKMs, we will discuss some related work.

2 Related Work

In an MLP, all the training data are used to fit a single global model or representation. Therefore, during learning, all the network weights are susceptible to negative interference that may arise due to dynamically changing data distributions (Schaal & Atkeson, 1998). On the other hand, an EKM fits localized regions of data rather than the entire region of interest into local models or representations, thus localizing the effects of interference. Consequently, learning a single new training datum affects fewer network weights in an EKM than in an MLP (Atkeson, Moore, & Schaal, 1997; Martinetz et al., 1990). The cost of training an EKM is kept small by imposing a topology among the neurons such that each learning step involves a subset of neighboring neurons. Initially, the subsets are chosen large, resulting in rapid learning of the coarse sensorimotor mapping. As learning progresses, the size of the subsets is gradually reduced to refine the mapping more and more locally.
This strategy allows computationally efficient and accurate training of many neurons and facilitates scaling up the EKM to a larger number of neurons for improved accuracy. An EKM also uses a smaller proportion of network weights for motor control prediction and has been reported to achieve more precise robot positioning than an MLP (Gorinevsky & Connolly, 1994; Jansen, van der Smagt, & Groen, 1995; van der Smagt, Groen, & van het Groenewoud, 1994). However, like other local model networks, an EKM suffers from the curse of dimensionality: the proportion of training data lying within a fixed-radius neighborhood of a point decreases exponentially with an increasing number of dimensions of the input space. The basis function network (BFN) is another type of local model network. However, it differs architecturally from the EKM in that each incoming sensory input is reduced to activation strengths by basis functions, and these activation strengths are linearly mapped to a corresponding control output. To do so, the output weights of all BFN neurons are required in predicting the
Ensemble of Cooperative EKMs for Complex Robot Motion Tasks
target-reaching motion. In contrast, the EKM uses only the winning neuron's output weights to map each sensory input to a control output (see section 3.1). During network learning, a BFN updates all its output weights with each training datum, whereas the EKM updates only the output weights of the winning neuron and its neighbors (see section 3.5). As such, a BFN may experience much more interference during online learning than an EKM would. Our indirect-mapping EKM resembles the EKM models of Littmann and Ritter (1996) and Walter and Schulten (1993), which utilize locally linear mappings. In their models, each neuron stores both the motor control vector and the matrix of motor control parameters as output weights (see equation 3.1). In contrast, each neuron in the indirect-mapping EKM stores only the matrix of motor control parameters. In the context of our article, their EKM models and the indirect-mapping EKM use, respectively, 1800 and 1350 parameters in a network of 15 × 15 neurons (see Table 1). The extra parameters employed by their models are not necessary for achieving good target-reaching performance, as demonstrated by the indirect-mapping EKM in this letter (see section 4.1). Furthermore, we have shown that training the control parameters with recursive least squares enables faster convergence and better performance compared to gradient descent. Their EKM models (Littmann & Ritter, 1996; Walter & Schulten, 1993) used only gradient descent to learn the control parameters. It is typical for a robot to require a very large number of training data for accurate sensorimotor learning. The collection of these data for off-line training is a very difficult and tedious, if not impossible, task. Therefore, online learning is preferred over off-line learning to eliminate the need to store these data and to avoid running complex batch training algorithms such as support vector machines (Schaal, Atkeson, & Vijayakumar, 2002).
Hence, our focus in this article is on online learning. Walter and Ritter (1996) have proposed an ensemble of multiple SOFMs called hierarchical PSOM (parameterized self-organizing maps) for learning sensorimotor control of a robot arm under different system contexts. PSOM is a variant of SOFM that can learn from a small set of training data and still achieve good accuracy. To do so, PSOM performs interpolation on a continuous mapping manifold using a small set of predefined, data-independent basis functions and weight vectors constructed directly from the data samples. To ensure good interpolation, the data samples have to be topologically ordered to form the weight vectors. Furthermore, the minimization of the distance function in the SOFM algorithm (see equation 3.2) turns into a continuous search problem for PSOM due to its continuous manifold. In hierarchical PSOM, each PSOM is trained separately, resulting in different weight values for different PSOMs. In contrast, in our cooperative EKMs, all EKMs are trained simultaneously to obtain the same input weight values (see section 3.5). Moreover, training of our EKMs is performed online rather than off-line, as is the case for PSOM.
In the absence of precise quantitative error signals for training, reinforcement learning algorithms can be used if qualitative feedback signals are available. Nevertheless, they suffer from problems of generalization and continuity. Many reinforcement learning methods encode discrete sensory states and motor commands (Dietterich, 2000; Mahadevan & Connell, 1992; Rohanimanesh & Mahadevan, 2003), which cannot be applied directly to the continuous sensorimotor domains of real-world control tasks. A priori discretization of the continuous space, if done poorly, may introduce hidden states and weak generalization. This limitation can be overcome by combining reinforcement learning with function approximators (e.g., MLP or feature map) that are capable of generalizing across continuous sensory input and motor control output spaces (Baird, 1995; Gross, Stephan, & Krabbes, 1998; Millán et al., 2002; Santamaría et al., 1998; Smart & Kaelbling, 2000; Smith, 2002; Touzet, 1997). However, generalizing with function approximators does not guarantee that the algorithms will learn to produce continuous motor commands that vary smoothly and accurately in response to continuous changes in sensory state. In effect, some reinforcement learning algorithms (Millán et al., 2002; Santamaría et al., 1998; Smith, 2002; Touzet, 1997) that are combined with function approximators map from a continuous sensory input space to discrete motor control commands, which is exactly what the direct-mapping EKM does (see section 3.1). The drawbacks of such a continuity problem will be demonstrated in section 4.1. To resolve this problem, some methods (Baird, 1995; Gross et al., 1998; Smart & Kaelbling, 2000) map to a continuous motor control space but are burdened by a very slow iterative search for the optimal action.
3 Ensemble of Cooperative EKMs

3.1 Overview. An EKM is a neural network that extends Kohonen's (2000) SOFM. Its self-organization of the input space is similar to a Voronoi tessellation such that each tessellated region is encoded by the input weights of an EKM neuron. In addition to encoding a set of input weights that self-organize the sensory input space, the EKM neurons also produce outputs that vary with the incoming sensed inputs. The EKMs described in this article adopt an egocentric representation of the sensory input vector u(t) = (α, d)^T, where α and d are the direction and the distance of a target location relative to the robot's current location and heading. At the goal state at time T, u(T) = (α, 0)^T for any α. In many proposed EKMs (Cameron et al., 1998; Heikkonen & Koikkalainen, 1997; Rao & Fuentes, 1998; Ritter & Schulten, 1986; Smith, 2002; Touzet, 1997; Versino & Gambardella, 1995), each sensory input u is mapped directly to a motor control command c. In such a direct-mapping EKM (see Figure 1A), each neuron i has a sensory weight vector wi = (αi, di)^T that encodes a tessellated region in U centered at wi. It also
Figure 1: EKM architectures. (A) Neurons of a direct-mapping EKM map the sensory input space U directly to discretized points in the motor control space C. (B) Neurons of an indirect-mapping EKM map the sensory input space U indirectly to the continuous motor control space C through the control parameter space M. It resembles the EKM model of Walter and Schulten (1993), which stores both motor control vectors and matrices of control parameters as output weights. The indirect-mapping EKM stores only matrices of control parameters as output weights.
has a weight vector ci that encodes the motor control outputs produced by the neuron. With an incoming sensory input u, the winning neuron s is determined such that its sensory weight vector ws is nearest to u (see equation 3.2). This winning neuron s outputs its motor control vector cs to move the robot (see Figure 1A). Note that any incoming sensory input u that lies within the tessellated region encoded by ws will produce the same motor control vector cs. If sensorimotor control were a linear problem, then the motor control vector c would be related to the sensory input vector u by the linear equation

c = Mu,
(3.1)
where M is a matrix of motor control parameters. The control problem would then be reduced to one of determining M from the training samples. In practice, however, sensorimotor coordination is typically a nonlinear problem because a real motor takes a finite, nonzero amount of time to accelerate or decelerate in order to change speed. This problem is exacerbated in nonholonomic robots. A nonholonomic robot is restricted in the way it can move by kinematic or dynamic constraints, such as limited turning abilities or momentum at high velocities (e.g., a car) (Arkin, 1998). Hence, a nonholonomic robot is much harder to control, and smooth motion is much harder to achieve, than with a holonomic robot (Russell & Norvig, 1995). To solve the nonlinear problem, our indirect-mapping EKM (Low, Leow, & Ang, 2002) is trained to partition the sensory input space U into locally linear regions. Each neuron i in the EKM has a sensory weight vector wi similar to that of a neuron in the direct-mapping EKM. However, unlike the direct-mapping approach, the output weights of neuron i represent control parameters Mi in the parameter space M (see Figure 1B) instead of the motor
Figure 2: Framework of cooperative EKMs.
control vector ci. The control parameter matrix Mi is mapped to the actual motor control vector c by the linear model of equation 3.1. To elaborate, the direct-mapping approach maps all the sensory inputs u in a tessellated region of the sensory input space U, represented by a neuron s, to the same discrete point cs in the motor output space C, that is, c = cs. Thus, only a small number of points in C are represented by the neurons' outputs (i.e., C is very sparsely sampled). In contrast, our indirect-mapping approach maps each u in a local region of U to a different point c in C through equation 3.1. Since this mapping is linear and continuous, the indirect-mapping approach maps a region in U to a region in C (see Figure 1B). This method permits finer, smoother, and more efficient sensorimotor control of the robot's target-reaching motion than the direct-mapping approach (see section 4.1).

Cooperative EKMs (Low, Leow, & Ang, 2003) are implemented by connecting an ensemble of EKMs into three modules: target reaching, obstacle avoidance, and neural integration (see Figure 2). The target localization EKM in the target-reaching module is activated by the presence of a target within the robot's target-sensing range. The EKM receives a sensed target location and outputs corresponding excitatory signals to the motor control EKM in the neural integration module at and around the locations of the sensed target. The obstacle localization EKMs in the obstacle avoidance module are activated by the presence of obstacles within the robot's obstacle-sensing range. Each EKM receives a sensed obstacle location and outputs corresponding inhibitory signals to the motor control EKM in the neural integration module at and around the locations of the sensed obstacles. The motor control EKM in the neural integration module serves as the sensorimotor interface, which integrates the activity signals from the EKMs
for cooperation and competition to produce an appropriate motor signal to the actuators. This motor signal allows a robot to approach a target and negotiate obstacles. The cooperative EKMs' framework allows the modules to operate asynchronously at different rates, which is the key to preserving reactive capabilities. For example, the target-reaching module operates at about 256 ms between servo ticks, while the obstacle avoidance module can typically operate faster, at intervals of 128 ms. The neural integration module is activated as and when neural activities are received.

3.2 Target Reaching. The target-reaching module uses the target localization EKM to self-organize the sensory input space U. Each neuron i in the EKM has a sensory weight vector wi = (αi, di)^T that encodes a region in U centered at wi. Based on each incoming sensory input u of the target location, the target localization EKM outputs excitatory signals to the motor control EKM in the neural integration module (see section 3.4).

3.2.1 Target Localization. The target localization EKM is activated as follows. Given a sensory input u of a target location:

1. Determine the winning neuron s in the target localization EKM. The winning neuron s is the one whose sensory weight vector ws = (αs, ds)^T is nearest to the input u = (α, d)^T:

D(u, ws) = min_{i∈A(α)} D(u, wi).   (3.2)
The difference D(u, wi) is a weighted difference between u and wi,

D(u, wi) = βα(α − αi)^2 + βd(d − di)^2,   (3.3)
where βα and βd are constant parameters. The minimum in equation 3.2 is taken over the set A(α) of neurons encoding angles very similar to α:

|α − αi| ≤ |α − αj|,   for each pair i ∈ A(α), j ∉ A(α).   (3.4)
In other words, direction has priority over distance in the competition between EKM neurons. This method allows the robot to quickly orient itself to face the target while moving toward it. An EKM contains a limited set of neurons, each of which has a sensory weight vector wi that encodes a point in the sensory input space U. The region in U that encloses all the sensory weight vectors of these neurons is called the local workspace U′. Even if the target falls outside U′, the nearest neuron can still be activated (see Figure 3A).

2. Compute the output activity ai of neuron i in the target localization EKM:

ai = Ga(ws, wi).   (3.5)
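The angle-priority winner search of equations 3.2 to 3.4 can be sketched as follows. This is a minimal NumPy illustration, not the authors' code; the weighting constants βα and βd are given illustrative values.

```python
import numpy as np

def winning_neuron(u, W, beta_alpha=1.0, beta_d=1.0):
    """Angle-priority winner search (eqs. 3.2-3.4, illustrative sketch).

    u : (2,) sensory input (alpha, d)
    W : (N, 2) sensory weight vectors w_i = (alpha_i, d_i)
    beta_alpha, beta_d : weighting constants (illustrative values)
    """
    alpha, d = u
    ang_err = np.abs(alpha - W[:, 0])
    # A(alpha): the neurons whose encoded angles are closest to alpha (eq. 3.4)
    A = np.flatnonzero(ang_err == ang_err.min())
    # weighted difference D(u, w_i), minimized over A(alpha) only (eqs. 3.2-3.3)
    D = beta_alpha * (alpha - W[A, 0])**2 + beta_d * (d - W[A, 1])**2
    return int(A[np.argmin(D)])
```

Restricting the minimization to A(α) is what gives direction priority over distance: a neuron with a worse distance match but a better angle match always beats one outside A(α).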
Figure 3: Conceptual description of cooperative EKMs. (A) In response to the target ⊕, the nearest neuron (black dot) in the target localization EKM (ellipse) of the robot (gray circle) is activated. (B) The activated neuron produces a target field (dotted region) in the motor control EKM. (C) Three of the robot's sensors detect obstacles and activate three neurons (crosses) in the obstacle localization EKMs, which produce the obstacle fields (dashed ellipses). (D) Subtracting the obstacle fields from the target field causes another neuron, outside the obstacle fields, to become the winner in the motor control EKM, which moves the robot away from the obstacle.
The function Ga is an elongated gaussian:

Ga(ws, wi) = exp(−(αs − αi)^2 / (2σaα^2) − (ds − di)^2 / (2σad^2)).   (3.6)
Parameter σad is much smaller than σaα, making the gaussian distance-sensitive and angle-insensitive. These parameter values elongate the gaussian along the direction perpendicular to the target direction αs (see Figure 3B). This elongated gaussian is the target field, which plays an important role in overcoming concave obstacles. The effects of these parameters on the robot's target-reaching capabilities will be examined in section 4.2. The output activities of the neurons in the target localization EKM are aggregated in the motor control EKM to produce a motion that moves the robot toward the target. This will be explained in section 3.4. In the next section, we present the obstacle localization EKMs, which are activated in a similar manner as the target localization EKM.

3.3 Obstacle Avoidance. The obstacle avoidance module uses obstacle localization EKMs. The robot has h directed distance sensors around its body for detecting obstacles. Hence, each activated sensor encodes a fixed direction αj and a variable distance dj of the obstacle relative to the robot's
heading and location. Each sensor's input uj = (αj, dj)^T induces an obstacle localization EKM. Note that each distance sensor (e.g., laser) can reflect only the nearest obstacle in its sensing direction. Hence, the number of obstacle localization EKMs that are activated depends not on the number of obstacles but on the number of distance sensors. The obstacle localization EKMs have the same number of neurons and input weight values as the target localization EKM; that is, each neuron i in an obstacle localization EKM has the same input weight vector wi as neuron i in the target localization EKM. The EKMs output inhibitory signals to the motor control EKM in the neural integration module (see section 3.4).

3.3.1 Obstacle Localization. The obstacle localization EKMs are activated as follows. For each sensory input uj, j = 1, . . . , h (i.e., h distance sensors):

1. Determine the winning neuron s in the jth obstacle localization EKM. The obstacle localization EKM is activated in the same manner as step 1 of target localization (see section 3.2).

2. Compute the output activity bi of neuron i in the jth obstacle localization EKM:

bi = Gb(ws, wi),   (3.7)
where

Gb(ws, wi) = exp(−(αs − αi)^2 / (2σbα^2) − (ds − di)^2 / (2σbd^2(ds, di))),

σbd(ds, di) = 2.475 if di ≥ ds, and 0.02475 otherwise.   (3.8)
The function Gb is a gaussian stretched along the obstacle direction αs so that motor control EKM neurons beyond the obstacle locations are also inhibited to indicate inaccessibility (see Figure 3C). If no obstacle is detected, Gb = 0. In the presence of an obstacle, the neurons in the obstacle localization EKMs at and near the obstacle locations will be activated to produce obstacle fields. The neurons nearest to the obstacle locations have the strongest activities. The effects of the parameters σbd and σbα on the robot's obstacle avoidance capabilities will be investigated in section 4.2.

3.4 Neural Integration and Motor Control. The neural integration module uses a motor control EKM to integrate the activities from the neurons in the target and obstacle localization EKMs. The motor control EKM has the same number of neurons and input weight values as the target and obstacle localization EKMs.
3.4.1 Neural Integration. The neural integration is performed as follows:

1. Compute the activity ei of neuron i in the motor control EKM,

ei = ai − Σ_{j=1}^{h} bji,   (3.9)
where ai is the excitatory input from neuron i of the target localization EKM (see section 3.2) and bji is the inhibitory input from neuron i of the jth obstacle localization EKM (see section 3.3).

2. Determine the winning neuron k in the motor control EKM. Neuron k is the one with the largest activity:

ek = max_i ei.   (3.10)
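The pipeline from the localization fields to the winner of the motor control EKM (equations 3.6 to 3.10) can be sketched as follows. This is a minimal NumPy illustration, not the authors' code: the values of σaα, σad, and σbα are illustrative assumptions; only the two σbd values come from equation 3.8.

```python
import numpy as np

def target_field(ws, W, sa_alpha=1.5, sa_d=0.05):
    """Elongated gaussian target field a_i = G_a(w_s, w_i) (eq. 3.6).
    sa_d << sa_alpha makes the field distance-sensitive and
    angle-insensitive (parameter values here are illustrative)."""
    return np.exp(-(ws[0] - W[:, 0])**2 / (2 * sa_alpha**2)
                  - (ws[1] - W[:, 1])**2 / (2 * sa_d**2))

def obstacle_field(ws, W, sb_alpha=0.3):
    """Obstacle field b_i = G_b(w_s, w_i) (eqs. 3.7-3.8): stretched along
    the obstacle direction, so neurons beyond the obstacle (d_i >= d_s)
    are also inhibited."""
    sb_d = np.where(W[:, 1] >= ws[1], 2.475, 0.02475)
    return np.exp(-(ws[0] - W[:, 0])**2 / (2 * sb_alpha**2)
                  - (ws[1] - W[:, 1])**2 / (2 * sb_d**2))

def integrate(a, B):
    """Neural integration (eqs. 3.9-3.10): subtract the h obstacle
    fields from the target field and pick the most active neuron k."""
    e = a - B.sum(axis=0)
    return int(np.argmax(e)), e
```

When an obstacle field overlaps the target location, the subtraction in `integrate` suppresses the neurons near the target, so the winner k shifts to an uninhibited neuron, exactly the behavior illustrated in Figure 3D.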
3.4.2 Motor Control. The motor control EKM also has a set of output weights, which encode the outputs produced by each neuron. However, unlike existing direct-mapping methods (Cameron et al., 1998; Heikkonen & Koikkalainen, 1997; Rao & Fuentes, 1998; Ritter & Schulten, 1986; Smith, 2002; Touzet, 1997; Versino & Gambardella, 1995), the output weights of neuron i of the motor control EKM represent control parameters Mi in the parameter space M instead of the actual motor control vector (see Figure 1). The control parameter matrix Mi is mapped to the actual motor control vector c by a linear model (see equation 3.11). With the indirect-mapping EKM, motor control is performed by computing the motor control vector c,

c = Mk u   if |Mk u| ≤ c* and k = s,
c = Mk wk   otherwise,   (3.11)
where s is the winning neuron in the target localization EKM, and Mk and wk are, respectively, the control parameter matrix and sensory weight vector of the winning neuron k in the motor control EKM (step 2 of neural integration). The constant vector c* denotes the upper limit of the physically realizable motor control signal. For instance, for the Khepera robots, c consists of the motor speeds vl and vr of the robot's left and right wheels. In this case, we define c ≤ c* if vl ≤ vl* and vr ≤ vr*. Note that if c is beyond c*, simply saturating the wheel speeds does not work. For example, if the target is far away and not aligned with the robot's heading, then saturating both wheel speeds only moves the robot forward. Without correcting the robot's heading, the robot will not be able to reach the target. Hence, the winning neuron's input weights wk are used to generate a physically realizable motor control output. This motor control is the best substitute for the sensory input u because wk is closer to u than any other weight wi, i ≠ k.
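The case analysis of equation 3.11 can be sketched as follows (an illustrative NumPy sketch; the elementwise limit test stands in for the condition c ≤ c*):

```python
import numpy as np

def motor_control(Mk, wk, u, c_max, k, s):
    """Motor control with fallback (eq. 3.11, illustrative sketch).

    Returns Mk u when it is physically realizable (elementwise within
    c_max) and the motor control winner k agrees with the target
    localization winner s; otherwise returns Mk wk, the realizable
    control for the sensory weight vector closest to u."""
    c = Mk @ u
    if np.all(np.abs(c) <= c_max) and k == s:
        return c
    return Mk @ wk
```

The fallback branch implements the substitution described above: rather than saturating the wheel speeds, the winner's own sensory weight vector wk is pushed through its local linear model.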
Figure 4: Output activities of neurons in (A) target localization EKM, (B) obstacle localization EKM activated by distance sensor at −π/6 radian, (C) obstacle localization EKM activated by distance sensor at 0 radian, (D) obstacle localization EKM activated by distance sensor at π/6 radian, (E) obstacle localization EKMs combined, and (F) localization EKMs combined during neural integration. Each dot denotes the sensory weights wi = (αi , di )T of a neuron. A darker dot implies that the neuron has a stronger output activity. A lighter dot implies the opposite.
In activating the motor control EKM (see Figure 3D), the obstacle fields are subtracted from the target field (see equation 3.9). If the target lies within the obstacle fields, the activation of the motor control EKM neurons close to the target location will be suppressed. Consequently, another neuron at a location that is not inhibited by the obstacle fields becomes most highly activated (see Figure 3D). This neuron produces a control parameter that moves the robot away from the obstacle. While the robot moves around the obstacle, the target and obstacle localization EKMs are continuously updated with the current locations and directions of the target and obstacles. Their interactions with the motor control EKM produce fine, smooth, and accurate motion control of the robot to negotiate the obstacle and move toward the target until it reaches the goal state u(T) at time step T. Figure 4 shows the output activities of the neurons in different EKMs produced in response to the environment setup depicted in Figure 3. In Figure 4A, the output activities of the neurons in the target localization EKM form the target field (see Figure 3B). Since the neuron at d = 0.16 m and
α = 0.1 radian (darkest dot in Figure 4A) is closest to the target location, it is most strongly activated and thus produces the highest output activity. This neuron corresponds to the black dot in Figure 3B. Its neighboring neurons also produce relatively strong output activities to form the target field used in overcoming the concave obstacle. The obstacle localization EKMs shown in Figures 4B, 4C, and 4D are activated by distance sensors positioned at −π/6, 0, and π/6 radian, respectively. For each EKM induced by a sensor, the neuron that is closest to its sensed obstacle becomes most strongly activated. These activated neurons are at d = 0.132 m and α = −0.4 radian (darkest dot in Figure 4B), d = 0.165 m and α = 0 radian (darkest dot in Figure 4C), and d = 0.139 m and α = 0.4 radian (darkest dot in Figure 4D). They correspond to the three crosses in Figure 3C. Figure 4E shows the combined output activities of the neurons in the obstacle localization EKMs, which form the obstacle fields (see Figure 3C). Since the target lies within the obstacle fields, the strong excitatory activities of the target localization EKM neurons that are close to the target location will be suppressed. As a result, another neuron at d = 0.098 m and α = 0.9 radian (darkest dot in Figure 4F) that is not inhibited by the obstacle fields becomes most strongly activated in the motor control EKM. This neuron corresponds to the winning neuron in Figure 3D. It produces a control parameter that enables the robot to negotiate the concave obstacle. Recall that the various modules run asynchronously at different rates (see section 3.1). In particular, the obstacle avoidance module runs at a faster rate than the target-reaching module. During neural integration, the localization EKMs remain activated until they are updated asynchronously at the next sensing cycle.
So, the motor control EKM can receive continuous inputs from the localization EKMs and is always able to produce a motor signal as and when new inputs are sensed.
3.5 Self-Organization of EKMs. In contrast to most existing off-line learning methods (Bruske & Sommer, 1995; Gorinevsky & Connolly, 1994; Karayiannis & Mi, 1997; Moody & Darken, 1989), online learning is adopted for the EKMs. Initially, the EKMs have not been trained, and the motor control vectors c generated are inaccurate. Nevertheless, the EKMs self-organize, using these control vectors c and the corresponding robot displacements v produced by c, to map v to c indirectly. Note that v is used as the training input rather than the sensory input u. Since the untrained EKMs produce inaccurate motor control vectors c in response to u (i.e., c does not move the robot to the target location specified by u), the robot would learn the wrong sensorimotor mapping if u were used as the corresponding training input. On the other hand, v is the actual displacement that corresponds to c. Using v as the training input enables the robot to learn the correct mapping as it moves around. Hence, its sensorimotor control becomes
more accurate. At this stage, the online learning just fine-tunes the indirect mapping. The self-organized learning algorithm (in an obstacle-free environment) is as follows:

Self-Organized Learning. Repeat:

1. Get sensory input u.

2. Execute the target-reaching procedure, and move the robot.

3. Get the new sensory input u′ and compute the actual displacement v as the difference between u′ and u.

4. Use v as the training input to determine the winning neuron k (same as step 1 of target localization except that u is replaced by v).

5. Adjust the weights wi of neurons i in the neighborhood Nk of the winning neuron k toward v,

∆wi = η G(k, i)(v − wi),   (3.12)
where G(k, i) is a gaussian function of the distance between the positions of neurons k and i in the EKM and η is a constant learning rate. This step is similar to the self-organization of Kohonen's self-organizing map.

6. Update the weights Mi of neurons i in the neighborhood Nk to minimize the error e:

e = (1/2) G(k, i) ‖c − Mi v‖^2.   (3.13)
That is, apply a recursive stochastic approximation algorithm, which can be cast into this general form,

∆Mi = −η (∂e/∂Mi) Hi,   (3.14)
where Hi is a weighting matrix. If Hi = I, a first-order learning method, gradient descent, is obtained from equation 3.14:

∆Mi = −η (∂e/∂Mi) I = η G(k, i)(c − Mi v)v^T.   (3.15)
In the case of the quadratic error function e (see equation 3.13), learning can be accelerated by a second-order learning method (Battiti, 1992). This can be achieved by setting Hi to Ri^{-1}, where Ri is a Gauss-Newton approximation of the Hessian ∂^2 e/∂Mi^2. A second-order learning method, recursive least squares (Glentis, Berberidis,
& Theodoridis, 1999), is thus derived from equation 3.14 with η = 1 (optimum step size):

Ri^{-1} = (1/λ) [ (1 − λ) Ri^{-1} − (Ri^{-1} v v^T Ri^{-1}) / (λ/G(k, i) + v^T Ri^{-1} v) ],   (3.16)

∆Mi = −(∂e/∂Mi) Ri^{-1} = G(k, i)(c − Mi v) v^T Ri^{-1},   (3.17)
where λ is a constant forgetting rate and Ri^{-1} is initialized to I. Note that the recursive online update of Ri^{-1} (see equation 3.16) is obtained using the matrix inversion lemma to avoid a costly matrix inversion operation (Haykin, 2002). Each update of Mi requires O(n^2) computations and O(n^2) additional memory to store Ri^{-1}, where n is the number of dimensions of v. In contrast, gradient descent requires O(n) computations and no additional memory. The performance of these two learning methods is compared in section 4.1.

The target and obstacle localization EKMs self-organize in the same manner as the motor control EKM except that step 6 is omitted. At each training cycle, the weights of the winning neuron k and its neighboring neurons i are modified. The amount of modification is proportional to G(k, i), which decreases with the distance between neurons k and i in the EKM. The input weights wi are updated toward the actual displacement v, and the control parameters Mi are updated so that they map the displacement v to the corresponding motor control c. After self-organization has converged, the neurons will stabilize in a state such that v = wi and c = Mi v = Mi wi. For any winning neuron k, given that u = wk, the neuron will produce a motor control output c = Mk wk, which yields a desired displacement of v = wk. If u ≠ wk but u is close to wk, the motor output c = Mk u produced by neuron k will still yield the correct displacement if linearity holds within the input region that activates neuron k. Thus, given enough neurons to produce an approximate linearization of the sensory input space U, the indirect-mapping EKM can produce finer and smoother motion control than the direct-mapping EKM, as shown in section 4.1.
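One pass of the self-organized learning step can be sketched as follows, using the first-order gradient descent variant (equations 3.12 and 3.15); the RLS variant of equations 3.16 and 3.17 would additionally maintain a matrix Ri^{-1} per neuron. The lattice layout, learning rate, and neighborhood width are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def learn_step(W, M, k, v, c, grid, eta=0.2, sigma=1.0):
    """One self-organized learning step (eqs. 3.12 and 3.15, sketch).

    W    : (N, 2) sensory weight vectors, updated in place
    M    : list of N control parameter matrices (2 x 2)
    k    : winning neuron for the training input v
    v, c : actual displacement and the motor control that produced it
    grid : (N, 2) neuron positions in the map lattice (assumed layout)
    """
    # G(k, i): gaussian of the lattice distance between neurons k and i
    G = np.exp(-np.sum((grid - grid[k])**2, axis=1) / (2 * sigma**2))
    # eq. 3.12: pull sensory weights toward the actual displacement v
    W += eta * G[:, None] * (v - W)
    # eq. 3.15: gradient descent on e = (1/2) G(k,i) ||c - Mi v||^2
    for i in range(len(M)):
        M[i] += eta * G[i] * np.outer(c - M[i] @ v, v)
    return W, M
```

Note that the pair (v, c) plays the role of the training datum: the displacement v is the input and the control c that produced it is the target, which is why even an initially untrained EKM learns the correct mapping as the robot moves.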
4 Experiments and Discussion

4.1 Online Learning of Target-Reaching Motion. This section presents a quantitative evaluation of the indirect-mapping EKM in online sensorimotor learning of the robot's target-reaching motion. For the purpose of evaluating performance, the following network architectures were compared:
1. B15: BFN with 15 × 15 neurons

2. D15: direct-mapping EKM with 15 × 15 neurons

3. G9: indirect-mapping EKM with 9 × 9 neurons trained by gradient descent

4. G12: indirect-mapping EKM with 12 × 12 neurons trained by gradient descent

5. G15: indirect-mapping EKM with 15 × 15 neurons trained by gradient descent

6. R15: indirect-mapping EKM with 15 × 15 neurons trained by recursive least squares

Our implementation of BFN was similar to those proposed by Bruske and Sommer (1995), Hartman and Keeler (1991), Karayiannis and Mi (1997), and Moody and Darken (1989), except that it was trained online rather than off-line. The basis function centers were trained in a similar manner as the input weights of the indirect-mapping EKM (see equation 3.12). Each basis function width was updated to approach the Euclidean distance between its center and that of its nearest neighbor (Hartman & Keeler, 1991; Moody & Darken, 1989; Platt, 1991). The output weights were trained by gradient descent. We also attempted to train the basis function centers with gradient descent (Ghosh & Nag, 2001; Karayiannis, 1999; Platt, 1991; Poggio & Girosi, 1990; Wettschereck & Dietterich, 1992), but learning was unsuccessful despite extensive tuning of parameters. Although the robot learned to move toward the target locations successfully, it was not able to come to a stop at these locations, even after prolonged training. One possible explanation, as detailed by Moody and Darken (1989), is that gradient descent training of the basis function centers may lead to unpredictable target-reaching motions because the centers are sometimes squeezed out of the regions of the input space that contain data. Furthermore, learning converges slowly due to the nonlinear optimization. In contrast, the self-organization of the input space in our implemented BFN is data-centric: more neurons are committed to input regions with dense sampling of data during online learning, which improves the resolution in these regions (see section 1).
Faster convergence in learning has also been reported in this case (Moody & Darken, 1989). The tests were performed using Webots (http://www.cyberbotics.com), a 3D, kinematic, sensor-based simulator for Khepera mobile robots, which incorporates 10% white noise in its sensors and actuators. The simulator computes the trajectories and sensory inputs of a robot situated in an environment corresponding to a given physical setup. The resulting simulation allows the controller to be transferred to a real robot without changes (Michel, 2004). The simulated behaviors are very close to those of a real robot, as demonstrated in these works (Hayes, Martinoli, & Goodman,
2002; Ijspeert, Martinoli, Billard, & Gambardella, 2001; Martinoli, Ijspeert, & Mondada, 1999). In the experiments, the neural networks were trained in a 5 m by 5 m obstacle-free environment. Each training-testing trial took 100,000 time steps, and each time step for target-reaching motion lasted 1.024 sec. During training, the input weights were initialized to correspond to regularly spaced locations in the sensory input space U. The robot began its network training at the center of the environment, and a randomly selected sequence of targets was presented. The robot's task was to move to the targets, one at a time, and weight modification was performed at each time step after the robot had made a move.

At each interval of 10,000 training steps, a fixed testing procedure was conducted. In each test, the robot began at the center of the environment and was presented with 50 random target locations in sequence. The robot's task was to move to each of the target locations. No training was performed during this testing phase. The training-testing trial was repeated five times, and testing performance was averaged over the five trials.

Three testing performance indices were measured in the training-testing trials. The first index is the mean positioning error E, which measures the average distance ε_i between the center of the robot and the ith target location after it has come to a stop (i.e., motor control c = 0):

E = \frac{1}{RN} \sum_i \varepsilon_i,  (4.1)

where R is the number of trials and N is the number of testing target locations. The second index, normalized time-to-target T, measures how long the robot takes to reach the target locations:

T = \frac{1}{RN} \sum_i \tilde{t}_i, \qquad \tilde{t}_i = \frac{t_i}{l_i},  (4.2)

where t_i is the time the robot takes to reach the ith target, l_i is the straight-line distance between targets i − 1 and i, and \tilde{t}_i is the normalized time taken to reach target i. That is, normalized time to target measures the average amount of time the robot takes to travel a distance of 1 m toward a target. The third index, mean deviation from straight-line trajectory D, measures how straight or wavy the robot's trajectory is:

D = \frac{1}{RN} \sum_i \tilde{\delta}_i, \qquad \tilde{\delta}_i = \frac{|d_i - l_i|}{l_i},  (4.3)

where d_i is the distance traveled to reach target location i and \tilde{\delta}_i is the deviation from the straight-line trajectory for target i.
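Equations 4.1 to 4.3 can be computed directly from per-target logs pooled over all R trials and N targets, since the mean over the R·N pooled values equals the double sum divided by RN. A sketch, with the array names being assumptions:

```python
import numpy as np

def performance_indices(eps, t, d, l):
    """Compute the three testing indices of equations 4.1-4.3 from
    per-target logs pooled over all R trials and N targets.
    eps: stopping error per target (mm); t: time to target (s);
    d: distance traveled (m); l: straight-line distance to target (m)."""
    eps, t, d, l = map(np.asarray, (eps, t, d, l))
    E = eps.mean()                  # mean positioning error, eq. 4.1
    T = (t / l).mean()              # normalized time to target, eq. 4.2
    D = (np.abs(d - l) / l).mean()  # mean deviation from straight line, eq. 4.3
    return E, T, D

# two hypothetical targets: stopped 2 mm and 4 mm away, taking 10 s per meter
E, T, D = performance_indices(eps=[2.0, 4.0], t=[10.0, 20.0],
                              d=[1.2, 2.2], l=[1.0, 2.0])
# E = 3.0 mm, T = 10.0, D ≈ 0.15
```

Lower values of all three indices indicate more accurate, faster, and straighter target-reaching motion.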
Figure 5: Performance comparison between various network architectures (B15, D15, G9, G12, G15, R15) in (A) mean positioning error E (mm) and (B) normalized time to target T (m−1), plotted on log scale against training time steps (×10,000).
Figures 5A and 5B show, respectively, how the mean positioning error and normalized time to target decreased during the self-organized learning of the various network architectures. In Figure 5A, both B15 and D15 stabilized as early as 10,000 time steps but achieved much poorer E performance than the other networks. G9, G12, G15, and R15 stabilized more gradually at about 70,000, 60,000, 50,000, and 15,000 time steps, respectively, but they all achieved lower E. Notice that the larger indirect-mapping EKMs stabilized faster. To explain this counterintuitive result, note that the standard deviations of the gaussian functions in equations 3.12 and 3.15 (see section 3.5) were the same for all EKMs. That is, the proportion of neurons requiring weight updates at each time step was greater in smaller EKMs than in larger EKMs. As such, the neurons in smaller EKMs updated their weights more frequently and thus stabilized more slowly.

While the E performance shows the quality of the robot positioning at the target location, the T and D performance demonstrate the quality of the robot's trajectory. We illustrate only T in Figure 5B; the convergence of D is similar. The self-organization of D15 stabilized at about 50,000 time steps. G9, G12, and G15 stabilized at about 90,000, 70,000, and 50,000 time steps, respectively, which supports the observation that larger EKMs stabilize more quickly. R15 stabilized at 40,000 time steps, faster than the EKMs trained by gradient descent, so its training-testing process was stopped at 50,000 time steps, which was sufficient for its self-organized learning to stabilize. Although B15 stabilized as early as 10,000 time steps, its poorer performance relative to the other networks became obvious with increasing training time. Table 1 shows the test results after training.
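The significance judgments reported with Tables 2 and 4 below rest on t-tests over the five training-testing trials (Mendenhall & Sincich, 1994). A stdlib-only sketch of the pooled two-sample t statistic follows; the per-trial values are hypothetical, and the exact variant used in the paper (pooled vs. Welch, one- vs. two-sided) is not stated:

```python
import math
from statistics import mean, variance

def t_statistic(a, b):
    """Pooled two-sample t statistic for comparing the mean performance
    of two networks over repeated trials (R = 5 in these experiments)."""
    na, nb = len(a), len(b)
    # pooled sample variance over both groups
    sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(sp2 * (1 / na + 1 / nb))

# hypothetical per-trial positioning errors (mm) for two networks
e_g9 = [3.1, 3.5, 3.2, 3.8, 3.4]
e_d15 = [8.0, 8.5, 8.9, 8.2, 8.3]
t = t_statistic(e_g9, e_d15)
# |t| is then compared with the critical value of the t distribution
# with na + nb - 2 degrees of freedom to obtain a significance level
```

A significance level above the 0.1 threshold used in the tables means the two networks' results are treated as not significantly different.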
Table 1: Performance Comparison Between Networks After Training.

Network   Total Parameters   E (mm)          T (m−1)          D
B15       1350               10.02 ± 3.81    166.40 ± 28.49   0.11 ± 0.06
D15       900                8.37 ± 2.25     36.78 ± 11.11    0.18 ± 0.15
G9        486                3.40 ± 1.83     15.31 ± 4.64     0.10 ± 0.04
G12       864                3.89 ± 1.61     19.31 ± 8.47     0.06 ± 0.02
G15       1350               3.23 ± 0.66     18.96 ± 4.37     0.07 ± 0.02
R15       1350               1.29 ± 0.14     17.00 ± 2.48     0.04 ± 0.01

All indirect-mapping EKMs achieved lower mean positioning error, normalized time to target, and mean deviation from straight-line trajectory than D15 and B15. R15 achieved much lower E and D than G9, G12, and G15. Among the indirect-mapping EKMs trained by gradient descent, G12 enabled the robot to travel the straightest path to stop at the target location. Reducing the number of neurons to 9 × 9 caused the path to be more convoluted, while increasing the number of neurons to 15 × 15 increased, instead of decreased, D slightly. This phenomenon can be explained by how the neurons self-organized in the sensory input space U.

Neurons in G15 self-organized into four clusters: d = 0 m and α = −3, 0, +3 radian (see Figure 6C). Neurons in G9 and G12 self-organized into two clusters only: d = 0 m and α = 0 radian (see Figures 6A and 6B). With more neurons, G15 gained the flexibility of backward motion (α = −3, +3 radian). However, these two regions of input space were less well sampled by the neurons than the region at α = 0 radian. As such, if a distant target appeared behind a robot using G15, its backward motion would produce a wavier path. A robot using G12 would instead turn around to face the target via the cluster at d = 0 m before moving forward along a much straighter path. As for G9 (see Figure 6A), since its neurons sampled the input space at α = 0 radian more sparsely than those in G15 at α = 0, +3, −3 radian (i.e., both forward and backward motion), it inevitably produced a more convoluted path than G15 regardless of whether the target was in front or behind.

Table 1 also shows that a smaller mean deviation did not necessarily imply a shorter normalized time to target. During learning, direction had priority over distance in the competition between EKM neurons (see equation 3.4), so a larger EKM had more neurons allocated for adjusting orientation without moving long distances (i.e., the d = 0 m cluster in Figure 6). Consequently, the robot might take short motion steps to adjust its orientation before moving straight to the target, so its trajectory deviated less from the straight-line path.
The advantages of indirect-mapping EKMs over D15 and B15 can also be assessed from the self-organization results (see Figure 6). The neurons in the indirect-mapping EKMs cover larger areas in the sensory input space than those in D15 or B15. Moreover, they sample distances up to 0.16 m, whereas D15 and B15 neurons sample distances only up to 0.12 m.

Figure 6: Self-organization results of (A) G9, (B) G12, (C) G15, (D) R15, (E) D15, and (F) B15 taken after one of the training trials. Each dot denotes the weights w_i = (α_i, d_i)^T of a neuron, plotted with α (rad) on the horizontal axis and d (m) on the vertical axis.

Note
that 0.16 m is the farthest that a Khepera robot can move in a single time step of 1 second. That is, indirect-mapping EKMs sample the sensory input space more completely than do D15 and B15 and thus produce finer, smoother, and more efficient motor control.

To determine whether there are statistically significant differences between the test results of the different networks, t-tests were performed. In Table 2, a large value indicates that the test results of two networks are similar, that is, not significantly different (Mendenhall & Sincich, 1994). The E and T of the indirect-mapping EKMs are significantly different from those of D15 and B15 because the t-test values are less than 0.1. However, the differences in E and T among G9, G12, and G15 are not significant. This means that G9 is sufficient for the robot to stop very close to the targets at a rate as fast as G12 or G15. While the differences in E and D between R15 and the indirect-mapping EKMs trained by gradient descent are significant, the differences in T are not. The low D of G15 is not significantly different from that of B15. The difference in D between G9 and D15 or B15 is also not significant. This means that G9 achieves similar D performance to D15 and B15 even though it uses fewer network weights.

Often a robot is required to move through several checkpoints in a complex environment before stopping at the goal. Given that the radius of the Khepera robot is 30 mm, it is reasonable to regard the robot as having reached
Table 2: Significance Levels from t-Tests on Similarity in Performance Between Networks After Training.

E (mm)    B15     D15     G15     G12     G9
R15       0.00    0.00    0.00    0.00    0.02
G9        0.00    0.00    0.42    0.33
G12       0.01    0.00    0.21
G15       0.00    0.00
D15       0.22

T (m−1)   B15     D15     G15     G12     G9
R15       0.00    0.00    0.20    0.29    0.25
G9        0.00    0.00    0.12    0.19
G12       0.00    0.01    0.47
G15       0.00    0.01
D15       0.00

D         B15     D15     G15     G12     G9
R15       0.01    0.04    0.01    0.04    0.00
G9        0.43    0.16    0.09    0.02
G12       0.04    0.06    0.09
G15       0.12    0.09
D15       0.19
(and touched) a target checkpoint if the distance-to-target ε is less than 30 mm. Figure 7 illustrates the performance comparison under this target-reaching criterion after the robot had been trained. The target-reaching probability P(ε) measures the probability of the robot reaching closer than a distance ε (with or without stopping) from the target locations. The normalized time-to-target T(ε) measures how long the robot takes to reach closer than a distance ε (with or without stopping) from the target locations. The mean deviation from straight-line trajectory D(ε) measures how straight or wavy the robot's trajectory is.

Test results show that with indirect-mapping EKMs, the robot could get much closer to the targets with higher probability (see Figure 7A) and reach the targets much faster (see Figure 7B) than with D15 or B15. Moreover, it could travel in straighter paths (see Figure 7C) than with D15. Table 3 shows the test results for ε = 5 mm. R15 outperformed G12 and G15 in P(5), and G9 and G15 in D(5). Among the indirect-mapping EKMs trained by gradient descent, G9 enabled the robot to reach closer than 5 mm from the target locations with higher probability than G15. This could be because G9 had more neurons than G15 at very small, nonzero d and |α| < 1.57 radian in the input space. This set of neurons was responsible for moving the robot, when less than 10 mm away from the target location, forward to closer than 5 mm; as a result, a higher P(5) could be achieved. It was also observed that R15, which achieved the highest P(5), had more
Figure 7: Performance comparison between various network architectures in (A) target-reaching probability P(ε), (B) normalized time to target T(ε) (m−1), and (C) mean deviation from straight-line trajectory D(ε) after training, each plotted against ε (mm) from 0 to 25 for B15, D15, G9, G12, G15, and R15.

Table 3: Performance Comparison Between Networks.

Network   Total Parameters   P(5)            T(5) (m−1)       D(5)
B15       1350               0.84 ± 0.05     144.70 ± 20.16   0.04 ± 0.03
D15       900                0.54 ± 0.07     18.61 ± 1.66     0.18 ± 0.08
G9        486                0.95 ± 0.08     8.83 ± 0.96      0.07 ± 0.02
G12       864                0.92 ± 0.09     8.48 ± 0.74      0.03 ± 0.02
G15       1350               0.86 ± 0.04     8.96 ± 0.47      0.06 ± 0.01
R15       1350               1.00 ± 0.01     8.85 ± 0.35      0.03 ± 0.01
neurons than G9 in this region of the input space, whereas G15 had many more neurons at approximately zero d than at very small, nonzero d. G12 achieved lower D(5) than G9 and G15; this outcome can be justified, in a similar manner, by the explanation given for the previous test results on D. Comparing the normalized time-to-target and mean deviation values between Tables 1 and 3 shows the additional time and distance the robot required to come to a stop.
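The three ε-dependent indices can be sketched as below. The text does not spell out how T(ε) and D(ε) average over targets that are never approached within ε, so this sketch restricts those two averages to reached targets; all names are assumptions:

```python
import numpy as np

def target_reaching_stats(min_dist, t, d, l, eps):
    """Target-reaching indices as functions of the distance threshold eps:
    P(eps) - fraction of targets approached closer than eps,
    T(eps) - normalized time to get within eps, over reached targets,
    D(eps) - mean deviation from the straight-line path, over reached targets.
    min_dist: closest approach per target; t, d: time and distance traveled
    until that approach; l: straight-line distance to the target."""
    min_dist, t, d, l = map(np.asarray, (min_dist, t, d, l))
    reached = min_dist < eps
    P = reached.mean()
    T = (t[reached] / l[reached]).mean()
    D = (np.abs(d[reached] - l[reached]) / l[reached]).mean()
    return P, T, D

# two hypothetical targets: one approached to 1 mm, one missed by 10 mm
P, T, D = target_reaching_stats(min_dist=[1.0, 10.0], t=[10.0, 99.0],
                                d=[2.4, 5.0], l=[2.0, 1.0], eps=5.0)
# P = 0.5; T and D are computed from the reached target only
```

Sweeping eps from 0 to 25 mm reproduces the kind of curves shown in Figure 7.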
Table 4: Significance Levels from t-Tests on Similarity in Performance at ε = 5 mm Between Networks.

P(5)         B15     D15     G15     G12     G9
R15          0.00    0.00    0.00    0.04    0.12
G9           0.02    0.00    0.03    0.29
G12          0.06    0.00    0.11
G15          0.26    0.00
D15          0.00

D(5)         B15     D15     G15     G12     G9
R15          0.12    0.00    0.00    0.19    0.00
G9           0.09    0.01    0.14    0.02
G12          0.31    0.00    0.03
G15          0.20    0.01
D15          0.00

T(5) (m−1)   B15     D15     G15     G12     G9
R15          0.00    0.00    0.35    0.17    0.48
G9           0.00    0.00    0.40    0.27
G12          0.00    0.00    0.13
G15          0.00    0.00
D15          0.00
One other interesting comparison concerns the total number of parameters or weights used by the various networks (see Tables 1 and 3): G12 uses fewer weights than both D15 and B15 but still performs better. Table 4 shows the t-test values at ε = 5 mm. While the differences in P(5) and D(5) between R15 and the indirect-mapping EKMs trained by gradient descent are significant, the differences in T(5) are not. Among the indirect-mapping EKMs trained by gradient descent, the t-tests for P(5) show no significant difference between G9 and G12 or between G12 and G15. The differences in T(5) among G9, G12, and G15 are also not significant.

To summarize, R15 achieved the best overall performance among the various networks, in particular in E, D, P(5), and D(5). Its T and T(5) performance was not significantly different from that of the other indirect-mapping EKMs, and among the indirect-mapping EKMs it stabilized most quickly. Although B15 stabilized much faster than the other networks, it produced the poorest performance in E, T, and T(5). D15 offered the poorest performance in D, D(5), and P(5).

4.2 Neural Network Ensemble for Target-Reaching Motion with Obstacle Avoidance. This section evaluates qualitatively and quantitatively the performance of cooperative EKMs in goal-directed, collision-free robot motion in complex, unpredictable environments. The experiments
Figure 8: Negotiating an unforeseen concave obstacle that was 34 cm wide and 12 cm deep. (A) The robot using command fusion was trapped, but (B) the one adopting cooperative EKMs successfully moved around the obstacle. Axes show width and depth in meters, from −0.4 to 0.4.
were also performed using Webots. Twelve directed long-range sensors were modeled around the robot's body of radius 3 cm. Each sensor had a range of 17 cm, enabling the detection of obstacles at 20 cm or nearer from the robot's center, and a resolution of 0.5 cm to simulate noise.

Two tests were performed to compare cooperative EKMs with another ensemble method (Low, Leow, & Ang, 2002). The latter approach, termed command fusion, linearly combines, using a weighted sum, the motion control outputs of different neural networks implementing different behaviors. This is a widely used technique for integrating the motion control outputs produced by different neural networks (Hashem, 1997; Haykin, 1999; Jacobs, 1995). In our case, the target-reaching motion is produced by an indirect-mapping EKM, while obstacle avoidance is performed using the method of Braitenberg's type-3C vehicle (Braitenberg, 1984). To elaborate, when the robot senses the presence of an obstacle, say, in front and on the left, the right motor rotates backward faster than the left motor rotates forward, thus turning the robot away from the obstacle. For both ensemble methods, the target-reaching and obstacle avoidance modules ran at intervals of 256 ms and 128 ms, respectively. The robot's performance was assessed in an environment under two unforeseen conditions: (1) a concave obstacle and (2) a narrow doorway between closely spaced obstacles.

In the first test (see Figure 8), the robot fitted with command fusion got trapped by the concave obstacle (see Figure 8A). The target-reaching behavior tried to move the robot forward to reach the target, while the obstacle avoidance behavior moved it backward to avoid the obstacle. The two outputs cancelled each other, causing the robot to be trapped by the obstacle. The robot with cooperative EKMs could overcome the obstacle and reach the goal successfully (see Figure 8B).
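The command fusion failure mode just described can be seen in a toy sketch. The (left, right) wheel-speed representation and the equal weights are assumptions for illustration, not the exact scheme of Low et al. (2002):

```python
import numpy as np

def command_fusion(commands, weights):
    """Weighted-sum fusion of (left, right) wheel-speed commands proposed
    by independent behavior modules (illustrative only)."""
    commands = np.asarray(commands, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return weights @ commands / weights.sum()

# target-reaching proposes full speed ahead; obstacle avoidance proposes
# backing straight off; with equal weights the fused command cancels out
target_cmd = [1.0, 1.0]
avoid_cmd = [-1.0, -1.0]
fused = command_fusion([target_cmd, avoid_cmd], [0.5, 0.5])
# fused == [0.0, 0.0]: the robot stands still, as in the trap of Figure 8A
```

Each behavior's command is reasonable in isolation, but their weighted sum is a standstill, which is exactly why the command-fusion robot gets trapped in front of the concave obstacle.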
Figure 9: Maximum width of concave obstacle (see Figure 8B) that a robot with cooperative EKMs can overcome (0 to 700 cm), for different combinations of obstacle depth (cm) and robot-sensing range (cm).
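The discussion that follows turns on how the gaussian target and obstacle fields activate or inhibit EKM neurons. The paper's field equations (3.2 to 3.8) are not reproduced in this section, so the sketch below uses a generic axis-aligned elongated gaussian over the neuron weights w_i = (α_i, d_i) purely for illustration; every name and the functional form are assumptions:

```python
import numpy as np

def elongated_gaussian_field(weights, center, sigma_alpha, sigma_d):
    """Illustrative elongated gaussian activity profile over EKM neurons
    with weights w_i = (alpha_i, d_i) in the (direction, distance) input
    space; sigma_alpha and sigma_d control the elongation along each axis
    (cf. the sigma parameters of the target and obstacle fields)."""
    alpha, d = weights[:, 0], weights[:, 1]
    return np.exp(-(alpha - center[0]) ** 2 / (2 * sigma_alpha ** 2)
                  - (d - center[1]) ** 2 / (2 * sigma_d ** 2))

w = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 0.1]])
narrow = elongated_gaussian_field(w, (0.0, 0.0), sigma_alpha=0.5, sigma_d=0.05)
wide = elongated_gaussian_field(w, (0.0, 0.0), sigma_alpha=2.475, sigma_d=0.05)
# enlarging sigma_alpha raises the activity of neurons far from the target
# direction, i.e., the field elongates along the alpha axis
```

Under this stand-in, increasing the α-axis sigma elongates the field perpendicular to the target direction, which mirrors the role the text assigns to a large σ_aα in negotiating concave obstacles.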
It is noted that a robot with cooperative EKMs can still get trapped if the obstacle is so concave that the obstacle fields cannot completely inhibit the neurons at or near the target location. Figure 9 shows the maximum obstacle width that a robot with cooperative EKMs can overcome for varying obstacle depths and robot-sensing ranges. Given a fixed sensing range, the maximum negotiable obstacle depth decreases with increasing width. When the sensing range increases, the robot with cooperative EKMs can negotiate an extremely wide concave obstacle if it is not too deep; conversely, to overcome a fairly deep obstacle, its width cannot be too large. This limitation, however, does not diminish the significance of our method, as it is simpler than many existing reactive robot motion methods for overcoming unforeseen concave obstacles (Lagoudakis & Maida, 1999; Liu, Ang, Krishnan, & Lim, 2000; Zelek & Levine, 1996). In particular, it utilizes only local information about the target location and the unforeseen obstacles, as opposed to motion planners (Latombe, 1999), which require global knowledge of the environment to operate.

In the second test (see Figure 10), the robot endowed with command fusion could not pass through the narrow doorway between closely spaced obstacles (see Figure 10A) because its obstacle avoidance behavior counteracted the target-reaching behavior. In contrast, the robot with cooperative EKMs could always traverse the narrow doorway to the goal successfully (see Figure 10B). These two simple tests show that under command fusion, though each neural network proposes an action that is optimal by itself, the weighted sum of these action commands produces a combined action that may not satisfy the overall task. Cooperative EKMs, however, consider the activity signals
Figure 10: Passing through an unforeseen narrow doorway, 86 mm wide, between closely spaced obstacles. (A) The robot using command fusion was trapped, but (B) the one adopting cooperative EKMs successfully passed through the narrow doorway to the goal.
of each localization EKM and integrate them to determine an action that can satisfy each localization EKM to a certain degree. Such tightly coupled interaction between the localization EKMs and the motor control EKM in the cooperative EKMs framework enables the robot to achieve more complex tasks.

Recall that the standard deviations σ of the gaussian functions for the target and obstacle fields play an important role in the robot motion capabilities of cooperative EKMs (see sections 3.2 and 3.3). For the target field (see Figure 3A), σ_aα and σ_ad control the elongation of the target field perpendicular to and along the target direction, respectively. Parameters σ_bα and σ_bd achieve a similar effect for the obstacle field. For the negotiation of concave obstacles (e.g., see Figure 8B), the target field has to be considerably elongated perpendicular to the target direction, which requires a large enough σ_aα. However, as this value increases, the tendency of the robot to move along the shorter path to the goal via the narrow doorway (see Figure 10B) decreases, and the longer detour featured in Figure 8B is increasingly preferred in the situation of Figure 10B. When σ_aα is large, subtraction of the obstacle field from the highly elongated target field (see Figure 3D) in the motor control EKM may result in the neuron at the edge of the concave obstacle, rather than the neuron at the narrow doorway, being more highly activated. Nonetheless, the above two tests and the subsequent ones can all be achieved with a single σ_aα value of 2.475.

If σ_ad is too small, the target field may be totally suppressed by the obstacle field, depending on σ_aα. This may or may not cause the robot to be trapped in the concave obstacle, since any neuron not inhibited by the obstacle field can potentially be activated. If σ_ad is too large, the robot may get trapped in the concave obstacle: superposition of the fields may cause the neuron in the cavity of
Figure 11: Motion of robot (gray) in an environment with two unforeseen obstacles (black) moving in anticlockwise circular paths. The robot could successfully negotiate past the extended walls and the dynamic obstacles to reach the goal (small black dot).
the concave obstacle, rather than the neuron at the edge of the obstacle, to be more highly activated. In all the tests, σ_ad is set to 0.0495. The parameter values of σ_bα and σ_bd have to be large enough for the robot to avoid collisions with obstacles as well as to discriminate whether a doorway is wide enough to pass through. However, if these values are too large, the robot cannot move to target locations near the obstacles or detect the presence of narrow but traversable doorways. In all the tests, σ_bα is set to 0.495, while σ_bd uses the values given in equation 3.8. This evaluation of the target and obstacle field parameters highlights their significance to the robot motion capabilities of cooperative EKMs. In our future work, we will consider using reinforcement learning to learn appropriate parameter values for negotiating the different obstacles that the robot encounters during motion.

The next two tests demonstrate the capabilities of cooperative EKMs in performing more complex motion tasks. The environment for the first test consisted of three rooms connected by two doorways (see Figure 11). The middle room contained two obstacles moving in anticlockwise circular paths. The robot began in the leftmost room and was tasked to move to the rightmost room. Test results show that the robot was able to negotiate past the extended walls and the dynamic obstacles to reach the goal. Note that this target-reaching motion was completely determined by the cooperation and competition between the EKMs; no global planning was used.

The environment for the second test consisted of three rooms connected by two doorways and some unforeseen static obstacles (see Figure 12). The robot began in the top corner of the leftmost room and was tasked to move
Figure 12: Motion of robot (dark gray) in a complex environment. The checkpoints (small black dots) were located at the doorways and the goal position. The robot could successfully navigate through the checkpoints to the goal by traversing between unforeseen narrowly spaced convex obstacles (light gray) in the first and the last room and overcoming an unforeseen concave obstacle (light gray) in the middle room.
into the narrow corner of the rightmost room via checkpoints plotted by a planner (Low, Leow, & Ang, 2002). The robot was able to move through the checkpoints to the goal by traversing between narrowly spaced convex obstacles in the first and the last room and overcoming an unforeseen concave obstacle in the middle room. The results of these last two tests further confirm the effectiveness of cooperative EKMs in handling complex tasks in complex, unpredictable environments.

5 Conclusion

This article presents a new approach to learning sensorimotor control for complex robot motion tasks using cooperative EKMs. Quantitative evaluation reveals that the indirect-mapping EKM can produce finer, smoother, and more efficient robot motion control than other local learning methods such as the direct-mapping EKM and BFN. Furthermore, training the control parameters of the indirect-mapping EKM with recursive least squares allows faster convergence and better performance than with gradient descent. The cooperation and competition of multiple EKMs enable the nonholonomic mobile robot to negotiate unforeseen concave, closely spaced, and dynamic obstacles, tasks that can easily trap robots controlled by neural network ensembles employing command fusion techniques. Cooperative EKMs can thus augment the reactive capabilities of an autonomous mobile robot significantly.

Recently, we have enhanced cooperative EKMs further to achieve multirobot motion tasks, such that multiple robots fitted with cooperative EKMs can coordinate their tracking of moving targets (see Figure 13). Qualitative and quantitative test results of the improved cooperative EKMs for multirobot tasks are presented in Low, Leow, and Ang
Figure 13: Cooperative tracking of moving targets. When the targets were moving out of the robots' sensory range, the two robots moved in opposite directions to track the targets. In this way, all targets could still be observed by the robots.
(2003) and Low, Leow, and Ang (2004). Our continuing research goal is to generalize this approach to other sensorimotor control problems such as those of static and mobile robot manipulators.
References

Arkin, R. C. (1998). Behavior-based robotics. Cambridge, MA: MIT Press.
Atkeson, C. G., Moore, A. W., & Schaal, S. (1997). Locally weighted learning. Artificial Intelligence Review, 11(1–5), 11–73.
Baird, L. C. (1995). Residual algorithms: Reinforcement learning with function approximation. In Proc. 12th International Conference on Machine Learning (ICML-95) (pp. 30–37). San Francisco: Morgan Kaufmann.
Battiti, R. (1992). First- and second-order methods for learning: Between steepest descent and Newton's method. Neural Comput., 4(2), 141–166.
Battiti, R., & Colla, A. (1994). Democracy in neural nets: Voting schemes for classification. Neural Networks, 7(4), 691–707.
Beer, R. D. (1995). A dynamical systems perspective on agent-environment interaction. Artificial Intelligence, 72(1–2), 173–215.
Braitenberg, V. (1984). Vehicles: Experiments in synthetic psychology. Cambridge, MA: MIT Press.
Bruske, J., & Sommer, G. (1995). Dynamic cell structure learns perfectly topology preserving map. Neural Comput., 7(4), 845–865.
Burgard, W., Cremers, A. B., Fox, D., Hähnel, D., Lakemeyer, G., Schulz, D., Steiner, W., & Thrun, S. (1999). Experiences with an interactive museum tour-guide robot. Artificial Intelligence, 114(1–2), 3–55.
Cameron, S., Grossberg, S., & Guenther, F. H. (1998). A self-organizing neural network architecture for navigation using optic flow. Neural Comput., 10(2), 313–352.
Davids, A. (2002). Urban search and rescue robots: From tragedy to technology. IEEE Intelligent Systems, 17(2), 81–83.
Decugis, V., & Ferber, J. (1998). Action selection in an autonomous agent with a hierarchical distributed reactive planning architecture. In Proc. 2nd International Conference on Autonomous Agents (pp. 354–361). New York: ACM Press.
Dietterich, T. G. (2000). Hierarchical reinforcement learning with the MAXQ value function decomposition. J. Artificial Intelligence Res., 13, 227–303.
Fiorini, P., Kawamura, K., & Prassler, E. (Eds.). (2000). Autonomous Robots, Special Issue on Cleaning and Housekeeping Robots, 9(3).
Ghosh, J., & Nag, A. (2001). An overview of radial basis function networks. In R. J. Howlett & L. C. Jain (Eds.), Radial basis function networks 2: New advances in design (pp. 1–36). New York: Physica-Verlag.
Glentis, G.-O., Berberidis, K., & Theodoridis, S. (1999). Efficient least squares adaptive algorithms for FIR transversal filtering. IEEE Signal Processing Mag., 16(4), 13–41.
Gorinevsky, D., & Connolly, T. H. (1994). Comparison of some neural networks and scattered data approximations: The inverse manipulator kinematics example. Neural Comput., 6(3), 521–542.
Gross, H.-M., Stephan, V., & Krabbes, M. (1998). A neural field approach to topological reinforcement learning in continuous action spaces. In Proc. International Joint Conference on Neural Networks (Vol. 3, pp. 1992–1997). New York: IEEE Press.
Hansen, L. K., & Salamon, P. (1990). Neural network ensembles. IEEE Trans. Pattern Anal. Machine Intell., 12(10), 993–1001.
Hartman, E., & Keeler, D. (1991). Predicting the future: Advantages of semilocal units. Neural Comput., 3(4), 566–578.
Hashem, S. (1997). Optimal linear combinations of neural networks. Neural Networks, 10(4), 599–614.
Hayes, A. T., Martinoli, A., & Goodman, R. M. (2002). Distributed odor source localization. IEEE Sensors, 2(3), 260–271.
Haykin, S. (1999). Neural networks: A comprehensive foundation (2nd ed.). Upper Saddle River, NJ: Prentice Hall.
Haykin, S. (2002). Adaptive filter theory (4th ed.). Upper Saddle River, NJ: Prentice Hall.
Heikkonen, J., & Koikkalainen, P. (1997). Self-organization and autonomous robots. In O. Omidvar & P. van der Smagt (Eds.), Neural systems for robotics (pp. 297–337). Orlando, FL: Academic Press.
Hertzberg, J., Christaller, T., Kirchner, F., Licht, U., & Rome, E. (1998). Sewer robotics. In R. Pfeifer, B. Blumberg, J.-A. Meyer, & S. W. Wilson (Eds.), From animals to animats 5: Proc. International Conference on Simulation of Adaptive Behavior (pp. 427–436). Cambridge, MA: MIT Press.
Howard, A., Matarić, M. J., & Sukhatme, G. S. (2002). Network deployment using potential fields: A distributed, scalable solution to the area coverage problem. In H. Asama, T. Arai, T. Fukuda, & T. Hasegawa (Eds.), Distributed autonomous robotic systems 5: Proc. 6th International Symposium on Distributed Autonomous Robotic Systems (pp. 299–308). New York: Springer.
Huntsberger, T., & Rose, J. (1998). BISMARC: A biologically inspired system for map-based autonomous rover control. Neural Networks, 11(7–8), 1497–1510.
Ijspeert, A. J., Martinoli, A., Billard, A., & Gambardella, L. M. (2001). Collaboration through the exploitation of local interactions in autonomous collective robotics: The stick pulling experiment. Autonomous Robots, 11(2), 149–171.
Jacobs, R. A. (1995). Methods for combining experts' probability assessments. Neural Comput., 7(5), 867–888.
Jansen, A., van der Smagt, P., & Groen, F. C. A. (1995). Nested networks for robot control. In A. F. Murray (Ed.), Applications of neural networks (pp. 221–239). Dordrecht: Kluwer.
Kaelbling, L. P., Littman, M. L., & Moore, A. W. (1996). Reinforcement learning: A survey. J. Artificial Intelligence Res., 4, 237–285.
Karayiannis, N. B. (1999). Reformulated radial basis neural networks trained by gradient descent. IEEE Trans. Neural Networks, 10(3), 657–671.
Karayiannis, N. B., & Mi, G. W. (1997). Growing radial basis neural networks: Merging supervised and unsupervised learning with network growth techniques. IEEE Trans. Neural Networks, 8(6), 1492–1506.
Kim, J.-O., & Khosla, P. (1992). Real-time obstacle avoidance using harmonic potential functions. IEEE Trans. Robot. Automat., 8(3), 338–349.
Kittler, J., Hatef, M., Duin, R. P. W., & Matas, J. (1998). On combining classifiers. IEEE Trans. Pattern Anal. Machine Intell., 20(3), 226–239.
Kohonen, T. (2000). Self-organizing maps (3rd ed.). New York: Springer.
Koren, Y., & Borenstein, J. (1991). Potential field methods and their inherent limitations for mobile robot navigation. In Proc. IEEE International Conference on Robotics and Automation (ICRA'91) (pp. 1394–1404). New York: IEEE Computer Society Press.
Kuperstein, M. (1991). INFANT neural controller for adaptive sensory-motor coordination. Neural Networks, 4(2), 131–146.
Lagoudakis, M. G., & Maida, A. S. (1999). Robot navigation with a polar neural map: Student abstract. In Proc. 16th National Conference on Artificial Intelligence (AAAI-99) (p. 965). Menlo Park, CA: AAAI Press.
Ensemble of Cooperative EKMs for Complex Robot Motion Tasks
1443
Latombe, J.-C. (1999). Motion planning: A journey of robots, molecules, digital actors, and other artifacts. International Journal of Robotics Research, 18(11), 1119– 1128. Littmann, E., & Ritter, H. (1996). Learning and generalization in cascade network architectures. Neural Comput., 8(7), 1521–1539. Liu, C., Ang Jr., M. H., Krishnan, H., & Lim, S. Y. (2000). Virtual obstacle concept for local-minimum-recovery in potential-field based navigation. In Proc. IEEE International Conference on Robotics and Automation (ICRA’00) (Vol. 2, pp. 983–988). New York: ACM Press. Low, K. H., Leow, W. K., & Ang, Jr., M. H. (2002). A hybrid mobile robot architecture with integrated planning and control. In Proc. 1st International Joint Conference on Autonomous Agents and MultiAgent Systems (AAMAS-02) (pp. 219–226). New York: ACM Press. Low, K. H., Leow, W. K., & Ang, Jr., M. H. (2003). Action selection for single- and multirobot tasks using cooperative extended Kohonen maps. In Proc. 18th International Joint Conference on Artificial Intelligence (IJCAI-03) (pp. 1505–1506). New York: IEEE Press. Low, K. H., Leow, W. K., & Ang, Jr., M. H. (2004). Task allocation via self-organizing swarm coalitions in distributed mobile sensor network. In Proc. 19th National Conference on Artificial Intelligence (AAAI-04) (pp. 28–33). Menlo Park, CA: AAAI Press. Maes, P. (1995). Modeling adaptive autonomous agents. In C. G. Langton (Ed.), Artificial life:An overview (pp. 135–162). Cambridge, MA: MIT Press. Mahadevan, S., & Connell, J. (1992). Automatic programming of behavior-based robots using reinforcement learning. Artificial Intelligence, 55(2–3), 311–365. Martinetz, T. M., Ritter, H. J., & Schulten, K. J. (1990). Three-dimensional neural net for learning visuomotor coordination of a robot arm. IEEE Trans. Neural Networks, 1(1), 131–136. Martinoli, A., Ijspeert, A. J., & Mondada, F. (1999). 
Understanding collective aggregation mechanisms: From probabilistic modelling to experiments with real robots. Robotics and Autonomous Systems, 29(1), 51–63. Mendenhall, W., & Sincich, T. (1994). Statistics for engineering and the sciences (4th ed.). Upper Saddle River, NJ: Prentice Hall. Michel, O. (2004). Cyberbotics Ltd. WebotsTM: Professional mobile robot simulation. International Journal of Advanced Robotic Systems, 1(1), 39–42. Mill´an, J. del R., Posenato, D., & Dedieu, E. (2002). Continuous-action Q-learning. Machine Learning, 49(2–3), 249–265. Moody, J., & Darken, C. J. (1989). Fast learning in networks of locally-tuned processing units. Neural Comput., 1(2), 281–294. Platt, J. (1991). A resource-allocating network for function interpolation. Neural Comput., 3(2), 213–225. Poggio, T., & Girosi, F. (1990). Networks for approximation and learning. Proc. IEEE, 78(9), 1481–1497. Pomerleau, D. A. (1991). Efficient training of artificial neural networks for autonomous navigation. Neural Comput., 3(1), 88–97. Port, R. F., & van Gelder, T. (1995). Mind as motion: Explorations in the dynamics of cognition. Cambridge, MA: MIT Press.
1444
K. Low, W. Leow, and M. Ang, Jr.
Rao, R. P. H., & Fuentes, O. (1998). Hierarchical learning of navigational behaviors in an autonomous robot using a predictive sparse distributed memory. Machine Learning, 31(1–3), 87–113. Rimon, E., & Koditschek, D. E. (1992). Exact robot navigation using artificial potential functions. IEEE Trans. Robot. Automat., 8(5), 501–518. Ritter, H., & Schulten, K. (1986). Topology conserving mappings for learning motor tasks. In J. S. Denker (Ed.), Neural networks for computing (pp. 376–380). Snowbird, UT: American Institute of Physics. Rohanimanesh, K., & Mahadevan, S. (2003). Learning to take concurrent actions. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15 (pp. 1619–1626). Cambridge, MA: MIT Press. Rosenblatt, J. K. (1997). DAMN: A distributed architecture for mobile navigation. J. Expt. Theor. Artif. Intell., 9(2–3), 339–360. Russell, S., & Norvig, P. (1995). Artificial intelligence: A modern approach. Upper Saddle River, NJ: Prentice Hall, Rybski, P. E., Stoeter, S. A., Gini, M., Hougen, D. F., & Papanikolopoulos, N. P. (2002). Performance of a distributed robotic system using shared communications channels. IEEE Trans. Robot. Automat., 18(5), 713–727. Santam´aria, J. C., Sutton, R., & Ram, A. (1998). Experiments with reinforcement learning in problems with continuous state and action spaces. Adaptive Behavior, 6(2), 163–218. Schaal, S., & Atkeson, C. G. (1998). Constructive incremental learning from only local information. Neural Comput., 10(8), 2047–2084. Schaal, S., Atkeson, C. G., & Vijayakumar, S. (2002). Scalable techniques from nonparametric statistics for real time robot learning. Applied Intelligence, 17(1), 49–60. Sharkey, A. J. C., & Sharkey, N. E. (1997). Combining diverse neural nets. Knowledge Engineering Review, 12(3), 231–247. Sharkey, N. E. (1998). Learning from innate behaviors: A quantitative evaluation of neural network controllers. Machine Learning, 31(1–3), 115–139. Shastri, S. V. (Ed.). 
(1999). Field and service robotics [Special issue]. International Journal of Robotics Research, 18(7). Smart, W. D., & Kaelbling, L. P. (2000). Practical reinforcement learning in continuous spaces. In Proc. 17th International Conference on Machine Learning (ICML-00) (pp. 903–910). San Francisco: Morgan Kaufmann. Smith, A. J. (2002). Applications of the self-organising map to reinforcement learning. Neural Networks, 15(8–9), 1107–1124. Sutton, R. (1998). Reinforcement learning: An introduction. Cambridge, MA: MIT Press. Tani, J., & Fukumura, N. (1994). Learning goal-directed sensory-based navigation of a mobile robot. Neural Networks, 7(3), 553–563. Touzet, C. (1997). Neural reinforcement learning for behavior synthesis. Robotics and Autonomous Systems, 22(3–4), 251–281. van der Smagt, P., Groen, F. C. A., & van het Groenewoud, F. (1994). The locally linear nested network for robot manipulation. In Proc. International Conference on Neural Networks (Vol. 5, pp. 2787–2792). New York: IEEE Press. Versino, C., & Gambardella, L. M. (1995). Learning the visuomotor coordination of a mobile robot by using the invertible Kohonen map. In J. Mira & F. Sandoval
Ensemble of Cooperative EKMs for Complex Robot Motion Tasks
1445
(Eds.), Proc. International Workshop on Artificial Neural Networks (pp. 1084–1091). Berlin: Springer. Walter, J., & Ritter, H. (1996). Investment learning with hierarchical PSOM. In D. Touretzky, M. Mozer, & M. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 570–576). Cambridge, MA: MIT Press. Walter, J. A., & Schulten, K. J. (1993). Implementation of self-organizing neural networks for visuo-motor control of an industrial robot. IEEE Trans. Neural Networks, 4(1), 86–95. Wettschereck, D., & Dietterich, T. (1992). Improving the performance of radial basis function networks by learning center locations. In J. E. Moody, S. J. Hanson, & R. P. Lipmann (Eds.), Advances in neural information processing systems, 4 (pp. 1133– 1140). San Mateo, CA: Morgan Kaufmann. Zalama, E., Gaudiano, P., & Coronado, J. L. (1995). A real-time, unsupervised neural network for the low-level control of a mobile robot in a non-stationary environment. Neural Networks, 8(1), 103–123. Zelek, J. S., & Levine, M. D. (1996). SPOTT: A mobile robot control architecture for unknown or partially known environments. In I. Nourbakhsh (Ed.), Planning with incomplete information for robot problems: Papers from the AAAI Spring Symposium (pp. 129–140). Toronto: AAAI Press.
Received March 9, 2004; accepted November 19, 2004.
NOTE
Communicated by Boris Gutkin
Categorization of Neural Excitability Using Threshold Models A. Tonnelier
[email protected] Cortex Project, INRIA Lorraine, Campus Scientifique, Vandoeuvre-lès-Nancy, France
A classification of spiking neurons according to the transition from quiescence to periodic firing of action potentials is commonly used. Nonbursting neurons are classified into two types, type I and type II excitability. We use simple phenomenological spiking neuron models to derive a criterion for the determination of the neural excitability based on the afterpotential following a spike. The crucial characteristic is the existence, for the type II model, of a positive overshoot, that is, a delayed afterdepolarization, during the recovery process of the membrane potential. Our prediction is numerically tested using well-known type I and type II models including the Connor, Walter, & McKown (1977) model and the Hodgkin-Huxley (1952) model. 1 Introduction Despite the large number of ionic mechanisms underlying the initiation of action potentials, a broad class of nonbursting neurons presents two types of excitability (Hodgkin, 1948; Rinzel & Ermentrout, 1998; Izhikevich, 2000). The properties of membrane excitability are determined according to the emerging frequency of repetitive firing. Type I is obtained when repetitive action potentials are generated with an arbitrarily low frequency, whereas in type II, spike trains emerge at a nonzero frequency. The frequency response of a single cell is crucial since it models the input-output relation, that is, the gain function, commonly used in firing-rate descriptions of neural networks. The dynamics of the membrane excitability determines the spike train statistics (Gutkin & Ermentrout, 1998) and is fundamental to understanding how nontrivial dynamics emerge when neurons are coupled in networks (Hansel, Mato, & Meunier, 1995). Previous work on the classification of excitability used bifurcation theory (Ermentrout, 1996; Rinzel & Ermentrout, 1998; Izhikevich, 2000). The bifurcation resulting in the appearance of a stable limit cycle determines the type of excitability.
Typically, type I and II are related to a saddle-node bifurcation on an invariant circle and an Andronov-Hopf bifurcation, respectively. However, this classification is not perfect, and one has to distinguish between the bifurcation of the resting state and the bifurcation of the limit cycle, leading to a complex classification (Izhikevich, 2000). Neural Computation 17, 1447–1455 (2005)
© 2005 Massachusetts Institute of Technology
1448
A. Tonnelier
The purpose of this note is to derive a simple criterion for the classification of neural excitability. Tonnelier and Gerstner (2003) showed that type I and type II neurons can be obtained as a generalization of integrate-and-fire neurons. However, the question of the classification based on the firing rate was not addressed, and the neural mechanisms that distinguish between the two types of excitability were not established. We present an easy and intuitive way to characterize the neural excitability of spiking neurons using the analytical framework of the spike response model. The result is surprisingly simple: type I is obtained when the afterpotential following a spike has a monotonic recovery process, whereas type II membranes present a small depolarization during the recovery. We check the validity of this classification on more complex models and derive some qualitative and quantitative predictions. 2 Type I vs. Type II Excitability of the Spike Response Model The spike response model allows a phenomenological description of spiking neurons (Gerstner & Kistler, 2002). This model approximates the dynamics of detailed biophysical models with great accuracy (Kistler, Gerstner, & van Hemmen, 1997; Jolivet, Lewis, & Gerstner, 2004) and yields a transparent discussion of various neural dynamics (Gerstner, van Hemmen, & Cowan, 1996). In this model, the membrane potential v(t) in response to a constant stimulation is given by

v(t) = Σ_{t^f ∈ F} η(t − t^f) + u_stat(I),

where η(t − t^f) describes the form of a spike and the afterpotential following it. The second term, u_stat(I), models the response of the membrane potential to a constant input current I, that is, the steady-state I−V relationship of the model (to simplify the notation, we will drop the dependence on I). It is convenient (Tonnelier & Gerstner, 2003) to split the kernel η into two parts, η(t − t^f) = η_f(t − t^f) − η_r(t − t_r), where η_f and η_r are two pulse-shaped kernels. The first term, η_f, describes the spike, that is, the abrupt depolarization of the membrane potential, and −η_r models the recovery period that follows the spike, that is, the spike afterpotential. The action potential is triggered at time t = t^f ∈ F, where the set F gives the spike events that are to be taken into account. A spike event occurs if the membrane potential crosses a threshold ϑ from below. The recovery kernel acts at the so-called resetting time t_r = t^f + Δ, where Δ includes the spike duration and an absolute refractory period. The kernel η_f operates on a fast timescale, and we approximate it by a Dirac delta
function. Since the membrane trajectory during a spike reflects the membrane properties and not the input, the kernel η_r is independent of the input. To reproduce the recovery processes of neurons, we consider the following two kernels,

η_r^I(t) = μ_r e^{−t/τ_r} sinh(ω_r t),   (2.1)

η_r^II(t) = μ_r e^{−t/τ_r} sin(ω_r t),   (2.2)
for t > 0 and 0 otherwise, which we call type I and type II recovery kernels, respectively. The parameter μ_r is a scale factor, τ_r is the recovery time constant, and ω_r determines the global shape of the kernel. We require ω_r τ_r < 1 for the type I recovery kernel in order to ensure a decay to 0. Note that these two kernels could be written in a general formalism using complex values of ω_r. We will show the following result: type I and type II recovery kernels describe type I and type II membrane models, respectively. The existence, for the type II kernel, of a negative part (i.e., a positive overshoot of the corresponding membrane potential) is the main difference between the two kernels. The precise form of η_r is not important; a similar result holds for different choices. 2.1 Analytical Treatment. A constant input current generates an indefinite train of spikes if t^f = n/ν, where n is the index of the nth spike and ν is the mean firing rate. Let v_∞(t) be the membrane potential in the repetitive spiking regime. For clarity, we take Δ = 0 in our analytical treatment. We calculate

v_∞(t) = Σ_n δ(t − n/ν) − η_{r,∞}(t) + u_stat,   (2.3)

where η_{r,∞} is the periodic recovery kernel that we calculate using a summation formula. We find

η^I_{r,∞}(t) = μ_r [e^{(T−t)/τ_r} sinh(ω_r t) + e^{−t/τ_r} sinh(ω_r(T − t))] / [2(cosh(T/τ_r) − cosh(ω_r T))]

and

η^II_{r,∞}(t) = μ_r [e^{(T−t)/τ_r} sin(ω_r t) + e^{−t/τ_r} sin(ω_r(T − t))] / [2(cosh(T/τ_r) − cos(ω_r T))]

for 0 ≤ t ≤ T, where T = 1/ν is the interspike interval. The frequency of the periodic firing is obtained from the requirement v_∞(T) = ϑ, which we
rewrite as F(x) = ϑ_e, where x = ω_r ν^{−1},

F^I(x) = sinh(x) / [2(cosh(x) − cosh(αx))],   (2.4)

F^II(x) = sin(x) / [2(cos(x) − cosh(αx))],   (2.5)

ϑ_e is the effective dimensionless threshold defined by ϑ_e = (ϑ − u_stat)/μ_r, and α = 1/(ω_r τ_r). For the type II kernel, α is the ratio between its oscillatory period and its decaying time constant. The type I kernel could be expressed as a difference between two exponentials with two timescales (a rising time τ_1 and a decaying time τ_2), and α represents (τ_1 + τ_2)/(τ_2 − τ_1). In addition to equations 2.4 and 2.5, the neuronal voltage should exceed ϑ only once during one period:

v_∞(t) < ϑ,   t ∈ (0, T).   (2.6)
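As a numerical sanity check (an editorial sketch with arbitrary illustrative parameters, not code from the note), the closed form for η^II_{r,∞} can be compared against a direct summation of η_r over past firing times, and its value at t = T recovers F^II:

```python
import math

def eta_II(t, mu, tau, w):
    # Type II recovery kernel, equation 2.2: mu * exp(-t/tau) * sin(w t) for t > 0.
    return mu * math.exp(-t / tau) * math.sin(w * t) if t > 0 else 0.0

def eta_II_periodic(t, T, mu, tau, w):
    # Closed form of the periodic type II recovery kernel for 0 <= t <= T.
    num = (math.exp((T - t) / tau) * math.sin(w * t)
           + math.exp(-t / tau) * math.sin(w * (T - t)))
    den = 2.0 * (math.cosh(T / tau) - math.cos(w * T))
    return mu * num / den

def F_II(x, alpha):
    # Equation 2.5.
    return math.sin(x) / (2.0 * (math.cos(x) - math.cosh(alpha * x)))

mu, tau, w, T = 1.0, 2.0, 0.7, 5.0  # arbitrary illustrative values

# The closed form agrees with summing eta_r over all past periods.
for t in (0.5, 1.7, 3.2, 4.9):
    direct = sum(eta_II(t + n * T, mu, tau, w) for n in range(200))
    assert abs(direct - eta_II_periodic(t, T, mu, tau, w)) < 1e-9

# Consistency with the firing condition: -eta_{r,inf}(T)/mu = F_II(w T).
assert abs(-eta_II_periodic(T, T, mu, tau, w) / mu
           - F_II(w * T, 1.0 / (w * tau))) < 1e-12
```

The same check works for the type I (sinh) kernel, with cosh in place of cos in the denominator.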
A necessary but not sufficient condition reads −dη_{r,∞}(T)/dt > 0; the voltage increases just before the spike. First, we consider the type I function, equation 2.4. Using the requirement α = 1/(τ_r ω_r) > 1 for the type I kernel, it is straightforward to show that a necessary and sufficient condition for the existence of a repetitive spiking regime is ϑ_e < 0, or equivalently u_stat > ϑ, which states that the stationary potential crosses the threshold; that is, the stationary state disappears. Note that the periodic solution is unique since F^I is monotonically increasing with respect to x. At the critical regime, ϑ_e → 0, periodic firing appears with an arbitrarily low frequency, ν^I → 0, and from equation 2.4, we derive the following logarithmic law for the emerging frequency:

ν^I = (ω_r − 1/τ_r)[ln(−2ϑ_e)]^{−1}.   (2.7)
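The logarithmic law can be checked numerically (an editorial sketch; the parameter values are arbitrary, chosen so that ω_r τ_r < 1): solve F^I(x) = ϑ_e by bisection for a small negative ϑ_e and compare the resulting frequency with equation 2.7.

```python
import math

def F_I(x, alpha):
    # Equation 2.4; monotonically increasing in x for alpha > 1.
    return math.sinh(x) / (2.0 * (math.cosh(x) - math.cosh(alpha * x)))

tau_r, omega_r = 2.0, 0.4          # omega_r * tau_r = 0.8 < 1
alpha = 1.0 / (omega_r * tau_r)    # alpha = 1.25 > 1
theta_e = -1e-6                    # just below the critical regime theta_e -> 0

# Bisection for the root of F_I(x) = theta_e, with x = omega_r / nu.
lo, hi = 1.0, 500.0
for _ in range(200):
    mid = 0.5 * (lo + hi)
    if F_I(mid, alpha) < theta_e:
        lo = mid      # F_I(mid) too negative: root lies to the right
    else:
        hi = mid
nu_exact = omega_r / lo

# Logarithmic law, equation 2.7.
nu_law = (omega_r - 1.0 / tau_r) / math.log(-2.0 * theta_e)

assert abs(nu_exact - nu_law) / nu_law < 0.05  # law is accurate near threshold
```

Both quantities are negative-over-negative, giving a small positive frequency that vanishes as ϑ_e → 0⁻, as stated in the text.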
The critical current is then obtained from the dependence of the stationary state u_stat on I. For instance, if we consider the steady-state I−V curve of the standard integrate-and-fire model, we have u_stat = RI, and we find the well-known critical current I_c = ϑ/R for the emergence of repetitive spiking. Note that the logarithmic law, equation 2.7, of the frequency-current relationship is closely related to the exponential decrease of the recovery kernel η_r. A square-root law is obtained when considering a quadratic decay of the recovery kernel. We now investigate the type II system: F^II(x) = ϑ_e. Using equation 2.5, we show in Figure 1B the locus of existence of the repetitive firing regime. There exists a critical threshold ϑ_e^∗(α) > 0 such that the existence of a periodic spiking solution is obtained for ϑ_e < ϑ_e^∗. Solutions appear in pairs, but one of each pair violates condition 2.6. Periodic firing appears before the vanishing
Figure 1: (A, B) Locus of existence of periodic solutions obtained for (A) the type I and (B) the type II recovery kernel of the spike response model in the (ϑ_e, α) plane. In B, the dotted line indicates whether −η_r(t) crosses the effective threshold (left). The parameter α_HH is derived from the HH model (see below). (C, D) The periodic subthreshold potential of the spike response model (solid lines) that approximates (C) the Connor et al. model and (D) the HH model. The dotted lines represent the short-term memory approximation. The neuron fires a spike (vertical line) when the membrane voltage hits the firing threshold (solid line). The stationary state (dashed line) is shown. Note that in D, the model exhibits bistability between a stable steady state and a stable oscillation. (E, F) Action potentials corresponding to (E, left) the Connor et al. model; (E, right) the Morris-Lecar model; (F, left) the HH model; and (F, right) the FitzHugh-Nagumo model. These models have been stimulated by a short but strong current pulse before t = 0. A subthreshold DC current has also been applied. The horizontal line represents the stationary potential. Models of E are known as type I models, whereas models of F are referred to as type II models. We idealize the spike (dotted lines) with a Dirac delta function, and we fit (dashed lines) the recovery part with the two generic kernels (see the text). The recovery part acts after a delay Δ. Numerical values are (E, left) μ_r = 17 mV, τ_r = 0.1985 ms, ω_r = 4.691 kHz, Δ = 9.5 ms; (E, right) μ_r = 40 mV, τ_r = 6 ms, ω_r = 0.1 kHz, Δ = 40 ms; (F, left) μ_r = 28 mV, τ_r = 6 ms, ω_r = 0.3 kHz, Δ = 5 ms; (F, right) μ_r = 0.88, τ_r = 12, ω_r = 0.11, Δ = 20.5 (dimensionless units).
of the stationary state, and a bistability regime exists for 0 < ϑ_e < ϑ_e^∗, where a stationary state and a periodic solution coexist. Qualitatively, the explanation is based on the existence of a depolarized afterpotential that drives the membrane potential into the superthreshold regime, and therefore repetitive firing appears before the vanishing of the stationary state. At the critical regime, ϑ_e = ϑ_e^∗, it is clear from equation 2.5 that small values of
ν (x infinite) are not solutions; that is, periodic firing emerges with a nonzero frequency. The exponential decay of the recovery kernel implies that the summation over the firing times is dominated by the most recent firing event. Hence, the periodic kernel η_{r,∞} is well approximated at time t by η_r(t). This approximation, reported as the short-term memory approximation (Gerstner & Kistler, 2002), leads to an accurate determination of the emergence of repetitive firing given by −η_r(ν^{−1}) = ϑ_e (see Figure 1B) and fits the exact periodic solution (see Figures 1C and 1D). In this approximation, the frequency of the emerging periodic firing for a type II kernel is given by

ν^II = [Δ + ω_r^{−1}(π + arctan(ω_r τ_r))]^{−1}.   (2.8)
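Plugging in the recovery-kernel parameters fitted to the HH model in Figure 1 (reading ω_r in ms⁻¹ and taking the delay Δ = 5 ms; this numerical sketch is an editorial addition, not part of the note), equation 2.8 lands close to the ≈52 Hz value quoted in section 2.2:

```python
import math

# Kernel parameters fitted to the HH model (Figure 1, panel F left);
# omega_r is read here in ms^-1 and Delta = 5 ms (editorial assumptions).
tau_r = 6.0      # ms
omega_r = 0.3    # 1/ms
delta = 5.0      # ms

# Equation 2.8: the period is approximately the delay plus the location
# of the delayed afterdepolarization (the peak of -eta_r).
t_dad = (math.pi + math.atan(omega_r * tau_r)) / omega_r   # about 14 ms
nu_II_hz = 1000.0 / (delta + t_dad)                        # convert 1/ms to Hz

assert 50.0 < nu_II_hz < 55.0  # consistent with the ~52 Hz value in the text
```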
In other words, the location of the delayed afterdepolarization of type II models could be used as an approximation of the period of the emerging repetitive spiking regime. 2.2 Type I vs. Type II Neural Excitability. Our analysis of the spike response model suggests a simple observable criterion for determining the excitability of spiking neurons based on the recovery process following an action potential elicited by a brief current pulse: the afterpotential of type I models is hyperpolarized, whereas type II models present a delayed afterdepolarization (DAD), also reported as a prolonged depolarized afterpotential. In this section, we aim to establish some connections with detailed models and thus derive some quantitative predictions. In Figures 1E and 1F, we look at the time course of the action potential of popular type I and type II models. In Figure 1E, we show the action potential of models reported as type I: the Connor, Walter, and McKown (1977) model and the Morris-Lecar (1981) model in the type I regime (see Ermentrout, 1996). Figure 1F shows the voltage trajectories of type II models: the Hodgkin-Huxley (HH; 1952) model and the FitzHugh-Nagumo (FitzHugh, 1961; Nagumo, Arimoto, & Yoshizawa, 1962) model. Many papers and books describing the equations and the dynamics of these models are available (Koch, 1999; Gerstner & Kistler, 2002). Since they represent different aspects of nerve cell excitability, these models are widely used as paradigmatic models of action potential generation. Numerically, parameters μ_r, τ_r, ω_r, and Δ are adjusted such that the kernels 2.1 and 2.2 fit the time course of the membrane afterpotential of the detailed neural models (see Figures 1E and 1F). We see that type II kernels numerically fit the afterpotential of the HH and FitzHugh-Nagumo models, whereas type I kernels approximate the afterpotential of the Connor et al. and Morris-Lecar models. The correspondence between the kernels and the detailed models is made near the onset of repetitive firing. Different parameter values do not affect the categorization; the type of kernel
remains unchanged, provided that we work away from the hyperexcitable regime, since at highly depolarized potentials, both type I and type II models can show afterdepolarization. Let us now numerically illustrate some quantitative predictions. We mainly examine the HH model, but a similar analysis can be carried out for the other models. Parameters for the recovery kernel of the HH model are given in Figure 1. Using equation 2.5, solving F^II(x) = ϑ_e, we find an emerging frequency ν^II = 51 Hz, and using equation 2.8, we find the approximate value 52 Hz. These results fit the exact value of the complete model (about 53 Hz at 6.3°C). To go further and find the critical current, one needs to elucidate (1) the steady-state I−V relationship and (2) the threshold behavior. Despite the nonlinearity of the full model, the steady-state membrane depolarization in the subthreshold regime depends linearly on the applied membrane current (Koch, 1999). The numerical fit (simulations not shown) of the steady-state I−V curve is given by

u_stat = u_0 + RI, where u_0 = −65.0 mV, R = 0.7 kΩ·cm²,   (2.9)

for u_stat < ϑ (the stable branch of the I−V curve for the Connor et al. model is also well fitted with equation 2.9 using u_0 = −68 mV). It has been observed that the HH equations exhibit a threshold behavior (Koch, 1999; Gerstner & Kistler, 2002). For the extraction of the threshold, we used a rapid and strong input current—a certain amount of electrical charge is instantaneously delivered to the membrane—and we find a voltage threshold for spike initiation given by ϑ = −58.2 mV (see also Noble & Stein, 1966). Then, we predict that the stationary state destabilizes at I = 9.4 µA/cm² (using u_stat = ϑ), whereas the critical current of emerging repetitive firing is given by I_c = 6.6 µA/cm² (using equation 2.5) and I_c = 5.9 µA/cm² in the short-term memory approximation. Our result emphasizes the difference between voltage threshold and current threshold (Koch, Bernander, & Douglas, 1995). If we define the current threshold, also referred to as the rheobase, as the critical value of a sustained current initiating action potentials, the type II model leads to a corresponding membrane potential (given by the steady-state I−V curve) below the voltage threshold ϑ, whereas these two thresholds coincide for type I models.
with the classification based on bifurcation theory. Indeed, the oscillations generated by the Hopf bifurcation produce a DAD that leads to the type II response predicted by bifurcation theory and by our criterion. However, we stress that the existence of a DAD does not require the existence of subthreshold oscillations. In fact, damped oscillations are related to the bifurcation of the resting state, which can be different from the bifurcation of the limit cycle, as is the case when bistability occurs, that is, the coexistence of a stable limit cycle and a stable resting state. The existence of the DAD is related to the limit cycle bifurcation. Our approach is highly simplified and neglects many aspects of neuronal dynamics. We used a threshold for spike initiation and neglected the other nonlinear processes for the generation of spikes. However, our result could be used as a criterion for the classification of neural excitability of detailed models away from their hyperexcitable regime. In conductance-based models, the depolarized spike afterpotential appears because of an interplay between the subthreshold voltage-gated currents. The membrane potential during this stage is mainly driven by the dynamics of potassium current(s). Therefore, we suggest that the type I or type II excitability is mainly determined by the voltage-dependent potassium current(s). This intuition is corroborated by the transition between type I and type II excitability when changing the potassium dynamics of the Morris-Lecar model (Rinzel & Ermentrout, 1998). More precisely, it can be shown that the transition from a type I to a type II Morris-Lecar model can be driven solely by changing the potassium activation curve. Therefore, as suggested by our analysis, the neural excitability of the Morris-Lecar model could be characterized by observing its spike recovery. Note also that the main difference between the HH model and the Connor et al.
model is the existence of an additional A-type potassium current leading to a type I response in the Connor et al. model. We stress that other mechanisms can be put forward to explain excitability changes, such as the existence of a transient calcium conductance, but the potassium currents are probably the most commonly encountered mechanism that determines the neural excitability. Acknowledgments I thank Wulfram Gerstner for valuable suggestions. References Connor, J. A., Walter, D., & McKown, R. (1977). Neural repetitive firing: Modifications of the Hodgkin-Huxley axon suggested by experimental results from crustacean axons. Biophys. J., 18, 81–102. Ermentrout, G. B. (1996). Type I membranes, phase resetting curves, and synchrony. Neural Computation, 8, 979–1001.
FitzHugh, R. (1961). Impulses and physiological states in theoretical models of nerve membrane. Biophys. J., 1, 445–466. Gerstner, W., van Hemmen, J. L., & Cowan, J. D. (1996). What matters in neuronal locking. Neural Computation, 8, 1653–1676. Gerstner, W., & Kistler, W. (2002). Spiking neuron models. Cambridge: Cambridge University Press. Gutkin, B. S., & Ermentrout, G. B. (1998). Dynamics of membrane excitability determine interspike interval variability: A link between spike generation mechanisms and cortical spike train statistics. Neural Computation, 10, 1047–1065. Hansel, D., Mato, G., & Meunier, C. (1995). Synchrony in excitatory neural networks. Neural Computation, 7, 307–335. Hodgkin, A. L. (1948). The local electric changes associated with repetitive action in a non-medullated axon. J. Physiol. (Lond.), 107, 165–181. Hodgkin, A. L., & Huxley, A. F. (1952). A quantitative description of membrane current and its application to conduction and excitation in nerve. J. Physiol. (Lond.), 117, 500–544. Izhikevich, E. M. (2000). Neural excitability, spiking, and bursting. Int. J. Bifurcation Chaos, 10, 1171–1266. Jolivet, R., Lewis, T., & Gerstner, W. (2004). Generalized integrate-and-fire models of neuronal activity approximate spike trains of a detailed model to a high degree of accuracy. J. Neurophysiol., 92, 959–976. Kistler, W. M., Gerstner, W., & van Hemmen, J. L. (1997). Reduction of the Hodgkin-Huxley equations to a single-variable threshold model. Neural Computation, 9, 1015–1045. Koch, C. (1999). Biophysics of computation. New York: Oxford University Press. Koch, C., Bernander, Ö., & Douglas, R. J. (1995). Do neurons have a voltage or a current threshold for action potential initiation? J. Comp. Neurosci., 2, 63–82. Morris, C., & Lecar, H. (1981). Voltage oscillations in the barnacle giant muscle fiber. Biophys. J., 35, 193–213. Nagumo, J. S., Arimoto, S., & Yoshizawa, S. (1962). An active pulse transmission line simulating nerve axon. Proc.
IRE, 50, 2061–2070. Noble, D., & Stein, R. B. (1966). The threshold conditions for initiation of action potentials by excitable cells. J. Physiol., 187, 129–162. Rinzel, J. M., & Ermentrout, G. B. (1998). Analysis of neuronal excitability and oscillations. In C. Koch & I. Segev (Eds.), Methods in neuronal modeling (2nd ed.). Cambridge, MA: MIT Press. Tonnelier, A., & Gerstner, W. (2003). Piecewise linear differential equations and integrate-and-fire neurons: Insights from two-dimensional membrane models. Phys. Rev. E, 67, 021908.
Received March 4, 2004; accepted December 6, 2004.
LETTER
Communicated by George Gerstein
Theory of the Snowflake Plot and Its Relations to Higher-Order Analysis Methods Gabriela Czanner
[email protected] Neuroscience Statistics Research Laboratory, Massachusetts General Hospital and Harvard Medical School, Boston, MA 02114, U.S.A.
Sonja Grün
[email protected] Institute for Biology–Neurobiology, Free University, Berlin, 14195, Germany
Satish Iyengar
[email protected] Department of Statistics, University of Pittsburgh, Pittsburgh, PA 15260, U.S.A.
The snowflake plot is a scatter plot that displays relative timings of three neurons. It has had rather limited use since its introduction by Perkel, Gerstein, Smith, and Tatton (1975), in part because its triangular coordinates are unfamiliar and its theoretical properties are not well studied. In this letter, we study certain quantitative properties of this plot: we use projections to relate the snowflake plot to the cross-correlation histogram and the spike-triggered joint histogram, study the sampling properties of the plot for the null case of independent spike trains, study a simulation of a coincidence detector, and describe the extension of this plot to more than three neurons.

1 Introduction

The snowflake plot (or joint configuration scatter diagram) of Perkel, Gerstein, Smith, and Tatton (1975) is a scatter plot that displays the relative timings of firings from three simultaneously recorded neurons. It assumes that the relationships between the neurons can be fully described by the differences between their firing times. The plot uses a triangular coordinate system that treats the neurons symmetrically, and it is bounded by a hexagonal box—hence its name. A point in the plot corresponds to the three differences between the firing times of the three neurons. Neural Computation 17, 1456–1479 (2005)
© 2005 Massachusetts Institute of Technology
Theory of the Snowflake Plot
1457
Perkel et al. (1975) studied the geometry of the plot and ran many simulations to show the patterns that appear in the snowflake plot under different dependence structures; however, there has not yet been a rigorous theoretical study of the plot's properties. One possible reason for the limited interest in the snowflake plot is that it is rather computer intensive; only recently have computers become powerful enough to allow quick construction of this plot and to support simulation studies of its properties. There are also concerns about whether the snowflake plot can be helpful in analyzing data. One reason is that it can contain artifacts that complicate interpretation and make it more difficult to understand and use. In addition, the triangular system used to plot the snowflake is less familiar than the Cartesian system, so it is not immediately clear what patterns to expect for a given circuit of three neurons; this is especially hard for more complex circuits. Nevertheless, the snowflake plot can be helpful (Perkel et al., 1975) as a potential screening device for higher-order correlations, which are hard to detect with other displays. An important example is coincidence detection (Perkel et al., 1975), in which a neuron fires in response to receiving (nearly) coincident spike inputs (Abeles, 1982b; see section 5 below). The simplest circuit that resembles a coincidence detector is a neuron that receives inputs from two neurons, with the threshold of the receiving neuron such that two coincident input spikes suffice to reach it. Such a relation of two synchronous spikes followed (with some small delay) by an output spike cannot be detected by mere pairwise analysis, for example, using a cross-correlation histogram (CCH). Instead, methods are required that allow the study of more than two neurons at a time, such as the snowflake plot.
How the snowflake plot works for excitatory or inhibitory connections has been studied using simulations (Perkel et al., 1975). If neuron A excites neuron B and if a third neuron C is independent of the others, then there will be an excess of cases when a spike from neuron A is followed by a spike from neuron B. This will be reflected in the snowflake plot as a narrow band of increased density of the points. Note, however, that for each pair of spikes from neurons A and B, we need a spike (although unrelated) from neuron C to draw the point in the snowflake. Thus, we need neuron C to fire persistently to bring out the relationship between A and B in the snowflake plot (Perkel et al., 1975). For this reason, the authors suggest the use of the snowflake plot in addition to the CCH and autocorrelation histogram. A related development is the joint peristimulus time scatter diagram of Gerstein and Perkel (1969, 1972) and joint peristimulus time histogram (JPSTH) of Aertsen, Gerstein, Habib, and Palm (1989) for two neurons that are subject to repeated stimulation. The same concept is used by Prut et al. (1998) for detecting synchrony among three neurons. Here the trigger event is a spike emitted by a neuron instead of a stimulus. Prut et al. (1998) call the resulting matrix of raw counts a threefold correlation matrix, or the counts matrix instead of JPSTH. Since the term counts matrix is a rather
generic name and since it is a histogram, we will call it the spike-triggered joint histogram (STJH) instead. Finally, in their work, the three neurons are treated asymmetrically, with one neuron playing the role of a trigger and the other two neurons being the reference units. In that case, if two neurons fire after the third neuron fires and if we do not have prior knowledge about the connections between neurons, we may need three separate STJHs to detect the dependence structure. While this is not difficult for just three neurons, it can become cumbersome when studying many more neurons in groups of three. Many of the ideas that we explore in this article about projection to explain the relationship between different plots and extensions to more than three neurons were qualitatively stated in earlier work (Perkel et al., 1975; Gerstein & Perkel, 1972; Kristan & Gerstein, 1970; Abeles & Gerstein, 1988). This article provides quantitative results that put the early qualitative work on a rigorous mathematical basis. After reviewing some basics in section 2, we show in section 3 that the CCH and the snowflake plot are results of orthogonal projections—the CCH from two- to one-dimensional space and the snowflake plot from three- to two-dimensional space. In section 4, we show that the snowflake plot is a generalization of the STJH in the sense that it treats neurons symmetrically, because the STJH can be viewed as the result of a nonorthogonal projection from three- to two-dimensional space; finally, we show that the snowflake is the union of all three projections (the STJHs of the three neurons), so it gives us the same information as the three STJHs at once. Hence, the snowflake plot can simplify computations. In section 5, we study in detail the example of a simulated convergent circuit with a coincidence detector, using the snowflake.
In section 6, we derive analytically the distribution of the points on the snowflake plot under the null assumption that the neurons fire independently from three homogeneous Poisson processes, and we discuss its properties. In section 7 we discuss the geometry of a higher-dimensional version of the snowflake plot for four or more neurons, and we end with a discussion in section 8.

2 Snowflake Plot Basics

We begin with a brief account of the geometry and construction of the snowflake plot; this material is extracted from Perkel et al. (1975). Let A, B, and C denote the times of spikes from three neurons. Since (B − A) + (C − B) + (A − C) = 0, we can plot the point (A, B, C) on a plane using a triangular coordinate system (see point P in Figure 1A). The Cartesian coordinates of the plotted point P are
( (2C − (A + B))/√3, B − A ).    (2.1)
Figure 1: Geometry of the snowflake plot. (A) The triangular coordinate system for the snowflake plot. (B) The coincidence lines (solid) divide the hexagon into six triangles. Each triangle corresponds to a particular set of timing sequences: the top triangle contains sequences ACB, then to the right ABC, BAC, BCA, CBA, and CAB.
Figure 1B shows again the axes of the triangular system. The lines perpendicular to them (solid) are called coincidence lines; for example, on the B − A = 0 line, neurons A and B fire at the same time. The solid lines (see Figure 1B) divide the hexagon into six sectors. Each sector of the snowflake is characterized by the order of firings; for example, ABC refers to the sector representing triples in which A precedes B, which in turn precedes C. Furthermore, each point on the snowflake plot corresponds to a precise sequence of the three neurons A, B, and C (see Figure 1B), defined by the order of the three neurons and two time differences between them. Following Perkel et al. (1975), we often consider only those triples (A, B, C) that satisfy

|B − A| < L,  |C − B| < L,  and  |A − C| < L,
(2.2)
for some time interval, or span L, that is shorter than the recording period τ . The plot of such points will again form a snowflake; that is, they will all lie inside a smaller hexagon centered at the origin. Such a restriction is useful when the dependence among the neurons that is of interest is over small time intervals.
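The construction of section 2 is easy to state in code. The following Python sketch (illustrative only; the function names are our own) maps a spike triple to its snowflake coordinates via equation 2.1 and applies the span restriction of equation 2.2:

```python
import math
from itertools import product

def snowflake_xy(a, b, c):
    """Cartesian coordinates of the snowflake point for the spike
    triple (A, B, C), following equation 2.1."""
    return ((2 * c - (a + b)) / math.sqrt(3), b - a)

def within_span(a, b, c, span):
    """Span restriction of equation 2.2: all pairwise differences
    must be smaller than the span L."""
    return abs(b - a) < span and abs(c - b) < span and abs(a - c) < span

def snowflake_points(spikes_a, spikes_b, spikes_c, span=None):
    """All plotted points, one per spike triple; optionally
    restricted to triples within the span."""
    return [snowflake_xy(a, b, c)
            for a, b, c in product(spikes_a, spikes_b, spikes_c)
            if span is None or within_span(a, b, c, span)]
```

With the spike times of Figure 2 (A: 1, 5, 9; B: 2, 4, 7, 9; C: 1, 2, 5, 9), the unrestricted plot has 3 × 4 × 4 = 48 points, and a triple coincidence such as (5, 5, 5) lands at the origin.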
3 Snowflake Plot Generalizes the CCH

Aertsen et al. (1989) show the correspondence between the CCH and the JPSTH. The CCH can be viewed as an orthogonal projection from the
Figure 2: Orthogonal projection of points of the snowflake plot onto the vertical axis. (A) P = ((2C − B − A)/√3, B − A), P′ = B − A. (B) The snowflake plot and the histogram of all projected points. Neuron A fires at times 1, 5, 9; B fires at 2, 4, 7, 9; C fires at 1, 2, 5, 9. The total time is τ = 10. (C) The raw CCH of A and B. The histogram from B and the raw CCH have the same shape; they are equal up to a constant multiple, nC = 4, which is the number of times neuron C fired.
two-dimensional space spanned by the times when neurons A and B fired onto the one-dimensional space spanned by the vector (−1, 1)—that is, onto the hyperplane x + y = 0—up to a scale constant 1/√2. The projection is simply

B − A = (−1, 1) (A, B)ᵀ.
Similarly, the snowflake plot can be considered an orthogonal projection from the cube [0, τ]³ onto the plane x + y + z = 0:

( (2C − B − A)/√3, B − A )ᵀ = [ −1/√3  −1/√3  2/√3 ; −1  1  0 ] (A, B, C)ᵀ.
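The two projections can be checked numerically. The sketch below (our own helper names) builds the 2 × 3 projection matrix and verifies that its rows are orthogonal, have norm √2, and are orthogonal to (1, 1, 1), so that it indeed projects onto the plane x + y + z = 0:

```python
import math

S3 = math.sqrt(3)
# Rows of the projection matrix taking (A, B, C) to snowflake coordinates.
P = [[-1 / S3, -1 / S3, 2 / S3],
     [-1.0, 1.0, 0.0]]

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def project(triple):
    """Snowflake coordinates ((2C - B - A)/sqrt(3), B - A) of a triple."""
    return [dot(row, triple) for row in P]
```

Orthogonality to (1, 1, 1) is exactly the statement that a common shift of all three spike times leaves the plotted point unchanged.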
Thus, the snowflake plot is a generalization of the CCH from two to three neurons. This approach suggests that if we can find the joint spike density function in the cube, we can—at least in principle—then find the induced density of the projection on any hyperplane, for example, in the snowflake plot; in section 6, we use this approach to study the null case of three independent Poisson spike trains. Moreover, we can project the snowflake plot onto, for example, the vertical axis (see Figure 2A) and create a histogram of the points in the projection. Then the number of points occurring in the interval (t1, t2] is equal to the number of points in the cross-correlation histogram multiplied by nC, the number of times neuron C fired (which does not depend on t1, t2). The reason is that each possible difference (B − A) is represented by nC points in the
Figure 3: The projection when L = 5 < τ = 10. Here we are interested only in the points within the span L = 5, which is illustrated by the hexagon. All points within the hexagon define the snowflake plot. All other points should not be plotted; however, for illustrative purposes, we draw them here. Neuron A fires at times 1, 5, 9; B fires at 2, 4, 7, 9; C fires at 1, 2, 5, 9; the total time is τ = 10. This histogram is not equal to the raw CCH in Figure 2C.
snowflake plot. For illustration, see Figure 2B, where neuron C fired four times: the histogram of the projection is hence equal to the raw-counts CCH (of A and B) multiplied by 4. However, the details are different when we work with the plot of points within the span L, with L < τ. Consider again the projection of the points onto the vertical axis and a histogram of the points in that projection. Then the projection of the snowflake plot is not equal to the CCH (up to a constant multiple) (see Figure 3). The reason is that each difference (B − A) is plotted in this smaller snowflake only if there is a close spike (within span L) from the third neuron C.

4 Three STJHs Embedded in the Snowflake Plot

Consider now the case of three neurons, where one of the three plays the role of a stimulus. Prut et al. (1998) used the idea of the JPSTH for detecting synchrony among three neurons. Here we will show that the snowflake plot can be divided into three parts, each corresponding to one STJH. This suggests that the plot can be considered a generalization of the STJH that treats the three neurons symmetrically. This can be beneficial if we do not have prior information about which neuron is a trigger, for example, when many neurons are recorded and examined in groups of three. The analysis of spatiotemporal patterns by Prut et al. (1998) was motivated by predictions of the synfire chain model (Abeles, 1982a, 1991): “In such a model, every time the chain is activated, the participating cells will fire in a sequential manner. The temporal spacing between the spikes will
Figure 4: Precise firing sequence (PFS). Here, S1, S2, and S3 are three different neurons.
Figure 5: The geometry of the STJH where neuron A is the trigger. For P, we have C − A > B − A; hence C > B, and the PFS is ABC. For Q, we have C − A < B − A; hence C < B, and the PFS is ACB.
correspond to the relative location of the parent cells along the chain. Such a sequence of spikes is termed a precise firing sequence (PFS)” (Prut et al., 1998, pp. 2857–2858). Studies have detected PFSs in cortical activity of the frontal lobe (Abeles, Bergman, Margalit, & Vaadia, 1993; Villa & Fuster, 1992). We will use Prut et al.'s notation. A precise firing sequence is defined by its unit composition and the time delays between spikes. The three units are denoted by S1, S2, S3. Unit S1 fires first, unit S2 fires t1 time units after S1 fires, and unit S3 fires t2 time units after S1 fires, where t1 < t2 (i.e., t1 = S2 − S1, t2 = S3 − S1; see Figure 4). So the PFS can be written as (S1, S2, S3; t1, t2). Call the first unit the trigger and the other two units the reference units. Units S1, S2, S3 are either three different neurons, or they can represent one neuron. For three neurons A, B, and C, assume that A is a trigger (i.e., S1 = A). We can plot C − A on the x-axis and B − A on the y-axis (see Figure 5). All such points create a spike-triggered type of diagram (scatter plot). If we create a grid over it and count the number of points in each square, we get the STJH. Moreover, the triangle to the right of the main diagonal represents the PFSs (A, B, C; t1 = B − A, t2 = C − A) (since C − A > B − A, so B < C; see point P in Figure 5); analogously, the other triangle represents the PFSs (A, C, B; t1 = C − A, t2 = B − A) (see point Q in Figure 5). The PFSs also
Figure 6: Solid lines divide the snowflake plot into three parts, each defined by a trigger unit (e.g., B means that B is a trigger).
divide the snowflake plot into triangles (see the six triangles ABC, ACB, BAC, BCA, CAB, CBA in Figure 1B). In this respect, the snowflake plot contains the same information as the three STJHs—where each STJH is defined by a different trigger neuron—at once. We now show that there is a simple transformation from the snowflake plot onto the STJHs. Let us divide the snowflake plot into three parts (see parts A, B, and C in Figure 6). Each part is determined by the trigger neuron, the one that fired first in the sequence. From the definition of the snowflake plot, there is a one-to-one correspondence between part A of the snowflake plot and the STJH from Figure 5. Note that each point in our STJH has coordinates [x_STJH, y_STJH] = [C − A, B − A]. Then

x_SNOW = (1/√3)[2(C − A) − (B − A)] = (2/√3) x_STJH,A − (1/√3) y_STJH,A    (4.1)

and

y_SNOW = B − A = y_STJH,A,    (4.2)

since

(1/√3)(2C − B − A) = (1/√3)[(C − B) + (C − A)]
                   = (1/√3)[(C − A) − (B − A) + (C − A)]
                   = (1/√3)[2(C − A) − (B − A)],

where (x_STJH,A, y_STJH,A) are coordinates of the point in the STJH with A being the trigger, and (x_SNOW, y_SNOW) are coordinates of the same point in the
Figure 7: Transformation (see equation 4.3) of the points in the STJH to the snowflake. (A) Points (0, 0), (1, 0), (0, 1) and (1, 1). (B) The four points after scaling. (C) The four points after scaling and shearing.
snowflake plot. In the same way, we can get a transformation between parts B and C and the corresponding STJHs. Note that our transformation matrix (in equations 4.1 and 4.2) is

[ 2/√3  −1/√3 ; 0  1 ] = [ 2/√3  0 ; 0  1 ] [ 1  −1/2 ; 0  1 ],    (4.3)
the composition (product) of two affine transformations: a shear that preserves the horizontal axis with shear factor −1/2, composed with a scaling (in this case, rescaling the horizontal axis by the factor 2/√3). Hence, it is also an affine transformation. Figure 7 provides an example. Figure 7A shows the four points (0, 0), (0, 1), (1, 0), and (1, 1). Figure 7B shows these points after applying the scaling, where the point (1, 0) transforms into (1.155, 0), because
[ 2/√3  0 ; 0  1 ] (1, 0)ᵀ = (1.155, 0)ᵀ.
Similarly, the points (0, 0), (0, 1), and (1, 1) transform into (0, 0), (0, 1), and (1.155, 1), respectively. Figure 7C shows the four points after applying the second transformation, the shearing, which leaves (1.155, 0) unchanged. Also, the points (0, 0), (0, 1), and (1.155, 1) transform into (0, 0), (−0.5, 1), and (0.655, 1), respectively. In summary, each of the three STJHs is represented by one-third of a snowflake plot; hence, the three STJHs give exactly the same information as the snowflake plot.
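The transformation of equation 4.3 can be verified directly: applying it to the STJH coordinates (C − A, B − A) of any triple with trigger A must reproduce the snowflake coordinates of equation 2.1. A small sketch (the helper names are ours):

```python
import math

S3 = math.sqrt(3)

def stjh_to_snowflake(x_stjh, y_stjh):
    """Affine map of equation 4.3 applied to STJH coordinates
    (C - A, B - A), with A as the trigger."""
    return (2 / S3 * x_stjh - 1 / S3 * y_stjh, y_stjh)

def snowflake_xy(a, b, c):
    """Direct snowflake coordinates, equation 2.1."""
    return ((2 * c - (a + b)) / S3, b - a)

# The matrix factors into a scaling and a shear, as in the text.
def scale(p):
    return (2 / S3 * p[0], p[1])

def shear(p):
    return (p[0] - p[1] / 2, p[1])
```

Composing the shear with the scaling (shear acting first, matching the matrix product in equation 4.3) reproduces the full map.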
Figure 8: Simulation of the circuit with the coincidence detector neuron. The panels show the spike trains of neurons A and B, the coincidence-induced spikes, the spikes after applying the latency and after jittering, the background spike train, and the superposition that forms the spike train of C.
5 Example

In this section, we show the use of the snowflake with an example: a simulated circuit with a coincidence detector neuron. Such a circuit has already been discussed by Perkel et al. (1975). However, they used an integrate-and-fire (IF) model to simulate the data. We use a different approach: we inject the coincidence-induced spikes into the spike train of the coincidence detector, as described below and depicted in Figure 8. We define a coincidence detector as a neuron (neuron C) that fires (with a latency of 3 to 4 ms) if it sees an A spike followed by a B spike, or vice versa, within 5 ms. Also, we assume that the coincidence detector is not perfect; that is, it can fire (with some low probability) even when there is no coincidence. In our simulations, we generated A and B independently as Poisson processes with the same intensity, 15 spikes per second. Neuron C was made to fire (with a latency of 3 to 4 ms) if it saw a near coincidence (within 5 ms) between A and B spikes. Finally, we injected (see Grün, Diesmann, & Aertsen, 2002; Kuhn, Aertsen, & Rotter, 2003) low-intensity background
spikes into the C spike train. (For more details about the simulation, see appendix A.) The simulated data are shown in Figure 9. The coincidence detector is an example of a PFS in the following sense. This circuit can be described by a set of two kinds of PFSs: (A, B, C; t1, t2) and (B, A, C; t3, t4). In (A, B, C; t1, t2), neuron A fires first; in (B, A, C; t3, t4), neuron B fires first. Figure 9 shows the raw CCHs, raw STJHs, and the corresponding snowflake plot. The first STJH (see Figure 9E) shows that the PFSs (A, B, C; t1, t2) take place where 0 < t1 < 5 and t2 ∈ (t1 + 3, t1 + 4) ms. The second STJH (Figure 9F) shows that (B, A, C; t3, t4) also takes place, with t3 < 5 and t4 ∈ (t3 + 3, t3 + 4). The first STJH corresponds to part A (see Figure 6) of the snowflake plot (containing the upper part of the chevron), the second corresponds to part B (containing the lower part of the chevron), and the third to part C. This example illustrates the correspondence among the three STJHs and the snowflake plot.

6 The Null Distribution

In this section, we analyze the null case of the three neurons firing independently according to homogeneous Poisson processes, with possibly different rates. In practice, the null distribution can then be subtracted from data to detect deviations from it in the snowflake plot. In appendix C, we calculate the mean and variance of counts in bins of the cube [0, τ]³, which allows us to study the variability of the snowflake. In appendix B, we show that the joint spike intensity function over the snowflake plot is λA λB λC f_X,Y(x, y), where λA, λB, and λC are the intensities of neurons A, B, and C, respectively:

f_X,Y(x, y) =
  (√3/(2τ³)) (τ − y)                in sector ACB
  (√3/(2τ³)) (τ − (√3/2)x − (1/2)y) in sector ABC
  (√3/(2τ³)) (τ − (√3/2)x + (1/2)y) in sector BAC
  (√3/(2τ³)) (τ + y)                in sector BCA
  (√3/(2τ³)) (τ + (√3/2)x + (1/2)y) in sector CBA
  (√3/(2τ³)) (τ + (√3/2)x − (1/2)y) in sector CAB,
τ is the recording period, and (x, y) are the Cartesian coordinates used in equation 2.1. The density f_X,Y has a tent-like shape (see Figure 10A). To illustrate this result, we simulated three independent neurons with individual intensities of 10 spikes per second and recording period 15 seconds. The neurons fired
Figure 9: Simulated coincidence detector. Neurons A and B are from two independent Poisson processes with intensities of 15 spikes per second. If neuron C sees an A spike followed by a B spike within 5 ms, then C fires within 3 to 4 ms after the B spike. C also fires if it sees a B spike followed by an A spike (see appendix A). The total time of the simulation is 1 minute; neuron A fired 897 times and neuron B fired 932 times. Neuron C fired 125 times, with 12 background and 113 coincidence-induced spikes. (A) Raw CCH of A and B. (B) Raw CCH of B and C. (C) Raw CCH of A and C. (D) Snowflake plot of the simulated data. (E) Spike-triggered scatter diagram of B and C (A is the trigger); the STJH can be obtained as a histogram over the plot. (F) Spike-triggered scatter diagram of A and C (B is the trigger). (G) Spike-triggered scatter diagram of A and B (C is the trigger).
Figure 10: (A) The density f_X,Y for the independence case with τ = 15 sec. The value at the center is 0.003849. (B) The 3D histogram for the independence case with τ = 15 sec. Data are obtained from a simulation of three independent Poisson processes with intensities of 10 spikes per second. The neurons A, B, and C fired 146, 147, and 152 times, respectively. The bin size is 1 ms². The height of each bin is equal to the number of points in the snowflake plot divided by the total number of points, 146 × 147 × 152. The value at the center is 0.003774. (C) Contour plot of the 3D histogram in B.
146, 147, and 152 times. We created bins over the snowflake plot and plotted the relative counts (see the histogram and contour plot in Figures 10B and 10C). We also derived the joint spike intensity function over the span L, λA λB λC f_X,Y|L(x, y), where

f_X,Y|L(x, y) = [ τ³ / ( 3L² (τ − (2/3)L) ) ] f_X,Y(x, y) I{(x, y) ∈ span L}

and I is the indicator function. Figure 11 shows the intensity function for
Figure 11: The density function f_X,Y|L for the independence case; τ = 15 sec, L = 1 sec. The maximum, at the origin, is 0.3021.
the case of L = 1 sec and total recording time 15 sec. It is not constant, and it has a maximum at the origin. However, the shorter L is, or the larger τ is, the more nearly constant the density appears near the origin. This explains the observation in Perkel et al. (1975) that the distribution of the points in the snowflake plot should have approximately constant density over the hexagon. To assess the applicability of this null distribution result to trains other than the Poisson, we also simulated three independent spike trains, this time with gamma(α, β) distributed interspike intervals, of orders α = 2 and 3 (the Poisson train has intervals that are gamma of order 1) and scale parameter β = 100/α, so that the mean interspike interval in all three cases is 100 ms. The three-dimensional histograms and the corresponding contour plots are given in Figure 12. Note that both resemble those from the Poisson process. Note also that as the order increases, the contours appear smoother; this may reflect the fact that the coefficient of variation of the gamma, 1/√α, decreases as α increases. Furthermore, our theoretical calculations in appendix C show that for Poisson spike trains, the expected number of points in any small cube contained in [0, τ]³ depends on the intensities only through their product λA λB λC, which is an overall measure of the density of points in the cube, and that the variance depends on the intensities separately. We also show in appendix C that for a fixed value of this product, the variance of the number of points in any small cube is minimized if the three intensities are equal. Because the snowflake plot is a linear projection of the points in the cube onto a plane, these qualitative properties are inherited by the plot. This supports the observation in Perkel et al. (1975) that the points in the snowflake appear more random over the hexagon if the neurons' spiking intensities are equal. It also suggests that large differences between intensities cause larger variability across samples.
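The center value 0.003849 reported in Figure 10 and the value 0.3021 shown in Figure 11 can be reproduced from the formulas of this section. The sketch below (our own function names; the span normalization is read off the formula for f_X,Y|L) evaluates the tent density and its span-restricted version at the origin:

```python
import math

def f_center(tau):
    """Center value of the null density f_X,Y: all six sector
    expressions reduce to sqrt(3)/(2 tau^3) * tau at the origin."""
    return math.sqrt(3) / (2 * tau ** 2)

def span_constant(tau, span):
    """Normalizing constant 3 L^2 / tau^3 * (tau - 2L/3) appearing
    in the span-restricted density."""
    return 3 * span ** 2 / tau ** 3 * (tau - 2 * span / 3)

def f_span_center(tau, span):
    """Center value of f_X,Y|L, the span-restricted density."""
    return f_center(tau) / span_constant(tau, span)
```

With τ = 15 sec the tent density peaks at √3/450 ≈ 0.003849, and with L = 1 sec the span-restricted density peaks at about 0.3021, matching the figures.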
Figure 12: Two simulations of three independent gamma trains and the corresponding 3D histograms and contour plots on the snowflake plot. Total simulation time is τ = 15 sec. The two simulations have different parameters of scale and order; however, they give the same mean interspike interval, 100 ms (as was also the case in Figure 10). (A) Trains simulated from gamma (2,100/2). The neurons fired 159, 148, and 136 times. (B) Trains simulated from gamma (3,100/3). The neurons fired 152, 148, and 138 times.
7 Geometry for More Than Three Neurons

Simultaneous recordings of more than three neurons are increasingly common; see, for example, Beggs and Plenz (2004), Buzsaki (2004), Hoffman and McNaughton (2002), and Warren, Fernandez, and Normann (2001). Hence, we explored the idea of generalizing the snowflake to more neurons. In this section, we show how the data from k neurons can be projected onto a (k − 1)-dimensional space, where k ≥ 4, assuming that relative spike timings contain all information about the dependencies among the neurons. Although we cannot visualize the projection (except for four neurons), an analysis similar to the three-neuron case is possible. In particular, one can derive the joint density in the projection and subtract it from the projected data to search for patterns. This can be especially useful for exploratory analysis.
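The k-neuron projection developed in this section can be sketched in code as follows (function names are ours): the rows are scaled Helmert vectors, pairwise orthogonal and of norm √2, and the angle between the projected (k − 1)-way coincidence points comes out as arccos(−1/(k − 1)):

```python
import math

def projection_matrix(k):
    """(k-1) x k matrix of scaled Helmert rows: j leading ones,
    then -j, then zeros, normalized so each row has norm sqrt(2)."""
    rows = []
    for j in range(k - 1, 0, -1):
        row = [1.0] * j + [-float(j)] + [0.0] * (k - j - 1)
        norm = math.sqrt(j * (j + 1))  # sqrt(j + j^2)
        rows.append([math.sqrt(2) * r / norm for r in row])
    return rows

def project(matrix, point):
    return [sum(r * x for r, x in zip(row, point)) for row in matrix]
```

For k = 3 this reproduces the snowflake projection of section 3 up to signs and row order.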
In the case of three neurons (i.e., in the snowflake plot), the angle between the axes of the triangular system is arccos(−1/2) = 120°. From section 3, we have that the snowflake plot is the result of projecting (up to a constant 1/√2) the cube onto the plane x + y + z = 0 using the projection matrix given there. For the case of k neurons, the rows of the projection matrix are the vectors

√2 (1, 1, 1, …, 1, −(k − 1)) / √(k(k − 1)),
√2 (1, 1, 1, …, −(k − 2), 0) / √((k − 1)(k − 2)),
⋮
√2 (1, 1, −2, 0, …, 0) / √6,
√2 (1, −1, 0, …, 0) / √2.

Once again they are perpendicular, and their norm is again √2. These vectors, together with the vector √2 (1, 1, 1, …, 1)/√k, form a Helmert matrix (see Mardia, Kent, & Bibby, 1979) up to the constant √2. Note that the point (1, 1, …, 1, 1, 0) projects onto

( √2 (k − 1)/√(k(k − 1)), 0, …, 0 ),

and (1, 1, …, 1, 0, 1) projects onto

( −√2/√(k(k − 1)), √2 (k − 2)/√((k − 1)(k − 2)), 0, …, 0 ).

In general, the points (1, 1, …, 1, 1, 0) and (1, 1, …, 1, 0, 1) represent the (k − 1)-way coincidences; that is, some (k − 1) of the neurons fire simultaneously, and the remaining one fires at some different time. The projections of these points give us information about the coincidence lines in the projection. For example, the angle between these two coincidence lines is
arccos[ (−2/k) / ( √(2(k − 1)/k) √(2(k − 1)/k) ) ] = arccos( −1/(k − 1) ),

since the two projected points each have squared norm 2(k − 1)/k (using 1 + k(k − 2) = (k − 1)²) and their inner product is −2/k.
So for k = 3 neurons, the angle between coincidence lines is arccos(−1/2) = 120°; for k = 4, it is arccos(−1/3) ≈ 109.47°; and for large k, lim_{k→∞} arccos(−1/(k − 1)) = π/2;
thus, the angle between coincidence lines becomes perpendicular as k → ∞. Finally, we show how to derive the axes for the case of four neurons. Note that the coincidence line ABC must be perpendicular to the plane in which the values B − A vary; also, the coincidence line ABD must be perpendicular to that plane. This gives us enough information to find the axis B − A as being perpendicular to the two coincidence lines:

v_B−A = (0, 0, 1.414)
v_A−C = (0, −1.225, −0.707)
v_D−A = (1.155, 0.408, 0.707)
v_C−B = (0, 1.225, −0.707)
v_B−D = (−1.155, −0.408, 0.707)
v_D−C = (1.155, −0.815, 0).
Notice that v_A−C = −(v_B−A + v_C−B), so the vectors v_B−A, v_C−B, and v_A−C lie in a common plane (x = 0) and make equal angles of 120°. Hence, the projection onto the plane x = 0 is again the (snowflake) hexagon. Furthermore, it can be shown that the angles among v_B−A, v_A−D, v_D−B are 120°; among v_A−C, v_C−D, and v_D−A, they are also 120°. Finally, the angle between v_B−A and v_D−C is 90°; the same holds for v_A−C and v_B−D and for v_D−A and v_C−B.

8 Discussion

In this article, we have begun a quantitative study of the snowflake plot of Perkel et al. (1975). Our findings can be summarized as follows. First, we used the unifying idea of projections of spike trains to show how the snowflake plot relates to the cross-correlation histogram (CCH) and the joint peristimulus time histogram (JPSTH). We treated the CCH and the snowflake plot as results of orthogonal projections—the CCH from two- onto one-dimensional space, the snowflake plot from three- onto two-dimensional space. We also showed how the snowflake plot generalizes the spike-triggered joint histogram (STJH) symmetrically: the snowflake is the union of all three STJHs of the three neurons. Second, we analytically derived the distribution of points on the snowflake plot under the assumption that the neurons fire independently from homogeneous Poisson processes. The joint spike density function over the snowflake plot has a tent-like shape. However, if we are interested in the firings that lie within some span, then the analytical density of the points in the snowflake converges to a constant density as the total observation time increases and the span decreases. Third, we studied the geometry of the snowflake for more than three neurons. Although one cannot easily visualize the projections, except for four neurons, an analysis similar to that for three neurons can be done. One
Theory of the Snowflake Plot
1473
can derive the joint density in the projection in the null case and subtract it from the projected data to search for patterns. This can be useful in exploratory analysis of several spike trains simultaneously. How does the snowflake help us in analyzing data? It can be used for screening purposes to find potential PFSs and as a graphical device to visualize the PFSs. The idea is to create a grid above the snowflake plot and construct a histogram. The advantage is that all three STJHs (i.e., the raw counts matrices in Prut et al., 1998) are in one histogram, but the disadvantage is that the display is triangular rather than orthogonal. However, finding the raw counts is only the beginning of the process of detecting spatiotemporal patterns. One then needs to calculate the corrected counts (corrected for the expected counts) and derive a test to identify them as significant events. Prut et al. (1998) developed a method, based on counting, that allows the identification of significant events. However, there is still a long way to go to evaluate significance, since higher-order correlations need to be evaluated, and therefore one needs to correct for chance coincidences of lower-order correlations (see Schneider & Grün, 2003; Nakahara & Amari, 2002; Gütig, Aertsen, & Rotter, 2003). If the interest is only to identify deviations from full independence, one can apply the same procedure as in the unitary events analysis (see the joint surprise in Grün et al., 2002). Analytical work will be of limited value because significance calculations for such models are quite complicated; it is more likely that simulation approaches that use shuffled or jittered trains (see Brown, Kass, & Mitra, 2004) will be needed, but these must be executed with care (Gerstein, 2004). There are other avenues of research besides the evaluation of significance in the snowflake plot.
A thorough study of the artifact of the snowflake plot—the woven pattern (Perkel et al., 1975)—is needed. It is caused by multiple entries, where one firing of a neuron is represented by multiple points in the plot. The artifact is always present, but it is less visible if the span L is small. Also of interest would be to generalize the independence assumption and find the distribution of the points in the snowflake plot. Another generalization can be made by modifying the Poisson assumption. Furthermore, the derivations of the joint spike density can be generalized to more than three neurons.

Appendix A: Simulation of a Coincidence Detector

To simulate a homogeneous Poisson process with intensity λ, one can use a direct method. Consider a spike at time t, t ∈ [0, τ]. Then we can draw a random number, E, from an exponential distribution with mean time interval 1/λ. This defines the next spike at time t + E. This method can give the times of spikes with any precision in principle. An equivalent way of simulating a Poisson process is to first draw the total number, n, of spikes from a Poisson distribution with parameter λτ.
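The direct method just described can be sketched in a few lines of Python (the function name and the final count check are our own illustration, not part of the article):

```python
import numpy as np

def poisson_train(lam, tau, rng):
    """Direct method: accumulate exponential inter-spike intervals
    with mean 1/lam until the observation interval [0, tau] ends."""
    times = []
    t = rng.exponential(1.0 / lam)
    while t <= tau:
        times.append(t)
        t += rng.exponential(1.0 / lam)
    return np.array(times)

rng = np.random.default_rng(0)
# With lam = 15 spikes/s and tau = 1 s, the spike count of each
# simulated train is Poisson with mean lam * tau = 15.
counts = [poisson_train(15.0, 1.0, rng).size for _ in range(5000)]
```

Averaging the counts recovers λτ, which is exactly the Poisson count that the equivalent count-then-sort construction draws first.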
1474
G. Czanner, S. Grün, and S. Iyengar
Then we can draw a random sample of size n from the uniform distribution over the interval [0, τ]. Finally, we sort the sample in increasing order to get the Poisson process. We took the following steps to simulate a coincidence detector based on Poisson trains (see also Figure 8):

1. We simulated neurons A and B as two independent realizations of Poisson processes with intensities λ_A = λ_B = 15 spikes per second. In order to simulate the Poisson process, we used the second algorithm above.
2. We created the spike train C_1 by looking for coincidences between A and B. If an A spike is followed by a B spike (within 5 ms), we create a C_1 spike at the time of the B spike. Analogously, if a B spike is followed by an A spike (within 5 ms), we create a C_1 spike at the time of the A spike.
3. We applied a latency of 3.5 ms to each C_1 spike to get the C_2 spike train.
4. The train C_3 is a noisy version of C_2. It is obtained by jittering each spike of C_2 independently and uniformly over a time window of ±0.5 ms around its original position.
5. We simulated the background process C_0 as a Poisson process with intensity 0.2 spikes per second, independent of the spike trains A and B. This will be the background process for neuron C.
6. We injected the spikes of C_0 into the train C_1 to create the spike train C. In the theory of point processes, this is also called the superposition of point processes.

Appendix B: Derivation of Null Distribution of Snowflake Plot

We now derive the joint density function f_{X,Y}(x, y) (see section 6) of the points on the snowflake plot when the three neurons are firing independently according to homogeneous Poisson processes with possibly different rates. We showed in section 3 that the snowflake plot is an orthogonal projection of the observations from the cube [0, τ]³ onto the plane x + y + z = 0. Hence, we first compute the joint spike density function in the cube; then we derive the induced probability distribution of the projection.
We first recall the following fact about a homogeneous Poisson process with n points in the interval [0, τ ] (Karlin & Taylor, 1975): conditional on n, the points have the same distribution as uniform order statistics of size n in the interval [0, τ ]. Hence, conditional on the number of spikes in each of the three spike trains, the distribution of the points in the cube is the
same as the distribution of a random vector (U_1, U_2, U_3), where U_1, U_2, U_3 are independent and identically distributed with uniform distribution on [0, τ]. We next project this distribution onto the plane x + y + z = 0. Now construct the random vector

\[(X, Y, Z) = \left( \frac{2}{\sqrt{3}}\left( U_3 - \frac{U_1 + U_2}{2} \right),\; U_2 - U_1,\; U_3 \right). \tag{B.1}\]
To find the distribution of the snowflake plot (X, Y), we first find the distribution of (X, Y, Z) and then integrate out Z. The inverse transformation of equation B.1 is

\[(U_1, U_2, U_3) = \left( -X\sqrt{3}/2 - Y/2 + Z,\; -X\sqrt{3}/2 + Y/2 + Z,\; Z \right),\]

which has Jacobian determinant −√3/2; hence, the joint probability density function of (X, Y, Z) is

\[f_{XYZ}(x, y, z) = \frac{\sqrt{3}}{2\tau^3}\, I\{0 \le -x\sqrt{3}/2 - y/2 + z \le \tau,\; 0 \le -x\sqrt{3}/2 + y/2 + z \le \tau,\; 0 \le z \le \tau\},\]

where I is the indicator function: I{0 ≤ x ≤ τ} = 1 if 0 ≤ x ≤ τ and zero otherwise. Next, we integrate out z from f_{XYZ}(x, y, z). This is equivalent to finding the intersection of the three intervals [x√3/2 + y/2, τ + x√3/2 + y/2], [x√3/2 − y/2, τ + x√3/2 − y/2], and [0, τ] for each combination of (x, y). Now fix a sector: for example, if (x, y) ∈ sector ACB, then 0 ≤ y ≤ τ, x√3 ≤ y, and −x√3 ≤ y. Since 0 ≤ y, we have

\[\left[ x\sqrt{3}/2 + y/2,\; \tau + x\sqrt{3}/2 + y/2 \right] \cap \left[ x\sqrt{3}/2 - y/2,\; \tau + x\sqrt{3}/2 - y/2 \right] = \left[ x\sqrt{3}/2 + y/2,\; \tau + x\sqrt{3}/2 - y/2 \right].\]

Next, since −x√3 ≤ y and x√3 ≤ y, we have x√3/2 + y/2 ≥ 0 and τ + x√3/2 − y/2 ≤ τ. Hence,

\[\left[ x\sqrt{3}/2 + y/2,\; \tau + x\sqrt{3}/2 - y/2 \right] \cap [0, \tau] = \left[ x\sqrt{3}/2 + y/2,\; \tau + x\sqrt{3}/2 - y/2 \right].\]
Finally, the length of the last interval is τ − y, so

\[f_{XY}(x, y) = \frac{\sqrt{3}}{2\tau^3} (\tau - y)\]

in the sector ACB. The proof for the other sectors is similar.
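The derived density can be checked by a small Monte Carlo experiment (our own sketch, with τ = 1): projecting uniform triples through equation B.1 and restricting to sector ACB, the density f_XY ∝ (τ − y) over a sector of width 2y/√3 implies a marginal density of y proportional to y(τ − y), whose mean is τ/2.

```python
import numpy as np

rng = np.random.default_rng(0)
tau, n = 1.0, 300_000
U = rng.uniform(0.0, tau, size=(n, 3))

# Equation B.1: coordinates of the projection onto the plane x + y + z = 0.
X = (2.0 / np.sqrt(3.0)) * (U[:, 2] - (U[:, 0] + U[:, 1]) / 2.0)
Y = U[:, 1] - U[:, 0]

# Sector ACB: 0 <= y <= tau and sqrt(3)|x| <= y.
y = Y[(Y >= 0.0) & (np.sqrt(3.0) * np.abs(X) <= Y)]
# Within the sector, y should behave like a sample with density
# proportional to y * (tau - y), whose mean is tau / 2.
```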
Appendix C: Null Mean and Variance of Counts in Cubes

Suppose that the three neurons are firing according to independent Poisson processes with intensities λ_A, λ_B, and λ_C. Denote the firing times of the three neurons by (A_i, B_j, C_k), with 1 ≤ i ≤ n_A, 1 ≤ j ≤ n_B, and 1 ≤ k ≤ n_C. Next, define the statistic

\[N\{(a, b] \times (c, d] \times (e, f]\} = \sum_{i=1}^{n_A} \sum_{j=1}^{n_B} \sum_{k=1}^{n_C} I\{(A_i, B_j, C_k) \in (a, b] \times (c, d] \times (e, f]\} = \sum_{i=1}^{n_A} I\{A_i \in (a, b]\} \sum_{j=1}^{n_B} I\{B_j \in (c, d]\} \sum_{k=1}^{n_C} I\{C_k \in (e, f]\},\]

which counts the number of points in the cube (a, b] × (c, d] × (e, f]. Thus, conditional on the number of spikes, this count is the product of three independent binomial variates, say, X ∼ Bin(n_A, (b−a)/τ), Y ∼ Bin(n_B, (d−c)/τ), and Z ∼ Bin(n_C, (f−e)/τ). Thus, the conditional mean of N = N{(a, b] × (c, d] × (e, f]} is

\[E(N \mid n_A, n_B, n_C) = \frac{1}{\tau^3}\, n_A n_B n_C (b - a)(d - c)(f - e),\]

and since the n_i are independent Poisson variates with means λ_i τ for i = A, B, C, the unconditional mean is

\[E(N) = \lambda_A \lambda_B \lambda_C (b - a)(d - c)(f - e).\]

Thus, the conditional mean depends on the number of spikes n_A, n_B, n_C through their product only, and the unconditional mean depends on the intensities λ_A, λ_B, λ_C through their product only. In both cases, they are proportional to the volume of the box (a, b] × (c, d] × (e, f]. Next, the computation and study of the conditional variance of N is rather lengthy, so we sketch the details here and refer to Czanner (2004) for the technical details. The conditional variance is

\[\mathrm{var}(N \mid n_A, n_B, n_C) = \mathrm{var}(XYZ) = E\left[ (XYZ)^2 \right] - \left[ E(XYZ) \right]^2.\]

To simplify notation, suppose that b − a = d − c = f − e = h, so the box is a cube. Using the independence of X, Y, and Z and binomial moments, we
get

\[\mathrm{var}(N \mid n_A, n_B, n_C) = \frac{n_A n_B n_C\, h^3 (1 - h)^3}{\tau^6} \left[ 1 + \frac{h}{1 - h}(n_A + n_B + n_C) + \left( \frac{h}{1 - h} \right)^2 (n_A n_B + n_B n_C + n_C n_A) \right],\]

which depends on n_A, n_B, n_C separately, not just on their product. This conditional variance is smallest if n_A = n_B = n_C (assuming (b − a) = (d − c) = (f − e) = h and n_A n_B n_C = K_0, where K_0 is a given positive constant that measures the overall density of points in the cube). First, notice that the variance has the form

\[H(n_A, n_B, n_C) = n_A n_B n_C \left[ K_1 + K_2 (n_A n_B + n_B n_C + n_C n_A) + K_3 (n_A + n_B + n_C) \right],\]

for some positive constants K_1, K_2, K_3 when the counts are all nonzero and h is not 0 or 1. To minimize H subject to the constraint n_A n_B n_C = K_0, we treat the counts as continuous variables and use Lagrange multipliers to show that the minimum occurs when n_A = n_B = n_C. Finally, to get the unconditional variance, we use the fact that for any two random variables or vectors U and V, var[U] = E(var[U|V]) + var(E[U|V]). Now if we let U = N and V = (n_A, n_B, n_C), and use the moments of the Poisson distribution, we get that the unconditional variance of N is

\[\mathrm{var}(N) = \lambda_A \lambda_B \lambda_C \left[ K_4 + K_5 (\lambda_A + \lambda_B + \lambda_C) + K_6 (\lambda_A \lambda_B + \lambda_A \lambda_C + \lambda_B \lambda_C) \right],\]

where K_4, K_5, and K_6 are all positive constants depending on τ and h only. Note that the unconditional variance has the same form as the conditional variance, so that for a given value of the product λ_A λ_B λ_C, the unconditional variance is minimized when λ_A = λ_B = λ_C.

Acknowledgments

We thank the referees for their helpful comments.

References

Abeles, M. (1982a). Local cortical circuits: An electrophysiological study. New York: Springer.
Abeles, M. (1982b). Role of cortical neuron: Integrator or coincidence detector? Israel J. Medical Science, 18, 83–92.
Abeles, M. (1991). Corticonics: Neural circuits of the cerebral cortex. Cambridge: Cambridge University Press.
Abeles, M., Bergman, H., Margalit, E., & Vaadia, E. (1993). Spatiotemporal firing patterns in the frontal cortex of behaving monkeys. J. Neurophysiology, 70, 1629–1638.
Abeles, M., & Gerstein, G. L. (1988). Detecting spatiotemporal firing patterns among simultaneously recorded single neurons. J. Neuroscience, 60, 909–924.
Aertsen, A. M. H. J., Gerstein, G. L., Habib, M. K., & Palm, G. (1989). Dynamics of neuronal firing correlation: Modulation of "effective connectivity." J. Neurophysiology, 61, 900–917.
Beggs, J. M., & Plenz, D. (2004). Neuronal avalanches are diverse and precise activity patterns that are stable for many hours in cortical slice cultures. J. Neuroscience, 24, 5216–5229.
Brown, E. N., Kass, R. E., & Mitra, P. P. (2004). Multiple neural spike train data analysis: State-of-the-art and future challenges. Nature Neuroscience, 7, 456–461.
Buzsaki, G. (2004). Large-scale recording of neuronal ensembles. Nature Neuroscience, 7, 446–451.
Czanner, G. (2004). Applications of statistics in neuroscience. Unpublished doctoral dissertation, University of Pittsburgh.
Gerstein, G. (2004). Searching for significance in spatiotemporal firing patterns. Acta Neurobiologiae Experimentalis, 64, 203–207.
Gerstein, G. L., & Perkel, D. H. (1969). Simultaneously recorded trains of action potentials: Analysis and functional interpretations. Science, 164, 828–830.
Gerstein, G. L., & Perkel, D. H. (1972). Mutual temporal relationships among neuronal spike trains: Statistical techniques for display and analysis. Biophysical Journal, 12, 453–473.
Grün, S., Diesmann, M., & Aertsen, A. (2002). Unitary events in multiple single-neuron spiking activity: I. Detection and significance. Neural Computation, 14, 43–80.
Gütig, R., Aertsen, A., & Rotter, S. (2003). Analysis of higher-order neuronal interactions based on conditional inference.
Biological Cybernetics, 88, 352–359.
Hoffman, K. L., & McNaughton, B. L. (2002). Coordinated reactivation of distributed memory traces in primate neocortex. Science, 297, 2070–2073.
Karlin, S., & Taylor, H. M. (1975). A first course in stochastic processes. New York: Academic Press.
Kristan, W. B., & Gerstein, G. L. (1970). Plasticity of synchronous activity in a small neural net. Science, 169, 1336–1339.
Kuhn, A., Aertsen, A., & Rotter, S. (2003). Higher-order statistics of input ensembles and the response of simple model neurons. Neural Computation, 15, 67–101.
Mardia, K. V., Kent, J. T., & Bibby, J. M. (1979). Multivariate analysis. New York: Academic Press.
Nakahara, H., & Amari, S. (2002). Information-geometric measure for neural spikes. Neural Computation, 14, 2269–2316.
Perkel, D. H., Gerstein, G. L., Smith, M. S., & Tatton, W. G. (1975). Nerve-impulse patterns: A quantitative display technique for three neurons. Brain Research, 100, 271–296.
Prut, Y., Vaadia, E., Bergman, H., Haalman, I., Slovin, H., & Abeles, M. (1998). Spatiotemporal structure of cortical activity: Properties and behavioral relevance. J. Neurophysiology, 79, 2857–2874.
Schneider, G., & Grün, S. (2003). Analysis of higher-order correlations in multiple parallel processes. Neurocomputing, 52–54, 771–777.
Villa, A. E., & Fuster, J. M. (1992). Temporal correlates of information processing during visual short-term memory. Neuroreport, 3, 113–116.
Warren, D. J., Fernandez, E., & Normann, R. A. (2001). High-resolution two-dimensional spatial mapping of cat striate cortex using a 100-microelectrode array. Neuroscience, 105, 19–31.
Received November 4, 2003; accepted December 1, 2004.
LETTER
Communicated by Naftali Tishby
Asymptotic Theory of Information-Theoretic Experimental Design Liam Paninski
[email protected] Gatsby Computational Neuroscience Unit, University College London, London, WC1N 3AR, U.K.
We discuss an idea for collecting data in a relatively efficient manner. Our point of view is Bayesian and information-theoretic: on any given trial, we want to adaptively choose the input in such a way that the mutual information between the (unknown) state of the system and the (stochastic) output is maximal, given any prior information (including data collected on any previous trials). We prove a theorem that quantifies the effectiveness of this strategy and give a few illustrative examples comparing the performance of this adaptive technique to that of the more usual nonadaptive experimental design. In particular, we calculate the asymptotic efficiency of the information-maximization strategy and demonstrate that this method is in a well-defined sense never less efficient—and is generically more efficient—than the nonadaptive strategy. For example, we are able to explicitly calculate the asymptotic relative efficiency of the staircase method widely employed in psychophysics research and to demonstrate the dependence of this efficiency on the form of the psychometric function underlying the output responses.

1 Introduction

Many experiments are undertaken with the hope of elucidating some kind of "input-output" relationship: the experimenter presents some stimulus to the system under study and records the response. More generally, the experimenter places some observational apparatus in some state—for example, by pointing a microscope to a given location or selecting some subfield in a database stream—and records the subsequent observation. If the system is simple enough and a sufficient number of observations are made, the resulting collection of data should provide an acceptably precise description of the system's overall behavior. Given this basic paradigm, in which the experimenter has some kind of control over what stimulus is chosen or what kind of data are collected, how do we design experiments to be as efficient as possible?
Neural Computation 17, 1480–1507 (2005)    © 2005 Massachusetts Institute of Technology

How can we learn the most about the system under study in the least amount of time? This question becomes especially pressing in the context of high-dimensional,
complex systems, where each input-output pair typically provides a small amount of information about the behavior of the system as a whole and opportunities to record responses are rare or expensive (or both). In such cases, good experimental design can play an essential role in making the benefits of the experiment worth the cost. How can we precisely define this intuitive concept of the efficiency of an experiment? First, we have to define what exactly we mean by experiment. We use the following simple model of experimental design here (we have neurophysiological experiments in mind, but our results are all general with respect to the identity of the system under study). The basic idea is that we have some set of models Θ, where each model θ indexes a given probabilistic input-output relationship. More precisely, a model is a set of regular conditional probability distributions p(y|x, θ) on Y, the set of possible output responses, given any input stimulus x in some space X. Therefore, if we know the identity of the model θ, we know the probability of observing any output y given any input x. Of course, we do not know θ exactly (otherwise we would not need to perform any experiments); our knowledge of the system is summarized in the form of a prior probability measure, p_0(θ), on Θ, and our goal is to reduce the uncertainty of this distribution as much as possible. To put everything together, the joint probability of θ, x, and y is given by the following simple equation:

p(x, y, θ) = p_0(θ) p(x) p(y|x, θ).
Now we can define the "design" of our experiment in a straightforward way: on any given trial, the design is specified completely by the choice of the input probability p(x), the only piece of the above equation over which we have control. One common approach is to fix some p(x) at the beginning of the experiment and then sample from this distribution in an independent and identically distributed (i.i.d.) manner for all subsequent trials, independently of which input-output pairs might have been observed on any previous trial. Alternatively, we could try to design our experiment—choose p(x)—optimally in some sense, updating p(x) online, on each trial, as more input-output data are collected and our understanding of the system increases. (The simplest special case of this would be to choose p(x) to put all probability mass on a single x, where x is optimized on each new trial.) One natural idea would be to choose p(x) in such a way that we learn as much as possible about the underlying model, on average. Information theory (Cover & Thomas, 1991) thus suggests we choose p(x) to optimize the following objective function,

I({x, y}; θ),    (1.1)
1482
L. Paninski
where I(·; ·) denotes mutual information. In other words, we want to choose p(x) adaptively to maximize the information provided about θ by the pair {x, y}, given our current knowledge of the model as summarized in the posterior distribution given N samples of data: p_N(θ) = p(θ | {x_i, y_i}_{1≤i≤N}). We will take this information-theoretic concept of efficiency as our starting point. We note, however, that similar ideas have seen application in a wide and somewhat scattered literature: in statistics (Lindley, 1956), computer vision (Denzler & Brown, 2000; Lee & Yu, 1999), machine learning (Luttrell, 1985; MacKay, 1992; Cohn, Ghahramani, & Jordan, 1996; Sollich, 1996; Freund, Seung, Shamir, & Tishby, 1997; Axelrod, Fine, Gilad-Bachrach, Mendelson, & Tishby, 2001), conceptual psychology (Nelson & Movellan, 2000), psychophysics (Watson & Pelli, 1983; Pelli, 1987; Watson & Fitzhugh, 1990; Kontsevich & Tyler, 1999), medical applications (Parmigiani, 1998; Parmigiani & Berry, 1994), and neuroscience (Sahani, 1997). These references all discuss, to some degree, the motivation behind various different design criteria, of which the information-theoretic criterion is well motivated but certainly not unique. For more general reviews of the theory of experimental design, see, for example, Chaloner and Verdinelli (1995) and Fedorov (1972). In addition, several attempts have been made to devise algorithms to find the "optimal stimulus" of a neuron, where optimality is defined in terms of firing rate (Tzanakou, Michalak, & Harth, 1979; Nelken, Prut, Vaadia, & Abeles, 1994; Foldiak, 2001), but we should emphasize that the two concepts of optimality are not related in general and turn out to be typically at odds (maximizing the firing rate of a cell does not maximize—and in fact often minimizes—the amount we can expect to learn about the cell; see sections 3 and 4).
Most recently, Machens (2002) proposed the maximization of the mutual information between the stimulus x and response y; again, though, this procedure does not directly maximize the amount of information we gain about the underlying system θ. Somewhat surprisingly, we have not seen any applications of the information-theoretic objective function, equation 1.1, to the design of neurophysiological experiments (although see the abstract by Mascaro & Bradley, 2002, who seem to have independently implemented the same idea in a simulation study). One major reason for this might be the computational demands of this kind of design (particularly for real-time applications), although these problems certainly do not appear to be intractable given modern computing power (see, e.g., Kontsevich & Tyler, 1999, for a real-time application in which Θ is two-dimensional). We hope to address these important computational questions elsewhere. The primary goal of this letter is to elucidate the asymptotic behavior of the a posteriori density p_N when we choose x according to the recipe outlined above; in particular, we want to compare the adaptive case to
the more usual (i.i.d. x) case. Our main result (in section 2) states that under acceptably weak conditions on the models p(y|x, θ), the information-maximization strategy leads to consistent and efficient estimates of the true underlying model, in a natural sense. In particular, the information-maximization strategy is never less efficient, in a well-defined sense—and is generically more efficient—than the simpler, nonadaptive, i.i.d. x strategy. We also give a few examples to illustrate the applicability of our results (see sections 3 and 4), including a couple of surprising negative examples that demonstrate the nontriviality of our mathematical results (see section 5). We close by briefly noting the relevance of our results to noninformation-theoretic (e.g., mean-square-error based) design and describing a few open avenues for further research.

2 Results

First, we note that the problem as posed in section 1 turns out to be slightly simpler than one might have expected, because I({x, y}; θ) is linear in p(x):

\[\begin{aligned}
I(\{x, y\}; \theta) &= \int_X \int_Y \int_\Theta p(x, y, \theta) \log \frac{p(x, y, \theta)}{p(x, y)\, p_N(\theta)} \\
&= \int_X \int_Y \int_\Theta p(x, y, \theta) \log \frac{p(x)\, p_N(\theta)\, p(y|x, \theta)}{p(y|x)\, p(x)\, p_N(\theta)} \\
&= \int_X \int_Y \int_\Theta p(x, y, \theta) \log \frac{p(y|x, \theta)}{p(y|x)} \\
&= \int_X p(x) \int_Y \int_\Theta p_N(\theta)\, p(y|x, \theta) \log \frac{p(y|x, \theta)}{\int_\Theta p_N(\theta')\, p(y|x, \theta')\, d\theta'}.
\end{aligned}\]
This, in turn, implies that the optimal p(x) must be degenerate, concentrated on the points x where I is maximal. Thus, instead of finding optimal distributions p(x), we need only find optimal inputs x, in the sense of maximizing the conditional information between θ and y, given a single input x:

\[I(y; \theta \mid x) \equiv \int_Y \int_\Theta p_N(\theta)\, p(y|x, \theta) \log \frac{p(y|x, \theta)}{\int_\Theta p_N(\theta')\, p(y|x, \theta')\, d\theta'}.\]
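For a discretized parameter space, this quantity is cheap to evaluate and maximize directly. The sketch below (the grids, scale, and function names are our own illustrative choices, using a binary-response psychometric model of the kind treated in section 3.1) scores each candidate input x by I(y; θ|x) = H(y|x) − E_θ[H(y|x, θ)] under the current posterior:

```python
import numpy as np

def entropy(p):
    """Binary entropy in nats."""
    p = np.clip(p, 1e-12, 1.0 - 1e-12)
    return -(p * np.log(p) + (1.0 - p) * np.log(1.0 - p))

def infomax_stimulus(x_grid, theta_grid, posterior, a=0.1):
    """Return the x in x_grid maximizing I(y; theta | x) for the
    Bernoulli model p(y=1 | x, theta) = logistic((x - theta) / a)."""
    best_x, best_info = x_grid[0], -np.inf
    for x in x_grid:
        p1 = 1.0 / (1.0 + np.exp(-(x - theta_grid) / a))  # p(y=1 | x, theta)
        py1 = posterior @ p1                               # p(y=1 | x)
        info = entropy(py1) - posterior @ entropy(p1)      # I(y; theta | x)
        if info > best_info:
            best_x, best_info = x, info
    return best_x

theta_grid = np.linspace(-1.0, 1.0, 201)
posterior = np.full(theta_grid.size, 1.0 / theta_grid.size)  # uniform p_N
x_star = infomax_stimulus(theta_grid, theta_grid, posterior)
```

For this symmetric posterior the maximizer sits near the posterior median, where p(y = 1|x) ≈ 1/2; as observations sharpen p_N, the selected inputs track the current threshold estimate, which is the behavior that the staircase methods discussed in section 3.1 approximate.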
(We will assume throughout the article that this function attains its supremum in X—a condition guaranteeing that this is so will be given below—and that some reasonable, though possibly nondeterministic, tie-breaking strategy exists when this maximum is not unique.) Our main result is a Bernstein–von Mises type of theorem (van der Vaart, 1998). The classical form of this kind of result says, basically, that if the posterior distributions are consistent (in the sense that p_N(U) → 1 for any neighborhood U of the true parameter θ_0) and the likelihood ratios are sufficiently smooth on average, then the posterior distributions p_N(θ) are
asymptotically normal, with easily calculable asymptotic mean and variance. In particular, it is well known that a result of this type holds in the i.i.d. x case: under the smoothness conditions stated below, the posterior distribution p_N is asymptotically normal, with covariance matrix σ²_iid/N,

\[\sigma^2_{iid} \equiv \left( \int_X dp(x)\, I_{\theta_0}(x) \right)^{-1},\]

and a mean that itself is a normal random variable with mean θ_0 and covariance σ²_iid/N. Here we have denoted the Fisher information matrices

\[I_\theta(x) = \int_Y \left( \frac{\dot p(y|x, \theta)}{p(y|x, \theta)} \right) \left( \frac{\dot p(y|x, \theta)}{p(y|x, \theta)} \right)^t dp(y|x, \theta),\]
where the differential ṗ is taken with respect to θ. In other words, the asymptotic variance decays as 1/N, with the exact rate σ²_iid defined as the inverse of the average Fisher information, where the average is taken over p(x). We adapt this result to the present case, where x is chosen according to the information-maximization recipe. Our main result will allow us to compute the asymptotic variance σ²_info/N and in particular will demonstrate that |σ²_info| ≤ |σ²_iid|, with |·| denoting the determinant of a matrix; that is, the information-maximization strategy is more efficient in general than i.i.d. sampling, at least in the sense measured by the determinant |·|. It turns out that the hard part is proving consistency (see section 5); we give the basic consistency lemma (interesting in its own right) first, from which the main theorem follows fairly easily. The proofs appear in appendix A.

Lemma 1 (Consistency). Assume the following conditions:
1. The parameter space Θ is a compact metric space.
2. The log likelihood log p(y|x, θ) is uniformly Lipschitz in θ with respect to some dominating measure on Y.
3. The prior measure p_0 assigns positive measure to any neighborhood of θ_0.
4. The maximal Kullback-Leibler divergence,
\[\sup_x D_{KL}(\theta_0; \theta \mid x) \equiv \sup_x \int_Y dp(y|x, \theta_0) \log \frac{p(y|x, \theta_0)}{p(y|x, \theta)},\]
is positive for all θ ≠ θ_0.
Finally, assume that the set of log likelihood functions log p(y|x, θ), indexed by x, is compact in the sup-norm topology on θ-continuous functions on Y × Θ. Then the posteriors are consistent: p_N(U) → 1 in probability for any neighborhood U of θ_0.
Theorem 1 (Asymptotic normality). Assume the conditions of lemma 1, strengthened as follows:
1. Θ has a smooth, finite-dimensional manifold structure in a neighborhood of θ_0.
2. The log likelihood log p(y|x, θ) is uniformly C² in θ. In particular, the Fisher information matrices I_θ(x) are well defined and continuous in θ, uniformly in (x, θ) in some neighborhood of θ_0.
3. The prior measure p_0 is absolutely continuous in some neighborhood of θ_0, with a continuous positive density at θ_0.
4. max_{C ∈ co(I_{θ_0}(x))} |C| > 0, where co(I_{θ_0}(x)) denotes the convex closure of the set of Fisher information matrices I_{θ_0}(x).

Then ||p_N − N(µ_N, σ²_N)|| → 0 in probability, where ||·|| denotes variation distance, N(µ_N, σ²_N) denotes the normal density with mean µ_N and covariance σ²_N, and µ_N is asymptotically normally distributed, with mean θ_0 and variance σ²_N. Here

\[N \sigma^2_N \to \sigma^2_{info} \equiv \left( \operatorname{argmax}_{C \in co(I_{\theta_0}(x))} |C| \right)^{-1}.\]

The maximum in the above expression is well defined and unique.

Corollary 1. If, in addition, the prior p_0 is absolutely continuous, with density bounded on the parameter space Θ, then the maximum a posteriori (MAP) estimator is consistent almost surely, with asymptotic distribution N(θ_0, σ²_N).

Thus, under these conditions, the information-maximization strategy works; moreover, since the asymptotic i.i.d. variance σ²_iid is inversely related to an average over x and the information-maximization variance σ²_info to a maximum over co(I_{θ_0}(x)), we have by the definition of co(I_{θ_0}(x))—the closure of the set of all possible averages over I_{θ_0}(x) with respect to arbitrary p(x)—that |σ²_info| is never larger than |σ²_iid|. For one-dimensional θ, σ²_info is strictly smaller than σ²_iid except in the somewhat exceptional case that I_{θ_0}(x) is constant almost surely in p(x). Thus, information maximization is in a rigorous sense asymptotically more efficient than the i.i.d. sampling strategy. A few words about the assumptions are in order. Most should be fairly self-explanatory: the conditions on the priors, as usual, are there to ensure that no matter how mistaken our original prior beliefs are, in the face of sufficient posterior evidence, we will come around to agreeing that the data
are in fact generated by the true underlying model θ_0. The smoothness assumptions on the likelihood permit the local expansion that is the source of asymptotic normality, and the condition on the maximal divergence function sup_x D_KL(θ_0; θ|x) ensures that distinct models θ_0 and θ are identifiable (that is, for any θ ≠ θ_0, there is some input x that will reliably distinguish between θ and θ_0 given enough output samples y_i). The assumption that the set of log likelihood functions log p(y|x, θ) is compact will guarantee that the objective function I(y; θ|x) always attains its maximum in x. Finally, some form of monotonicity or compactness on Θ is necessary, mostly to bound the maximal divergence function sup_x D_KL(θ_0; θ|x) and its inverse away from zero (the lower bound, again, is to uniformly ensure identifiability; the necessity of the upper bound will become clear in section 5). Also, compactness is useful (though not necessary) for adapting certain Glivenko-Cantelli bounds (van der Vaart, 1998) for the consistency proof. It should also be clear that we have not stated the results as generally as possible. We have chosen instead to use assumptions that are simple to understand and verify and to leave the technical generalizations to the interested reader. Our assumptions should be weak enough for most neurophysiological and psychophysical situations, for example, by assuming that parameters take values in bounded (though possibly large) sets and that tuning curves are not infinitely steep.
3 Applications

3.1 Psychometric Model. As noted in section 1, psychophysicists have employed versions of the information-maximization procedure for some years (Watson & Pelli, 1983; Pelli, 1987; Watson & Fitzhugh, 1990; Kontsevich & Tyler, 1999). References in Watson and Fitzhugh (1990), for example, go back four decades, and while these earlier investigators usually couched their discussion in terms of variance instead of entropy, the basic idea is the same (note, for example, that in the one-dimensional θ case, minimizing entropy is asymptotically equivalent to minimizing variance, by our main theorem). Our results above allow us to quantify the effectiveness of this strategy precisely. One general psychometric model is as follows. The response space Y is binary, corresponding to subjective yes or no detection responses. Let f be sigmoidal: a uniformly smooth, monotonically increasing function on the line, such that f(0) = 1/2, lim_{t→−∞} f(t) = 0, and lim_{t→∞} f(t) = 1 (this function represents the detection probability when the subject is presented with a stimulus of strength t). Let f_{a,θ}(t) = f((t − θ)/a); θ here serves as a location ("threshold") parameter, while a sets the scale (we assume a is known for now, although this can be relaxed; see Kontsevich & Tyler, 1999). Finally, let p(x) and p_0(θ) be some fixed sampling and prior distributions, respectively, both equivalent to Lebesgue measure on some interval.
Asymptotic Theory of Information-Theoretic Experimental Design
L. Paninski
Now, for any fixed scale a, we want to compare the performance of the information-maximization strategy to that of the i.i.d. p(x) procedure. We have by theorem 1 that the most efficient estimator of θ is asymptotically unbiased with asymptotic variance σ²_info/N, with

σ²_info = ( sup_x I_{θ0}(x) )⁻¹,

while the usual calculations show that the asymptotic variance of any efficient estimator based on i.i.d. samples from p(x) is given by σ²_iid/N, with

σ²_iid = ( ∫_X dp(x) I_{θ0}(x) )⁻¹.

The Fisher information is easily calculated here to be

I_θ = (ḟ_{a,θ})² / ( f_{a,θ}(1 − f_{a,θ}) ).
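As a quick numerical sketch of these two asymptotic variances (our own illustration, not from the original note), the code below takes f to be a logistic sigmoid — an assumed, convenient choice satisfying the conditions above — and compares σ²_info with σ²_iid under a uniform sampling density:

```python
import numpy as np

def f(t):
    # logistic sigmoid: f(0) = 1/2, with limits 0 and 1, as required above
    return 1.0 / (1.0 + np.exp(-t))

def fisher_info(x, theta=0.0, a=1.0):
    # I_theta(x) = (d/dx f((x - theta)/a))^2 / (f (1 - f))
    p = f((x - theta) / a)
    fdot = p * (1 - p) / a        # logistic derivative, with the 1/a chain-rule factor
    return fdot**2 / (p * (1 - p))

xs = np.linspace(-3, 3, 2001)     # uniform sampling density p(x) on [-3, 3]
I = fisher_info(xs)

var_info = 1.0 / I.max()          # sigma^2_info = (sup_x I(x))^{-1}
var_iid = 1.0 / I.mean()          # sigma^2_iid  = (int dp(x) I(x))^{-1}

print(var_info, var_iid)          # the info-max variance is strictly smaller here
```

For the logistic, I_θ(x) = f(1 − f)/a², so σ²_info = 4a², while σ²_iid grows with the fraction of p(x)'s mass that falls on the flat tails of f, in line with the discussion in the text.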
We can immediately derive two easy but important conclusions. First, there is just one function f* satisfying the assumptions stated above for which the i.i.d. sampling strategy is as asymptotically efficient as the information-maximization strategy; for all other f, information maximization is strictly more efficient. This extremal function f* is the unique solution of the following differential equation, derived by setting I_θ to a constant (and therefore making the expected Fisher information equal to the maximal Fisher information),

df*/dt = c ( f*(t)(1 − f*(t)) )^{1/2},

where the auxiliary constant c = √(I_θ) uniquely fixes the scale a. After some calculus, we obtain

f*(t) = (sin(ct) + 1) / 2

on the interval [−π/2c, π/2c] (and defined uniquely, by monotonicity, as 0 or 1 outside this interval). Since the support of the derivative of this function is compact, this result is not independent of the sampling density p(x); if p(x) places any of its mass outside the interval [−π/2c, π/2c], then σ²_iid is always strictly greater than σ²_info (since ḟ, and therefore I_{θ0}(x), is zero outside this interval). This recapitulates a basic theme from the psychophysical literature comparing adaptive and nonadaptive techniques. When the scale
of the nonlinearity f is either unknown or smaller than the scale of the i.i.d. sampling density p(x), adaptive techniques are generally preferable. Second, a crude analysis shows that as the scale of the nonlinearity a shrinks, the ratio σ²_iid/σ²_info grows approximately as 1/a. This gives quantitative support to the intuition that the sharper the nonlinearity with respect to the scale of the sampling distribution p(x), the more we can expect the information-maximization strategy to help. In fact, in the limit as a → 0, samples from the model become perfectly deterministic (with the response curve f_{0,θ} changing discontinuously from 0 to 1 at θ), and the information-maximization strategy becomes infinitely more efficient than i.i.d. sampling. Information-maximal sampling is a version of the “twenty questions” game here, with each query x decreasing the entropy of p_N(θ) by one bit, which in turn leads to exponential convergence in N instead of the N^{−1/2} rate guaranteed in the smoothly varying f case.

3.2 Linear-Nonlinear Cascade Model. We now consider a model that has received growing attention from the neurophysiology community (see, e.g., Simoncelli, Paninski, Pillow, & Schwartz, 2004, for a recent review). The model is of cascade form, with a linear stage followed by a nonlinear stage: the input space X is a compact subset of d-dimensional Euclidean space (take X to be the unit sphere, for concreteness), and the firing rate of the model cell, given input x ∈ X, has the simple form

E(y|x, θ) = f(⟨θ, x⟩).

Here the linear filter θ is some unit vector in X*, the dual space of X (thus, Θ is isomorphic to X, as in the previous example), while the nonlinearity f is some nonconstant, nonnegative function on [−1, 1]. We assume that f is uniformly smooth, to satisfy the conditions of theorem 1; we also assume f is known, although, again, this can be relaxed.
The response space Y — the space of possible spike counts, given the stimulus x — can be taken to be some large, bounded set of the nonnegative integers. For simplicity, let the conditional probabilities p(y|x, θ) be parameterized uniquely by the mean firing rate f(⟨θ, x⟩); the most convenient model, as usual, is to assume that p(y|x, θ) is Poisson with mean f(⟨θ, x⟩). Finally, we assume that the sampling density p(x) is uniform on the unit sphere (this choice is natural for several reasons, mainly involving symmetry; see, e.g., Chichilnisky, 2001; Simoncelli et al., 2004), and that the prior p0(θ) is positive and continuous (and is therefore bounded above and away from zero by the compactness of Θ). The Fisher information for this model is easily calculated as

I_θ(x) = ( ḟ(⟨θ, x⟩)² / f(⟨θ, x⟩) ) P_{x,θ},
where ḟ is the usual derivative of the real function f and P_{x,θ} is the projection operator corresponding to x, restricted to the (d − 1)-dimensional tangent space to the unit sphere at θ. (We have assumed that the bounded set Y of allowed spike counts has been taken sufficiently large to ignore the deviations from exact Poisson behavior due to the finite spike count cutoff.) Theorem 1 now implies that

σ²_info = ( max_{t∈[−1,1]} ḟ(t)² g(t) / f(t) )⁻¹,

while

σ²_iid = ( ∫_{[−1,1]} dp(t) ḟ(t)² g(t) / f(t) )⁻¹,

where g(t) = 1 − t², p(t) denotes the one-dimensional marginal measure induced on the interval by the uniform measure p(x) on the unit sphere, and σ² in each of these two expressions multiplies (d − 1)I_{d−1}, with I_{d−1} denoting the (d − 1)-dimensional identity matrix.

Clearly, the arguments of section 3.1 apply here as well: the ratio σ²_iid/σ²_info grows roughly linearly in the inverse of the scale of the nonlinearity. The more interesting asymptotics here, though, are in d. This is because the unit sphere has a measure concentration property (Milman & Schechtman, 1986; Talagrand, 1995): as d → ∞, the measure p(t) becomes exponentially concentrated around 0. In fact, it is easy to show directly that in this limit, p(t) converges in distribution to the normal measure with mean zero and variance d⁻². The most surprising implication of this result is seen for nonlinearities f such that ḟ(0) = 0, f(0) > 0; we have in mind, for example, symmetric nonlinearities like those often used to model complex cells in visual cortex. For these nonlinearities,

σ²_info / σ²_iid = O(d⁻²):

that is, in this case, the information-maximization strategy becomes infinitely more efficient than the usual i.i.d. approach as the dimensionality of the spaces X and Θ grows.
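The concentration phenomenon behind this result is easy to see numerically. The sketch below (our own illustration; the symmetric nonlinearity f(t) = 1 + t², which satisfies ḟ(0) = 0 and f(0) > 0, is an assumed choice) draws x uniformly from the unit sphere and estimates how the ratio σ²_iid/σ²_info grows with d:

```python
import numpy as np

rng = np.random.default_rng(0)

def marginal_t(d, n=200_000):
    # t = <theta, x> for x uniform on the unit sphere in R^d:
    # by symmetry, take theta = e_1 and normalize gaussian draws
    x = rng.standard_normal((n, d))
    x /= np.linalg.norm(x, axis=1, keepdims=True)
    return x[:, 0]

def var_ratio(d, f=lambda t: 1 + t**2, fdot=lambda t: 2 * t):
    # sigma2_iid / sigma2_info = max_t J(t) / E_{p(t)}[J(t)], with J = f'(t)^2 g(t) / f(t)
    t = marginal_t(d)
    J = fdot(t)**2 * (1 - t**2) / f(t)
    tt = np.linspace(-1, 1, 10_001)
    Jmax = (fdot(tt)**2 * (1 - tt**2) / f(tt)).max()
    return Jmax / J.mean()

for d in (4, 16, 64):
    print(d, round(var_ratio(d), 2))   # the ratio grows with dimension
```

As d increases, the marginal p(t) piles up near t = 0, where ḟ vanishes for this symmetric f, so i.i.d. samples become progressively less informative relative to the information-maximizing choice.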
4 Illustrations

Next we give some illustrations of the behavior of the information-optimization strategy, as compared to the nonadaptive i.i.d. case.
Figure 1: Snapshot of behavior of information-maximizing and i.i.d. experimental designs. (a) True underlying conditional response probabilities given input x. Model space includes all translates of firing rate curve shown here. (b) Fisher information Iθ0 (x) as a function of x. (c) Mutual information between the response y and the underlying location parameter θ, as a function of x, after 100 samples. (d) Posterior distributions p N (θ ) after N = 100 samples. The asterisk indicates the location of true model parameter θ0 . The dashed lines give gaussian approximations to true observed posteriors, though the dashed curves are obscured by the quality of the fit.
4.1 One-Dimensional Example. For clarity, we start with a simple example, for which the stimulus and model spaces X and Θ are both one-dimensional and the outputs are again binary. We illustrate the model in Figure 1. The system responds positively with high probability when the stimulus x and model preference θ agree; this probability decays smoothly and symmetrically as the difference |x − θ| increases. (We could think of this response probability curve as a sensory neuron’s receptive field for some
one-dimensional stimulus, for example, or as the place field of a cell in the hippocampus of a rat constrained to run along a one-dimensional track.) The experimenter’s goal is to determine the true θ0, given the response curves p(y = 1|x, θ). We begin with no knowledge of the true θ0 except that 0 ≤ θ0 ≤ 1; thus, we take the prior p0(θ) to be uniform on [0, 1]. In the bottom panel of Figure 1, we show the results of an experiment in which we draw 100 input samples i.i.d. from the uniform distribution p(x) = 1 on [0, 1], and then compare to the results given 100 samples drawn adaptively, following the information-maximization strategy. After 100 samples, we find that the mutual information curve I(y; θ|x), as a function of the input x (see Figure 1c), closely resembles the Fisher information curve I_{θ0}(x) (see Figure 1b); in particular, the two curves reach their maxima for the same values of x, indicating that after 100 samples, the information-maximizing strategy is indeed sampling from the x that maximizes the Fisher information I_{θ0}(x), as predicted by theorem 1. Note that sampling from the x that maximizes the firing rate, x = θ0 = 0.2, asymptotically minimizes the information gain I(y; θ|x), as emphasized in section 1. Also as predicted, the posterior distributions p_N(θ) are quite well approximated as gaussian, with means near θ0 and with the posterior under the information-maximization strategy more concentrated near θ0 than in the i.i.d. x case. We look more quantitatively at the evolution of the posteriors in Figure 2. The top two panels show the posteriors p_N(θ) as a function of N, while the bottom three panels show the posterior mean, standard deviation, and probability mass in a small neighborhood of the true parameter θ0, respectively. In each case, we again see that the posterior under the information-maximization strategy converges more rapidly than under the i.i.d. strategy, as predicted.
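The experiment just described can be reproduced in a few lines. The sketch below is our own minimal reimplementation with an assumed bump-shaped response curve (the note does not specify the exact curve behind Figure 1): it maintains a discretized posterior over θ and, on each trial, either maximizes I(y; θ|x) by grid search or draws x uniformly:

```python
import numpy as np

rng = np.random.default_rng(1)

def p_resp(x, theta):
    # assumed smooth, symmetric "receptive field" response curve
    return 0.05 + 0.9 * np.exp(-(((x - theta) / 0.1) ** 2))

thetas = np.linspace(0, 1, 201)     # discretized parameter space
xs = np.linspace(0, 1, 201)         # candidate stimuli
theta0 = 0.2                        # true parameter, as in Figure 1

def entropy(q):
    q = np.clip(q, 1e-12, 1 - 1e-12)
    return -q * np.log(q) - (1 - q) * np.log(1 - q)

def mutual_info(post):
    # I(y; theta | x) for every candidate x, given the current posterior
    P = p_resp(xs[:, None], thetas[None, :])   # p(y = 1 | x, theta)
    return entropy(P @ post) - entropy(P) @ post

def run(N=100, adaptive=True):
    post = np.ones_like(thetas) / len(thetas)  # uniform prior on [0, 1]
    for _ in range(N):
        x = xs[np.argmax(mutual_info(post))] if adaptive else rng.uniform(0, 1)
        y = rng.random() < p_resp(x, theta0)
        post = post * (p_resp(x, thetas) if y else 1 - p_resp(x, thetas))
        post /= post.sum()
    return post

post_info, post_iid = run(adaptive=True), run(adaptive=False)
sd = lambda q: np.sqrt(max(q @ thetas**2 - (q @ thetas) ** 2, 0.0))
print(sd(post_info), sd(post_iid))  # the adaptive posterior is typically tighter
```

In runs of this sketch, the adaptive posterior concentrates around θ0 = 0.2 more quickly than the i.i.d. one, mirroring Figures 1 and 2.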
Moreover, the predicted standard deviations of the posterior density and of the posterior mean, σ_iid N^{−1/2} and σ_info N^{−1/2}, accurately match the true observed behavior.

4.2 A V1 Simple-Cell Example. Our second example is somewhat more realistic in that the neuron we are simulating responds to stimuli that have many degrees of freedom; that is, the parameter and input spaces Θ and X are multidimensional. We take what is perhaps the standard model of the response properties of a simple cell in primary visual cortex (a version of the cascade model discussed in the last section; Dayan & Abbott, 2001):

p(spike|x, θ) = f(⟨k_0, x⟩),

where the true receptive field k_0 is taken to be a Gabor function (the product of a two-dimensional sinusoid and a gaussian whose mean determines the location of the receptive field in space; see Figure 3, top left), and the monotonic nonlinear function f enforces the positivity of the firing rate and can also model the cell’s saturation properties (see Figure 3, bottom left). For simplicity, we assume the nonlinearity f and the spatial frequency of the Gabor θ0 = k_0 to be known; thus, the model space is three-dimensional (two dimensions for the location of the receptive field
and one for the orientation). As in the previous example, we start with no knowledge of the true parameter other than the fact that the center of the receptive field lies within the square shown in Figure 3, so our prior p0(k) is uniform over orientation and spatial location. The series of panels on the right of Figure 3 displays the evolution of the posteriors p_N(k) via the posterior mean ∫ k dp_N(k); as p_N(k) becomes concentrated around the true receptive field k_0, the image of the posterior mean resembles k_0 more and more closely. We took the stimuli x here to be Gabors of varying orientations and locations, but similar results are seen if white- or colored-noise stimuli are used instead (data not shown). As before, the information-optimizing strategy leads to more rapid convergence than i.i.d. sampling from x.

Figure 2: Evolution of posterior densities in model from Figure 1, as a function of trial number N. (a, b) Evolution of posteriors under information-optimizing and i.i.d. strategy, respectively. White level indicates height of probability density. Recall from Figure 1 that the true parameter is located at θ0 = 0.2. (c) Evolution of the posterior mean. The dotted line indicates true parameter location (θ0 = 0.2). The solid black and gray traces indicate the mean given information-optimizing and i.i.d. strategies, respectively. Dashed lines show predicted 95% confidence intervals, θ0 ± 2σ_info N^{−1/2} and θ0 ± 2σ_iid N^{−1/2}. (d) Evolution of posterior standard deviation. Solid traces are observed standard deviations; dashed traces are predicted, σ_info N^{−1/2} and σ_iid N^{−1/2}. (e) Evolution of posterior mass contained in a small neighborhood of true parameter, p_N([θ0 − 0.05, θ0 + 0.05]).

Figure 3: Evolution of posteriors in a simulated V1 simple cell experiment. (Left) True model: the simulated simple cell responds to the stimulus image x according to p(spike|x, k_0) = f(⟨k_0, x⟩), with ⟨k_0, x⟩ indicating the dot product with the Gabor kernel k_0 shown in the top-left panel and f the nonlinear rectification function shown in the bottom-left panel. (Right) Evolution of posteriors as a function of trial number N. Each panel shows the posterior mean ∫ k dp_N(k).
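For completeness, here is a minimal sketch of the simulated cell itself (our own illustration; the Gabor parameter values and the logistic rectifier standing in for f are assumptions, not taken from the note):

```python
import numpy as np

def gabor(size=16, x0=0.0, y0=0.0, ori=0.0, freq=2.0, sigma=0.2):
    # product of a 2-D sinusoid and a gaussian envelope (illustrative parameters)
    g = np.linspace(-0.5, 0.5, size)
    X, Y = np.meshgrid(g, g)
    Xr = (X - x0) * np.cos(ori) + (Y - y0) * np.sin(ori)
    env = np.exp(-((X - x0) ** 2 + (Y - y0) ** 2) / (2 * sigma**2))
    return env * np.cos(2 * np.pi * freq * Xr)

def p_spike(x, k, f=lambda u: 1 / (1 + np.exp(-8 * u))):
    # cascade model: p(spike | x) = f(<k, x>), with a logistic standing in for f
    return f(np.vdot(k, x))

k0 = gabor(ori=0.3, x0=0.1, y0=-0.1)    # "true" receptive field
matched = k0 / np.linalg.norm(k0)       # optimally matched Gabor stimulus
ortho = gabor(ori=0.3 + np.pi / 2)      # orthogonally oriented stimulus
ortho /= np.linalg.norm(ortho)

print(p_spike(matched, k0), p_spike(ortho, k0))  # the matched stimulus drives the cell harder
```

Wrapping this forward model in the grid-search posterior update of section 4.1 (over location and orientation) reproduces the qualitative behavior of Figure 3.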
5 Negative Examples

Our next two examples are more negative and perhaps more surprising: they show how the information-maximization strategy can fail, in a certain sense, if the conditions of the consistency lemma are not met. (Note that as emphasized above, these consistency conditions are fairly weak; therefore, the fact that they fail in the following examples implies that these examples might be interesting from a mathematical point of view but might have less practical negative relevance for psychophysical or neurophysiological situations.) In each case, the method can be fixed using ad hoc methods; it is unclear at present whether a generally applicable modification of the basic information-maximization strategy exists.
5.1 Two-Threshold Model. Let Θ be multidimensional, with coordinates that are “independent” in the sense that the responses of the model given one coordinate do not depend on the value of the other coordinate, and assume the expected information obtained from one coordinate remains bounded strictly away from the expected information obtained from one of the other coordinates. For instance, consider the following binary model:

p(1|x, θ) = .5 for −1 < x ≤ θ_{−1};  f_{−1} for θ_{−1} < x ≤ 0;  .5 for 0 < x ≤ θ_1;  f_1 for θ_1 < x ≤ 1,

where 0 ≤ f_{−1}, f_1 ≤ 1 are known, with |f_{−1} − .5| > |f_1 − .5|, and −1 < θ_{−1} < 0 and 0 < θ_1 < 1 are the parameters we want to learn. Let the initial prior p0(θ) be absolutely continuous with respect to the Lebesgue measure; this implies that all posteriors p_N will have the same property. Then, using the inverse cumulative probability transform and the fact that mutual information is invariant with respect to invertible mappings, it is easy to show that the maximal information we can obtain by sampling from the left is strictly greater than the maximal information obtainable from the right, uniformly in N. Thus, the information-maximization strategy will sample from the left side forever, leading to a linear information growth rate (and easily proven consistency) for the left parameter and nonconvergence on the right. Compare the performance of the usual i.i.d. approach for choosing x (using any Lebesgue-dominating measure on the parameter space), which leads to the standard algebraic convergence rate for both parameters (i.e., is strongly consistent in posterior probability). Note that this kind of inconsistency problem does not occur in the case of sufficiently smooth p(y|x, θ), by our main theorem. Thus, one way of avoiding this problem would be to fix a finite sampling scale for each coordinate (i.e., discretizing). Below this scale, no information can be extracted; therefore, when the algorithm hits this “floor” for one coordinate, it will switch to the other. However, the next example shows that the lack of consistency is not necessarily tied to the discontinuous nature of the conditional densities.

5.2 White Noise Models. We present two models of a slightly different flavor: the basic mechanism of inconsistency is the same in each case. The samples x take values on the positive integers. The models live on the positive integers as well: θ is given by a standard discrete (1) normal and (2) binary white noise process (that is, p(θ) is generated by an infinite
sequence of standard normals and independent fair coins, respectively). The conditionals are defined as follows. For the first model, the observations y are gaussian-contaminated versions of θ(x), that is, y ∼ N(θ(x), 1). For the second model, let y be drawn randomly from q_{θ(x)}, where q_0 and q_1 are nonidentical measures on some arbitrary space. Then it is not hard to show, for either model, that an experimenter using the information-maximization strategy will never sample from any x infinitely often. As soon as we learn something about θ_i (by sampling from x_i), θ_{i+1} will become more interesting, and we will begin to sample from x_{i+1} instead. (For the second model, in fact, if the densities of q_0 and q_1 with respect to some dominating measure are unequal almost surely, and I(y, θ(1)) reaches its unique maximum in p(θ(1)) at the midpoint p(θ(1) = 0) = p(θ(1) = 1), then we will sample from each x just once, almost surely.) This again implies a lack of consistency of the posterior (although, as above, we have a linear growth of information). The basic idea is that there will always be a more informative part of the sample space X to measure from, and the experimenter will never spend enough time in one place x to sufficiently characterize θ(x). This emphasizes the necessity of something like the compactness condition we imposed in the statement of lemma 1. As in the last section, the standard i.i.d. approach (using any measure p(x) that does not assign zero mass to any of the integers) is consistent here. Note that in contrast to the last example, the smoothness of the conditionals p(y|x, θ) (in the gaussian model) does not rescue consistency. Nor is the inconsistency due to some pathology of differential entropy (the measures q_i can be discrete, even binary). The “floor” trick suggested for the last example can be modified here by sequentially restricting our search for optimal x over compacta that are allowed to grow slowly toward infinity.
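The coordinate-drifting behavior of the gaussian model is easy to simulate, since every posterior stays gaussian: observing y ∼ N(θ(x), 1) maps the posterior variance v of θ(x) to v/(1 + v), and the expected information gain from sampling x is ½ log(1 + v(x)). The sketch below (our own, truncating the infinite set of integers to a finite range) shows that greedy information maximization never revisits a coordinate:

```python
import numpy as np

K, T = 50, 20                   # truncate the positive integers to x = 1..K; take T samples
v = np.ones(K)                  # prior variance of theta(x) is 1 for every coordinate

samples = []
for _ in range(T):
    gain = 0.5 * np.log(1 + v)  # expected entropy reduction from one sample at each x
    x = int(np.argmax(gain))    # greedy information maximization
    samples.append(x + 1)
    v[x] = v[x] / (1 + v[x])    # gaussian posterior-variance update

print(samples)                  # [1, 2, 3, ..., 20]: no coordinate is ever revisited
```

Because a fresh coordinate always has the prior variance 1, and hence maximal expected gain, the sampler drifts rightward forever and never pins down any single θ(x).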
More generally, consistency can probably be salvaged by not sampling exclusively from information-maximizing points (perhaps by sampling “passively” with a frequency that decreases as N → ∞; this would restore consistency in many cases without a sacrifice in the asymptotic information growth rate). We leave the general formulation of such a result to the reader.
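Returning to the two-threshold example of section 5.1, the greedy lock-in is easy to see in simulation. The sketch below (our own, with assumed values f_{−1} = .95, f_1 = .6 and true thresholds −0.4 and 0.7) discretizes each threshold onto a grid; as noted in the text, discretization introduces an information “floor”, so the sampler works the more informative left coordinate first and only later switches right:

```python
import numpy as np

rng = np.random.default_rng(2)
f_m1, f_1 = 0.95, 0.6                 # |f_m1 - .5| > |f_1 - .5|: the left side is more informative
th_true = {-1: -0.4, +1: 0.7}         # true thresholds (illustrative values)
grids = {-1: np.linspace(-0.99, -0.01, 99), +1: np.linspace(0.01, 0.99, 99)}
posts = {s: np.ones(99) / 99 for s in (-1, +1)}   # the coordinates have independent posteriors

def probs(side, x):
    # p(y = 1 | x, theta_side) over the grid of candidate thresholds for that side
    th = grids[side]
    return np.where(x <= th, 0.5, f_m1 if side == -1 else f_1)

def entropy(q):
    q = np.clip(q, 1e-12, 1 - 1e-12)
    return -q * np.log(q) - (1 - q) * np.log(1 - q)

def best_x(side):
    # stimulus in this half-interval maximizing I(y; theta_side | x)
    mi = [entropy(probs(side, x) @ posts[side]) - entropy(probs(side, x)) @ posts[side]
          for x in grids[side]]
    j = int(np.argmax(mi))
    return grids[side][j], mi[j]

choices = []
for _ in range(200):
    (xl, ml), (xr, mr) = best_x(-1), best_x(+1)
    side, x = (-1, xl) if ml > mr else (+1, xr)
    choices.append(side)
    p_true = 0.5 if x <= th_true[side] else (f_m1 if side == -1 else f_1)
    y = rng.random() < p_true
    like = probs(side, x) if y else 1 - probs(side, x)
    posts[side] = posts[side] * like
    posts[side] /= posts[side].sum()

print(choices.count(-1), choices.count(+1))   # samples spent on the left and right coordinates
```

In runs of this sketch, the early trials all land on the left coordinate; only once its grid floor is reached does the sampler switch to the right, illustrating both the lock-in and the discretization fix discussed above.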
6 Directions

We have presented a rigorous theoretical framework for adaptive design of experiments using the information-theoretic objective function (see equation 1.1). Most important, we have offered some asymptotic results that clarify the effectiveness of this information-maximizing strategy; in addition, we expect that our results should find applications in approximate computational schemes for optimizing stimulus choice during this type of online experiment. For example, our theorem 1 might suggest the use of a
mixture-of-gaussians representation as an efficient approximation for the posteriors p_N(θ) (Deignan, Meckl, Franchek, Abraham, & Jaliwala, 2000). We briefly describe a few more open research directions.

6.1 Nuisance Parameters and Hypothesis Testing. Perhaps the most obvious such open question concerns the use of noninformation-theoretic objective functions. Concerning the question of which objective function is “best” in general, our results should be useful in clarifying the exact form of the asymptotic covariance matrix σ²_info: for the information-maximization case, this matrix asymptotically minimizes the log-determinant function on the class of feasible asymptotic covariance matrices (co(I_{θ0}(x)))⁻¹. However, we are typically interested in some parameters more than others; as has been noted elsewhere (Mackay, 1992), mutual information as an objective function (and, by extension, its log-determinant asymptotic form) leaves little flexibility for focusing our resources on these more interesting parameters. Alternative objective functions include weighted sums of entropies or Bayes mean-square errors. It turns out that many of our results apply with only modest changes if the experiment is instead designed to optimize these alternative objective functions: in this case, the results in sections 3 and 4.1 remain completely unchanged, while the statement of our main theorem requires only slight changes in the asymptotic covariance formula (see appendix A). The task of choosing a good objective function on input distributions p(x) in the multidimensional θ case is thus reduced asymptotically to the simpler problem of choosing a suitable objective function on covariance matrices; in the one-dimensional θ case, the asymptotic variance does not depend on whether we choose to minimize entropy or variance. An alternative approach that has received less attention involves mapping irrelevant “nuisance” parameters out of Θ.
In a sense, this is a special case of the weighted sum of entropies idea, for which some of the weights are set to zero, but can be defined in a slightly more general setting. We define our new objective function in a simple way as I({x, y}; T(θ)), where T is a surjective map from Θ to some new, reduced parameter space (obviously this new definition corresponds to the original equation 1.1 if T is bijective). This approach thus integrates over nuisance parameters in a completely direct way. Clearly, much of our asymptotic theory will go through under some continuity condition on T, as long as the assumptions on the maximal divergence sup_x DKL(θ0; θ|x) and on the positivity of the Fisher information matrices are unharmed. It is worth addressing an extreme specialization of the above idea: the case for which T maps Θ to the two points {0, 1} corresponds to compound hypothesis testing. Again, as long as the gap on sup_x DKL(θ0; θ|x) is respected
by T, consistency will go through, but asymptotic normality stops making sense: the more relevant concept now becomes the large deviations behavior of p_N(0) and p_N(1), as described by the Chernoff information (Cover & Thomas, 1991; Dembo & Zeitouni, 1993). We have not addressed the optimal rates of convergence in this case, but note only that even simple hypothesis testing (i.e., the case that Θ = {0, 1}) is aided by adaptive stimulus design, in the sense that a given x that optimizes I(y, θ|x) for one value of p_N(0) is not necessarily optimal for all other p_N(0); thus, as before, the optimal sampling strategy varies in general with N.

6.2 “Batch Mode” and Stimulus Dependencies. Perhaps our strongest assumption here is that the experimenter will be able to freely choose the stimuli on each trial. This might be inaccurate for a number of reasons: for example, computational demands might require that experiments be run in batch mode, with stimulus optimization taking place not after every trial, but perhaps only after each batch of k stimuli, all chosen according to some fixed distribution p(x). Another common situation involves stimuli that vary temporally, for which the system is commonly modeled as responding not just to a given stimulus x(t), but also to some or all of its time-translates x(t − τ). Finally, if there is some cost C(x0, x1) associated with changing the state of the observational apparatus from the current state x0 to x1, the experimenter may wish to optimize an objective function that incorporates this cost: I(y; θ|x1) − C(x0, x1), for example. Each of these situations is clearly ripe for further study. Here we restrict ourselves to the first setting and give a simple conjecture, based on the asymptotic results presented above and inspired by results like those of Berger, Bernardo, and Mendoza (1989), Clarke and Barron (1994), and Scholl (1998).
First, we state more precisely the optimization problem inherent in designing a batch experiment: we wish to choose some sequence {x_i}_{1≤i≤k} to maximize I({x_i, y_i}_{1≤i≤k}; θ). The main difference here is that {x_i}_{1≤i≤k} must be chosen nonadaptively, that is, without sequential knowledge of the responses {y_j}_{j<i}.

This follows by exactly the usual proof (Schwartz, 1967; van der Vaart, 1998; Barron et al., 1999). The key step is to approximate

log f_N(θ) ≈ −Σ_i DKL(θ0; θ|x_i)

by a uniform law of large numbers argument (van der Vaart, 1998) (the term on the right is the expectation of that on the left, where the expectation is taken under the true parameter θ0). Once this is done, the statement is proven by using the fact that DKL(θ0; θ|x) → 0
as θ → θ0, uniformly in x (by assumption 2 of lemma 1), and that p0(U) > 0 for any neighborhood U (assumption 3). The next step in the usual proof is to demonstrate that ∫_{Θ∩U^c} dp0(θ) f_N(θ) decreases at an exponential rate, that is,

lim sup_N (1/N) log ∫_{Θ∩U^c} dp0(θ) f_N(θ) < −ε(U) < 0

almost surely for some positive ε that depends on U. Unfortunately, this exponential decay of p_N(Θ ∩ U^c) does not necessarily hold in the information-maximization case. An example of this nonexponential decay is given after the proof. To deal with this, first note that the set of functions DKL(θ0; θ|x) is compact in the sup-norm topology on functions on Θ. This follows from the Arzela-Ascoli theorem (Rudin, 1973), given the equicontinuity of the log likelihoods log p(y|x, θ) in θ and the assumption that the set of these likelihoods is closed in the sup-norm topology. (A similar compactness argument guarantees that the maximum of I(y, θ|x) in x is always attained in X.) Thus, the full set of stimuli X may be replaced by a finite subset X' = {x_j}. We may exclude from consideration the set Z_{>δ} = {θ : DKL(θ0; θ|x) > δ ∀ x ∈ X'} for any δ > 0, since the posterior mass of such a set decays exponentially, by the standard proof. This leaves us with the compact subset

Z_0 = {θ : min_{x∈X'} DKL(θ0; θ|x) = 0},

the set of θ where DKL(θ0; θ|x) = 0 for at least one x ∈ X'. Clearly, θ0 ∈ Z_0. If Z_0 = {θ0}, the proof is complete; thus, assume otherwise. To complete the proof, we just need to show that the information-maximizing sampler does not asymptotically “ignore” any set Z within Z_0 ∩ Θ ∩ U^c such that p0(Z_ε) > 0 (with Z_ε an arbitrary ε-neighborhood of Z); that is, it does not sample so frequently from x with DKL(θ0; θ|x) = 0 for θ ∈ Z that q_N(Z) ≡ ∫_Z e^{−Σ_i DKL(θ0; θ|x_i)} dp0(θ) does not decrease more quickly
than q_N(U) = ∫_U e^{−Σ_i DKL(θ0; θ|x_i)} dp0(θ). The main idea preventing this is the smoothness condition on the log likelihood functions log p(y|x, θ); this condition guarantees that the information gain associated with increasing the concentration of p_N at θ0 (the only point at which DKL(θ0; θ|x) = 0 for all x, by assumption) will decrease to zero, roughly as

log( q_N(U) / q_{N+1}(U) ) ≈ ( q_N(U) − q_{N+1}(U) ) / q_N(U).

Meanwhile, the gain associated with testing between θ0 ∈ U and the alternative hypothesis will remain large whenever the posterior mass on Z remains comparable to that on U, falling as q_N(Z)/q_N(U). Thus, the information-maximizing sampler will prefer to increase the concentration at θ0 over attempting to distinguish between the hypotheses θ0 ∈ U and θ0 ∈ Z accordingly as q_N(Z) < q_N(U) − q_{N+1}(U) or otherwise, respectively. The definition of the information-maximization strategy now implies that either q_N(Z) decays exponentially (which would again complete the proof) or, alternatively,

q_N(Z) / q_N(U) ∼ ( q_N(U) − q_{N+1}(U) ) / q_N(U),

that is, q_N(U) − q_{N+1}(U) ∼ q_N(Z). Since the posterior mass on Z falls exponentially with the number of samples devoted to this hypothesis test and the posterior mass on U cannot fall exponentially (as discussed above), the posterior mass on Z must therefore decrease to zero relative to q_N(U), because q_N(U) − q_{N+1}(U) = o(q_N(U)) for any sequence q_N(U) that decays at a subexponential rate. The above logic is further illustrated in the example below. We should also note that a similar result can be stated in the case that θ0 ∉ Θ, that is, when the data are not generated by a member of the hypothesized parameter space. Here, as usual, the posterior may be approximated as

p_N(θ) ∼ e^{−Σ_i DKL(θ0; θ|x_i)} p0(θ).
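The key identity underlying these approximations — that the per-trial expectation of the log likelihood ratio under θ0 is −DKL(θ0; θ|x) — can be spot-checked numerically; the sketch below (our own, with arbitrary Bernoulli conditionals) does so by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(3)

p0, p = 0.8, 0.55          # p(y = 1 | x, theta0) and p(y = 1 | x, theta): arbitrary choices
dkl = p0 * np.log(p0 / p) + (1 - p0) * np.log((1 - p0) / (1 - p))

y = rng.random(1_000_000) < p0                    # responses generated under the true model
log_lr = np.where(y, np.log(p / p0), np.log((1 - p) / (1 - p0)))

print(log_lr.mean(), -dkl)                        # the empirical mean matches -D_KL(theta0; theta | x)
```

Summing this per-trial quantity over the sampled x_i gives exactly the exponent appearing in the approximation to p_N(θ) above.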
In the i.i.d. setting, this posterior will asymptotically concentrate around the θ that are closest to the true θ0 in the sense of average DKL distance; however, in the information-maximization case, this notion of closeness to the true θ0 depends strongly on the stimuli x, and it is not clear that e^{−Σ_i DKL(θ0; θ|x_i)} will even have a well-defined limit in general. Thus, to generalize lemma 1 to this out-of-Θ case, we would have to impose further conditions on the functions DKL(θ0; θ|x). For example, the above proof holds if we stipulate some fixed “closest” element θ* such that θ* is in the set of minimizers of DKL(θ0; θ|x) for all x and is the only such member of Θ (just as in the setting
of lemma 2, θ0 is the unique member of Θ that minimizes DKL(θ0; θ|x) for all x). In this case, p_N will asymptotically concentrate on θ*.

Figure 4: An example of a model in which the posterior mass off neighborhoods of the true parameter θ0 does not decay exponentially. θ0 = 0 here, as marked by an asterisk.

A.2 An Example of Subexponential Decay. As discussed above, one major difference between the asymptotic behavior of the posteriors in the information-maximization and i.i.d. sampling cases is that the posterior mass p_N(Θ ∩ U^c) generically decays exponentially in the i.i.d. setting (where U, again, is any neighborhood of the true parameter θ0); however, in the information-maximization case, this exponential decay does not necessarily hold. We give a simple example of this phenomenon here. We choose both X and Y to be binary; the conditional distributions p(y = 1|x_1, θ) and p(y = 1|x_2, θ) are shown in Figure 4. The main important features to note are that p(y|x_1, θ0) = p(y|x_1, θ = 1), and p(y|x_2, θ0) = p(y|x_2, θ ∈ U), with U a sufficiently small neighborhood of θ0 = 0. Thus, x_1 cannot distinguish between θ = 0 and 1, and x_2 gives no information when p(θ) is sufficiently concentrated about θ0. This in turn implies that asymptotically, the information-maximization strategy will be to sample preferentially from x_1, as indicated by theorem 1. However, if all our samples are drawn from x_1, then significant posterior mass will remain on θ = 1. More precisely, the posterior p_N(θ) may be asymptotically approximated by a mixture of gaussians, one with mean at θ = 0 and the other with mean at θ = 1. (This approximation will hold asymptotically, by the usual argument, whenever the prior p0(θ) has a positive, continuous density with respect to the Lebesgue measure on the interval shown.) Both gaussians have variance of order 1/n, where n = n(N) is the number of samples from x_1 in the first
Asymptotic Theory of Information-Theoretic Experimental Design
N samples; the asymptotic ratio of masses between the two gaussians, on the other hand, behaves as e^{−c(N−n)}, with c = DKL(θ0; θ|x2). This implies that I(y; θ|x1) scales as 1/n, while I(y; θ|x2) scales as e^{−c(N−n)}. This, in turn, means that n(N) satisfies the scaling n(N)^{−1} ∼ e^{−c(N−n(N))}, leading to the conclusion that N − n(N), the number of samples from x2, grows sublinearly in N, and therefore that the posterior mass off neighborhoods of the true parameter θ0 does not decay exponentially in N. Conversely, it is easy to demonstrate exponential decay for any i.i.d. sampling strategy that places positive mass p(x) on both x1 and x2.

A.3 Asymptotic Normality. The proof of asymptotic normality is fairly standard and is therefore omitted (see, e.g., Schervish, 1995; van der Vaart, 1998). The only new part is the computation of the asymptotic variance σ_N^2. The classical result tells us that

N σ_N^2 → [ (1/N) Σ_{i=1}^N I_{θ0}(x_i) ]^{−1}.
To obtain our result, we need to understand the tail behavior of this sum. We proceed by analyzing the dynamical system

A_{N−1} → A_N = (1/N) ( (N − 1) A_{N−1} + B_N ),
with A_N denoting the negative Hessian of a suitable gaussian approximation to the posterior likelihood after N trials at θ_N, its maximizer; B_N denotes the negative Hessian of log p(y_N|x_N, θ) at θ_N. This map gives a rough (but asymptotically accurate) approximation of the effect of the Nth sample on the (near-gaussian) posterior, after a suitable variance-stabilizing transformation. It is clear that the range of this map is asymptotically contained in co(I_{θ0}(x)), which we have defined as the closure of all convex combinations of the available Fisher information matrices, I_{θ0}(x), with x ranging over the full sample space X. It is equally clear, when we examine the above dynamical system over multiple trials (thus, roughly, averaging over multiple B_N), that the information-maximization strategy is asymptotically doing something like gradient ascent on log |A_N| (the asymptotic negative entropy of the gaussian posterior, up to an irrelevant scale factor), with the allowed ascent directions taking values within co(I_{θ_N}(x)), which in turn, by the consistency lemma and the continuity of I_θ(x), converges to co(I_{θ0}(x)). Our asymptotic variance formula now follows from the strict concavity of the function log |C| in C, where C ranges over the symmetric, positive semidefinite (covariance)
matrices (Cover & Thomas, 1991; Lewis, 1996), the fact that |C| and log |C| have identical maximizers, and the compactness and convexity of co(I_{θ0}(x)). It is worth noting that this proof goes through essentially unchanged if we sample to optimize something like a weighted mean-square error instead of mutual information. In this case, the sampler will asymptotically attempt to minimize a matrix function of the form tr(V^t σ_N^2 V) ≈ tr(V^t A_N^{−1} V), where V is some weight matrix. Since the above function is convex in A_N (Lewis, 1996), the only change we need to make is in the final form of the asymptotic variance formula: for this problem,

N σ_N^2 → [ argmin_{C ∈ co(I_{θ0}(x))} tr(V^t C^{−1} V) ]^{−1},
where once again the optimum is well defined (and unique when V is of full rank). When Θ is one-dimensional, these two approaches clearly lead to the same asymptotic result.

Appendix B: Kuhn-Tucker Optimality for Bayesian D-Optimal Design

We briefly describe the necessary and sufficient conditions for optimality in the batch experiment setting discussed in section 6.2. We follow Cover and Thomas (1991). We are trying to maximize the function 6.1, which is concave in p(x) (implying that the maximizers we seek form a nonempty convex set). We need to compute the derivative of this function along convex lines through p(x), as follows:

(∂/∂t) V(p, q)|_{t=0} ≡ (∂/∂t) E_θ log | ∫ I_θ(x) d(t q(x) + (1 − t) p(x)) | |_{t=0}
= E_θ (∂/∂t) log | ∫ I_θ(x) d(t q(x) + (1 − t) p(x)) | |_{t=0}
= E_θ (∂/∂t) log | (1 − t) I + t ( ∫ I_θ(x) dq(x) ) ( ∫ I_θ(x) dp(x) )^{−1} | |_{t=0}
= E_θ [ tr( ( ∫ I_θ(x) dq(x) ) ( ∫ I_θ(x) dp(x) )^{−1} ) − dim ].

The interchange of derivative and expectation can be justified by dominated convergence under the conditions of theorem 1.
In the discrete X case, Kuhn-Tucker now implies that for optimal p(x) (and only optimal p(x)),

(1/dim) E_θ tr[ I_θ(x) ( ∫ I_θ(x) dp(x) )^{−1} ]  = 1 if p(x) > 0,
                                                  ≤ 1 if p(x) = 0.
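As a quick numeric illustration of this optimality condition, consider a toy discrete design problem with scalar (dim = 1) Fisher informations; the three information values below are invented for the sketch, not taken from the paper:

```python
import numpy as np

# Toy check of the Kuhn-Tucker condition for D-optimal design with scalar
# (dim = 1) Fisher informations. The values are hypothetical. With a
# one-dimensional parameter, maximizing log of the averaged information
# puts all design mass on the most informative x.
fisher = np.array([0.3, 1.7, 0.9])   # hypothetical I(x) for three design points
p = np.zeros(3)
p[np.argmax(fisher)] = 1.0           # candidate optimal design

avg_info = fisher @ p                # plays the role of ∫ I(x) dp(x)
ratio = fisher / avg_info            # (1/dim) tr[ I(x) (∫ I dp)^{-1} ], dim = 1

for x in range(3):
    if p[x] > 0:
        assert abs(ratio[x] - 1.0) < 1e-12   # = 1 on the support
    else:
        assert ratio[x] <= 1.0               # <= 1 off the support
print("KKT ratios:", ratio)
```

With one-dimensional θ the ratio equals 1 exactly on the support of the design and falls below 1 off it, as the condition requires.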
Similar results can be derived for more general X by the usual approximation techniques. (For further discussion, see, e.g., Bell & Cover, 1980; Cover & Thomas, 1991; Clyde & Chaloner, 1996.)

Acknowledgments

We thank E. Simoncelli, C. Machens, and D. Pelli for helpful conversations. This work was partially supported by a predoctoral fellowship from HHMI and by funding from the Gatsby Charitable Trust. A brief account of this work appeared in the conference proceedings of the 16th Annual NIPS meeting, Vancouver, B.C., 2003.

References

Axelrod, S., Fine, S., Gilad-Bachrach, R., Mendelson, S., & Tishby, N. (2001). The information of observations and application for active learning with uncertainty (Tech. Rep.). Jerusalem: Leibniz Center, Hebrew University. Available online: citeseer.nj.nec.com/axelrod01information.html. Barron, A., Schervish, M., & Wasserman, L. (1999). The consistency of posterior distributions in nonparametric problems. Annals of Statistics, 27, 536–561. Bell, R., & Cover, T. (1980). Competitive optimality of logarithmic investment. Mathematics of Operations Research, 5, 161–166. Berger, J., Bernardo, J., & Mendoza, M. (1989). On priors that maximize expected information. In J. Klein & H. J. Lee (Eds.), Recent developments of statistics and its applications (pp. 1–20). Seoul: Freedom Academy. Chaloner, K., & Verdinelli, I. (1995). Bayesian experimental design: A review. Statistical Science, 10, 273–304. Chichilnisky, E. (2001). A simple white noise analysis of neuronal light responses. Network: Computation in Neural Systems, 12, 199–213. Clarke, B., & Barron, A. (1990). Information-theoretic asymptotics of Bayes methods. IEEE Transactions on Information Theory, 36, 453–471. Clarke, B., & Barron, A. (1994). Jeffreys’ prior is asymptotically least favorable under entropy risk. Journal of Statistical Planning Inference, 41, 37–60. Clyde, M., & Chaloner, K. (1996). 
The equivalence of constrained and weighted designs in multiple objective design problems. Journal of the American Statistical Association, 91, 1236–1244. Cohn, D., Ghahramani, Z., & Jordan, M. (1996). Active learning with statistical models. Journal of Artificial Intelligence Research, 4, 129–145. Cover, T., & Thomas, J. (1991). Elements of information theory. New York: Wiley.
Dayan, P., & Abbott, L. (2001). Theoretical neuroscience. Cambridge, MA: MIT Press. Deignan, P., Meckl, P., Franchek, M., Abraham, J., & Jaliwala, S. (2000). Using mutual information to pre-process input data for a virtual sensor. Paper presented at the American Control Conference 2000, Chicago. Dembo, A., & Zeitouni, O. (1993). Large deviations techniques and applications. New York: Springer. Denzler, J., & Brown, C. (2000). Optimal selection of camera parameters for state estimation of static systems: An information theoretic approach. (University of Rochester Tech. Rep. 732). Rochester, NY: University of Rochester. Fedorov, V. (1972). Theory of optimal experiments. New York: Academic Press. Foldiak, P. (2001). Stimulus optimisation in primary visual cortex. Neurocomputing, 38–40, 1217–1222. Freund, Y., Seung, H. S., Shamir, E., & Tishby, N. (1997). Selective sampling using the query by committee algorithm. Machine Learning, 28(2–3), 133–168. Kontsevich, L., & Tyler, C. (1999). Bayesian adaptive estimation of psychometric slope and threshold. Vision Research, 39, 2729–2737. Lee, T., & Yu, S. (1998). An information-theoretic framework for understanding saccadic behaviors. In S. A. Solla, T. K. Leen, & K.-R. Muller (Eds.), Advances in neural information processing, 12. Cambridge, MA: MIT Press. Lewis, A. (1996). Convex analysis on the Hermitian matrices. SIAM Journal on Optimization, 6, 164–177. Lindley, D. (1956). On a measure of information provided by an experiment. Annals of Mathematical Statistics, 29, 986–1005. Luttrell, S. (1985). The use of transinformation in the design of data sampling schemes for inverse problems. Inverse Problems, 1, 199–218. Machens, C. (2002). Adaptive sampling by information maximization. Physical Review Letters, 88, 228104–228107. Mackay, D. (1992). Information-based objective functions for active data selection. Neural Computation, 4, 589–603. Mascaro, M., & Bradley, D. (2002). 
Optimized neuronal tuning algorithm for multichannel recording. Unpublished abstract. Available online: http://www. compscipreprints.com/. Milman, V., & Schechtman, G. (1986). Asymptotic theory of finite dimensional normed spaces. Berlin: Springer-Verlag. Nelken, I., Prut, Y., Vaadia, E., & Abeles, M. (1994). In search of the best stimulus: An optimization procedure for finding efficient stimuli in the cat auditory cortex. Hearing Research, 72, 237–253. Nelson, J., & Movellan, J. (2000). Active inference in concept learning. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing, 13. Cambridge, MA: MIT Press. Parmigiani, G. (1998). Designing observation times for interval censored data. Sankhya A, 60, 446–458. Parmigiani, G., & Berry, D. (1994). Applications of Lindley information measure to the design of clinical experiments. In A. F. M. Smith & P. Freeman (Eds.), Aspects of uncertainty: A tribute to D. V. Lindley (pp. 333–352). New York: Wiley. Pelli, D. (1987). The ideal psychometric procedure. Investigative Ophthalmology and Visual Science (Suppl.), 28, 366.
Rudin, W. (1973). Functional analysis. New York: McGraw-Hill. Sahani, M. (1997). Interactively exploring a neural code by active learning. Poster session presented at NIC97 meeting, Snowbird, Utah. Available online: http://www.gatsby.ucl.ac.uk/∼maneesh/conferences/nic97/poster/home.html. Schervish, M. (1995). Theory of statistics. New York: Springer-Verlag. Scholl, H. R. (1998, June). Shannon optimal priors on i.i.d. statistical experiments converge weakly to Jeffreys’ prior. Test, 7(1). Available online: citeseer.nj.nec.com/104699.html. Schwartz, L. (1967). On Bayes procedures. Z. Wahrsch. Verw. Gebiete, 4, 10–26. Simoncelli, E., Paninski, L., Pillow, J., & Schwartz, O. (2004). Characterization of neural responses with stochastic stimuli. In M. Gazzaniga (Ed.), The cognitive neurosciences (3rd ed.). Cambridge, MA: MIT Press. Sollich, P. (1996). Learning from minimum entropy queries in a large committee machine. Physical Review E, 53, R2060–R2063. Talagrand, M. (1995). Concentration of measure and isoperimetric inequalities in product spaces. Publ. Math. IHES, 81, 73–205. Tzanakou, E., Michalak, R., & Harth, E. (1979). The alopex process: Visual receptive fields by response feedback. Biological Cybernetics, 35, 161–174. van der Vaart, A. (1998). Asymptotic statistics. Cambridge: Cambridge University Press. Watson, A., & Fitzhugh, A. (1990). The method of constant stimuli is inefficient. Perception and Psychophysics, 47, 87–91. Watson, A., & Pelli, D. (1983). QUEST: A Bayesian adaptive psychophysical method. Perception and Psychophysics, 33, 113–120.
Received June 2, 2003; accepted January 5, 2005.
LETTER
Communicated by Liam Paninski
Maximum Likelihood Set for Estimating a Probability Mass Function

Bruno M. Jedynak
[email protected] Département de Mathématiques, Université des Sciences et Technologies de Lille, France, and Department of Applied Mathematics and Center for Imaging Science, Johns Hopkins University, Baltimore, MD 21218, U.S.A.
Sanjeev Khudanpur
[email protected] Department of Electrical and Computer Engineering, Johns Hopkins University, Baltimore, MD 21218, U.S.A.
Neural Computation 17, 1508–1530 (2005)

We propose a new method for estimating the probability mass function (pmf) of a discrete and finite random variable from a small sample. We focus on the observed counts—the number of times each value appears in the sample—and define the maximum likelihood set (MLS) as the set of pmfs that put more mass on the observed counts than on any other set of counts possible for the same sample size. We characterize the MLS in detail in this article. We show that the MLS is a diamond-shaped subset of the probability simplex in [0, 1]^k bounded by at most k × (k − 1) hyperplanes, where k is the number of possible values of the random variable. The MLS always contains the empirical distribution, as well as a family of Bayesian estimators based on a Dirichlet prior, particularly the well-known Laplace estimator. We propose to select from the MLS the pmf that is closest to a fixed pmf that encodes prior knowledge. When using Kullback-Leibler distance for this selection, the optimization problem consists of finding the minimum of a convex function over a domain defined by linear inequalities, for which standard numerical procedures are available. We apply this estimate to language modeling, using Zipf’s law to encode prior knowledge, and show that this method obtains state-of-the-art results while being conceptually simpler than most competing methods.

1 Introduction

Let p be a probability mass function (pmf) over a set {1, . . . , k} of finite cardinality. This may represent a set of numerical values for a quantitative
© 2005 Massachusetts Institute of Technology
variable or a set of indices for a qualitative variable. The latter situation is often qualified as nonmetric, as will be the case in section 4, where the indices will refer to words in the English vocabulary. Suppose that we observe n samples x_1, . . . , x_n that are independent and identically distributed (i.i.d.) with common pmf p, which is unknown and needs to be estimated from the observed samples. Prior information may be available about p and, in particular, a specific estimate, or an estimate of a certain form, may be preferred when n = 0. For the case when n ≫ k, a very satisfactory answer is the empirical distribution or type p̂, namely:

p̂(X = i) = p̂_i = (1/n) Σ_{t=1}^n 1(x_t = i) ≡ n_i/n,  i ∈ {1, . . . , k},  (1.1)
where 1(·) is an indicator function and, hence, n_i is the number of times the value i is observed in the sample. When n is small, the pioneering work of Laplace (for k = 2) has led to the well-known Bayesian estimates as alternatives to the type. During World War II, while working on cracking German cryptographic systems, Jack Good and Alan Turing invented a method for regularizing the type (Good, 1953; Orlitsky, Santhanam, & Zhang, 2003). In their case, k = 26 was the number of letters in the Latin alphabet, and n ≈ 100–1000. In section 4, we consider a case where k is the number of words in the English vocabulary, which is set to about 10^5, and the training sample is n ≈ 10^6 words. Many smoothing techniques, most being variations on the Good-Turing idea, have been compared for such a case by Chen and Goodman (1996) and Chen and Rosenfeld (1999). Excellent empirical performance is obtained by using Good-Turing–like estimators. With the exception of the Bayesian estimates, however, there is often only a heuristic justification and no principled derivation of the estimation formulae. There have, of course, been numerous studies of the pmf estimation problem since Laplace, and it is not our intention to present a comprehensive survey of the literature here, which begins at least as far back as Lidstone (1920) and continues to be an active area of investigation (Ristad, 1995; Poschel, Ebeling, Froemmel, & Ramirez, 2003). We propose the following new method for estimating p. We consider the counts—the number of times each value appears—and define the maximum likelihood set (MLS) as the set of probability mass functions that put more mass on the observed counts than on any other set of counts possible for the given n. In a second step, an element is chosen from this set. It can be the one with maximum entropy or another based on available prior information.
This view of the problem, we believe, is very natural—indeed, so much so that when we first arrived at this view, we expected that someone had already investigated it. We have not found any evidence of this in the literature.
1.1 The Empirical Distribution. The empirical distribution, or type, of a sample x_1, . . . , x_n, as briefly mentioned earlier, is

p̂ = ( n_1/n, . . . , n_k/n ),  with n = Σ_{i=1}^k n_i,  (1.2)
where n_i, 1 ≤ i ≤ k, are the counts, that is, the number of times the value i appeared in the sample. We write P^k for the set of pmfs over a set of cardinality k and P_n^k for the set of types with denominator n over a set of cardinality k. The probability, under p ∈ P^k, of observing x_1, . . . , x_n is

p(x_1, . . . , x_n) = ∏_{i=1}^k p_i^{n_i},  (1.3)
where n_i are the counts as above. The right-hand side of equation 1.3, viewed as a function of the pmf p, is called the likelihood function and may be rewritten as

∏_{i=1}^k p_i^{n_i} = 2^{−n(D(p̂, p) + H(p̂))},  (1.4)
where

D(p, q) = Σ_{i=1}^k p_i log2 (p_i / q_i),  (1.5)

with 0 log2 (0/q) = 0 and p log2 (p/0) = ∞ for p > 0, is the Kullback-Leibler distance of p from q, and

H(p) = − Σ_{i=1}^k p_i log2 p_i,  (1.6)
with 0 log2 0 = 0, is the Shannon entropy of p. It is clear from equation 1.4 that the type pˆ is a sufficient statistic for estimating p. Also note that pˆ is the maximum likelihood estimate (MLE) of p, that is, the choice of p for which the likelihood equation 1.3 of x1 , . . . , xn , is maximum. Indeed, D( pˆ , p) ≥ 0, with equality iff p = pˆ (cf, e.g., Cover & Thomas, 1991). For k fixed and n → ∞, the type is a strongly consistent and efficient estimate of the pmf. However, the type may not be the best possible estimate for finite n. For example, one may have prior information about the true distribution that is captured in the type only for very large n. There is also
a more structural objection: when k is large, there might be many values 1 ≤ i ≤ k for which p_i ≪ 1/n. In this case, with high probability, we will observe n_i = 0. Hence, low-probability events tend to be underestimated and high-probability events overestimated by p̂. One manifestation of this effect is that the expected entropy of the type underestimates the entropy of the original pmf. Indeed,

E[H(p̂)] = −E[ Σ_{i=1}^k p̂_i log2 ( (p̂_i / p_i) p_i ) ] = −E[D(p̂, p)] + H(p) ≤ H(p).
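The downward bias is easy to reproduce numerically. A minimal Monte Carlo sketch (the pmf, sample size, and trial count below are illustrative choices, not from the paper):

```python
import numpy as np

# Monte Carlo illustration that the plug-in entropy E[H(p_hat)] sits below
# H(p): a fixed pmf on k = 5 symbols, samples of size n = 10, and the
# average of H(p_hat) over many independent samples.
rng = np.random.default_rng(0)
p = np.array([0.4, 0.3, 0.15, 0.1, 0.05])
H_true = -np.sum(p * np.log2(p))

n, trials = 10, 5000
ent = []
for _ in range(trials):
    counts = rng.multinomial(n, p)       # one sample of size n, summarized by counts
    phat = counts / n                    # the type (empirical distribution)
    nz = phat > 0                        # 0 log2 0 = 0 convention
    ent.append(-np.sum(phat[nz] * np.log2(phat[nz])))
H_plugin = np.mean(ent)

assert H_plugin < H_true                 # expected plug-in entropy is biased low
print(H_true, H_plugin)
```

The gap is substantial at this sample size, on the order of (k − 1)/(2n ln 2) bits, and shrinks as n grows.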
In section 2, we therefore construct a set of pmfs that contains the type as well as other pmfs that are close to it. In particular, it contains pmfs with larger entropy than the type. We will then choose an estimate from this set based on available prior knowledge.

1.2 Bayesian Estimates. Bayesian analysis offers an alternative to MLE. The Dirichlet family, indexed by a parameter β, is a family of prior distributions over pmfs given by

π_β(p) = (1/Z(β)) ∏_{i=1}^k p_i^{β−1},  p ∈ P^k,  β ∈ R,  (1.7)
where Z(β) is a normalizing constant. Note that for β = 1, equation 1.7 reduces to the uniform distribution over P^k. Now, if the Bayesian cost function is quadratic, that is,

L(p, q) = Σ_{i=1}^k (p_i − q_i)^2,  (1.8)
then the Bayesian estimate corresponding to the Dirichlet prior is the posterior expectation of p given x_1, . . . , x_n, which can be shown to be

p̂_β(i) = (n_i + β) / (n + βk),  ∀ 1 ≤ i ≤ k.  (1.9)
This is often referred to as an add-β rule. The special case β → 0 yields the MLE p̂, and β = 1 yields the so-called Laplace rule (cf., e.g., Lidstone, 1920). Estimators with β = 0.5 and β = 1/k have also been considered (see Nemenman, Shafee, & Bialek, 2002). Note that all such estimators with β > 0 assign a strictly positive mass to every value in {1, . . . , k}, and they all converge to the type as n → ∞. We will see that the set from which we will choose our estimate contains all add-β rules in equation 1.9 for 0 ≤ β ≤ 1.
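Equation 1.9 is one line of code. A minimal sketch (the count vector is invented for illustration):

```python
import numpy as np

def add_beta(counts, beta):
    """Add-beta estimate of equation 1.9: (n_i + beta) / (n + beta * k)."""
    counts = np.asarray(counts, dtype=float)
    n, k = counts.sum(), counts.size
    return (counts + beta) / (n + beta * k)

counts = [3, 1, 0, 0]                 # hypothetical observed counts: n = 4, k = 4
type_pmf = add_beta(counts, 0.0)      # beta -> 0 recovers the empirical type
laplace = add_beta(counts, 1.0)       # beta = 1 is the Laplace rule

assert np.allclose(type_pmf, [0.75, 0.25, 0.0, 0.0])
assert np.allclose(laplace, [0.5, 0.25, 0.125, 0.125])
```

Note how any β > 0 moves mass onto the unseen symbols while preserving the rank ordering of the counts.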
1.3 Minimax Estimates. An alternative to Bayesian analysis is minimax analysis, where one seeks an estimate that would be optimal in the worst case over the underlying model and on average over the observations. More precisely, if p is the underlying model and q an estimate of p, one builds the functional

R(q) = sup_{p = (p_1, . . . , p_k)} Σ_{n_1, . . . , n_k : Σ_{i=1}^k n_i = n} ( n! / (n_1! · · · n_k!) ) p_1^{n_1} · · · p_k^{n_k} L(p, q).  (1.10)
For the quadratic cost, equation 1.8, as well as for the standardized quadratic cost,

L(p, q) = Σ_{i=1}^k (p_i − q_i)^2 / p_i,  (1.11)
the minimum of R(q) is achieved by an add-β rule, with β = √n / k (Steinhaus, 1957) and β = 0 (Olkin & Sobel, 1979), respectively.

1.4 Maximum Entropy Estimates. Maximum entropy estimation is another standard solution to data sparseness. Instead of estimating p̂, the maximum entropy method first estimates p̂(A_j) = â_j for select sets A_j ⊂ {1, . . . , k}, for which we have sufficient evidence in the n samples. Fixing the probability of some subsets of {1, . . . , k} in this manner typically underspecifies the pmf of interest, leading to a set M of admissible pmfs,

M = { p ∈ P^k : p(A_j) = â_j, j = 1, . . . , J },  (1.12)
in which the estimate p̂ is but one member. From this admissible set, the pmf with the highest Shannon entropy is then chosen as the estimate of p. It is well known (see Berger, Della Pietra, & Della Pietra, 1996) that the pmf with the maximum entropy has an exponential form:

p̂_ME(i) = (1/Z(Λ)) exp( Σ_{j=1}^J λ_j 1(i ∈ A_j) ),  ∀ 1 ≤ i ≤ k,  (1.13)
where the parameters Λ = (λ_1, . . . , λ_J) are chosen to satisfy the constraints of equation 1.12. It can be shown that for every i, as long as at least one p ∈ M satisfies p_i > 0, it follows that p̂_ME(i) > 0. Thus, the maximum entropy estimate is inherently smooth. There are several heuristics but few principles for selecting the sets A_j or even J. In language modeling, some A_j's are typically singletons, specifying, for instance, the probability of words that have been seen sufficiently
often in the sample; some A_j's may contain all words that can take on a certain grammatical part of speech (e.g., adjectives), and some A_j's may overlap with others, for example. Therefore, while maximum entropy estimation eliminates the need for some of the ad hoc assumptions made by other techniques, it leaves open the problem of selecting the sets used to define M. Another weakness of the classical maximum entropy method, as others have pointed out, is that the specification of M via equality constraints leads to an ad hoc choice for any candidate A_j: one must either constrain its probability to be exactly â_j or leave it completely unconstrained. This is unsatisfactory. For instance, if one were considering as candidate sets A_j all singleton sets, then the naive act of including all of them in the definition of M leads to M = {p̂}. On the other hand, leaving out all i for which, say, n_i = 1 from the definition of M may result in an estimate under which n_i > 0 and n_{i′} = 0, but p̂_ME(i) = p̂_ME(i′). Maximum entropy estimation has therefore been proposed with inequality constraints (cf. Khudanpur, 1995; Kazama & Tsujii, 2003):

M = { p : a_j ≤ p(A_j) ≤ b_j, j = 1, . . . , J }.  (1.14)
To the best of our knowledge, there has not been much discussion in the literature of a principled way to make the choice of a_j and b_j, particularly of a way that depends on only the observed sample, and not on other ad hoc assumptions about p. Yet another variation on maximum entropy consists of minimizing a functional of the form

Σ_{j=1}^J µ_j d( p(A_j), â_j ) − H(p),  (1.15)
where d(·, ·) is some metric of deviation from the constraints of equation 1.12 and the parameters µ = (µ_1, . . . , µ_J) are estimated, usually, from held-out data. Yet another way to relax the constraints in equation 1.12 is to note, using convex duality (Berger et al., 1996), that the parameters Λ that satisfy the constraints are exactly the parameters for which the model of equation 1.13 assigns maximum likelihood to the observed sample. One may then choose a penalized likelihood approach with a regularizing function of Λ. Still, several parameters need to be estimated from held-out data in either case. Several such methods are compared in Chen and Goodman (1996) for the estimation of bigram and trigram language models. In section 2, we will seek to provide a principled way of relaxing the linear equality constraints in maximum entropy estimation.
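For the equality-constrained set M of equation 1.12, the maximum entropy element 1.13 can be computed by iterative proportional fitting, one standard numerical route (not the method prescribed here); the sets A_j and targets â_j below are invented for the sketch:

```python
import numpy as np

# Toy iterative-proportional-fitting sketch for maximum entropy subject to
# p(A_j) = a_j (equation 1.12). Sets and targets are hypothetical; the
# targets are feasible (e.g. p = (.3, .2, .2, .3) satisfies both).
k = 4
A = [np.array([True, True, False, False]),   # A_1 = {1, 2}
     np.array([False, True, True, False])]   # A_2 = {2, 3}
a = [0.5, 0.4]

p = np.full(k, 1.0 / k)                      # start from the uniform pmf
for _ in range(500):
    for Aj, aj in zip(A, a):
        mass = p[Aj].sum()
        p[Aj] *= aj / mass                   # match p(A_j) = a_j ...
        p[~Aj] *= (1 - aj) / (1 - mass)      # ... keeping p normalized

assert abs(p[A[0]].sum() - 0.5) < 1e-9
assert abs(p[A[1]].sum() - 0.4) < 1e-9
print(p)
```

Starting from the uniform pmf, the multiplicative updates preserve the exponential form of equation 1.13, so the fixed point is the maximum entropy member of M.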
1.5 Good-Turing and Other Held-Out Methods. In Jelinek (1998, p. 258), the author asks, “How much larger a probability should be assigned to an event observed once than to one not observed at all, or, in general, whether the ratio of probabilities of events observed n and m times, respectively, should really be n/m?” Considering pmfs that put more mass on the observed counts than on any others, which we do in section 2, will lead to one answer to this question: equation 2.7. The Good-Turing and other held-out methods answer the question in a different way. The basic idea is to divide the data into two parts. The first part, called the development set, is used for the collection of counts {ni }. The second part, called the held-out set, is used to estimate additional parameters. A typical structure is as follows:
p̃_i = α · (n_i/n)  if n_i > M,
p̃_i = q_{n_i}       if n_i ≤ M,  (1.16)
where the (usually small) threshold M and smoothed probability estimates q_i, i = 0, . . . , M, are the additional parameters. The Good-Turing estimate (Good, 1953; Orlitsky et al., 2003; McAllester & Schapire, 2000) is obtained by setting

q_i = ( (n_i + 1) / n ) · ( r_{n_i + 1} / r_{n_i} ),  i ∈ {1, . . . , k},  (1.17)
where r_c is the number of symbols j ∈ {1, . . . , k} whose count n_j = c. Thus, q_i for a symbol i depends not just on its count n_i and n, but on the counts of all other symbols. Note that if n_i > n_j, it is not necessarily true that q_i ≥ q_j, though this frequently holds in practice for symbols with very small counts. In other words, q_i may not respect the rank ordering implied by the empirical counts {n_i}, particularly for symbols with large counts. For this reason, the threshold M is often chosen to be small enough so as not to have this undesirable effect. In language modeling, for example, M is typically chosen to be 10 or less, depending on n. The parameter α is then computed so that p̃_i sums to unity. The Good-Turing estimate performs remarkably well for pmfs over words. However, its derivation is somewhat ad hoc and unsatisfactory.

2 The Maximum Likelihood Set

One of the simplest and driving ideas in statistics is as follows: what we observe has to be fairly likely; otherwise we would not have observed it. One way to quantify this is to say that what we observe has to be more likely under the true pmf than any other comparable event. Let us define the
MLS as the set of pmfs that put more mass on the observed type than on any other type given n. Let p = (p_1, . . . , p_k) be a pmf over {1, . . . , k}. The p-probability of observing the type p̂ = (n_1/n, . . . , n_k/n) is

f(p, p̂) = ( n! / (n_1! · · · n_k!) ) ∏_{i=1}^k p_i^{n_i}.  (2.1)

The MLS, with these notations, is defined as

M(p̂) = { p ∈ P^k : ∀ q̂ ∈ P_n^k, f(p, p̂) ≥ f(p, q̂) }.  (2.2)
We will see in section 2.3 that this set always contains the type p̂, which is the MLE for p, and that it shrinks down to it as n → ∞. For finite n, it contains pmfs that might reflect prior information such as smoothness or other desirable properties in a better way than the type, but still remain close to the observed counts. Moreover, this set is a closed convex subset of P^k, opening the way to numerical optimization. Using Stirling's formula, as well as equation 1.4, one can check that

f(p, p̂) ≐ 2^{−nD(p̂, p)},  where u_n ≐ v_n ⇔ lim_{n→∞} (1/n) log (u_n / v_n) = 0.  (2.3)

Hence, for n sufficiently large, the MLS associated with a type p̂ is roughly

{ p ∈ P^k : D(p̂, p) ≤ D(q̂, p), ∀ q̂ ∈ P_n^k },  (2.4)
leading to the loose description that the MLS is the set of pmfs that are "closer" to the observed type than to any other.

2.1 Characterization of the Maximum Likelihood Set. The MLS admits a simpler though still implicit representation. Given the observed counts (n_1, . . . , n_k), define a neighborhood relationship on the set of types with denominator n: the neighbors of (n_1, . . . , n_k) are the types obtained by changing a single sample from one value to another. That is, assume that for a pair of indexes 1 ≤ i, j ≤ k, we have n_j > 0 and n_i < n; then (n′_1, . . . , n′_k), defined by

n′_i = n_i + 1,  n′_j = n_j − 1,  and n′_l = n_l for l ≠ i, j,  (2.5)
is a neighbor of (n1 , . . . , nk ). If a pmf is in the MLS, then it has to put more mass on the observed type than on any of its neighbors. It turns out that the converse is also true, which leads to the following result:
1516
B. Jedynak and S. Khudanpur
Proposition 1. A pmf p = (p_1, . . . , p_k) on the set {1, . . . , k} belongs to the MLS M(p̂) associated with the counts (n_1, . . . , n_k) if and only if

n_j p_i ≤ (n_i + 1) p_j,  ∀ 1 ≤ i ≠ j ≤ k,  (2.6)

or equivalently,

p̂_i / (p̂_j + 1/n) ≤ p_i / p_j ≤ (p̂_i + 1/n) / p̂_j,  ∀ 1 ≤ i ≠ j ≤ k,  (2.7)

where, by convention, a/0 = +∞ whenever a > 0.
The proof uses elementary algebra and is relegated to the appendix.

2.2 Motivating Examples. For k = 2, the MLS is

M(p̂) = M( n_1/n, 1 − n_1/n ) = { p = (p_1, 1 − p_1) : n_1/(n + 1) ≤ p_1 ≤ (n_1 + 1)/(n + 1) }.
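This interval can be confirmed by brute force directly from the definition of the MLS, comparing f(p, p̂) with f(p, q̂) over all types q̂ with denominator n; the values of n and n_1 below are arbitrary:

```python
from math import comb

# Brute-force check of the k = 2 MLS interval [n1/(n+1), (n1+1)/(n+1)]:
# p = (p1, 1 - p1) is in M(p_hat) iff the observed count n1 is at least as
# probable under p as every other possible count for the same n.
n, n1 = 10, 3

def f(p1, m):
    """Probability under (p1, 1 - p1) of the type with count m out of n."""
    return comb(n, m) * p1**m * (1 - p1)**(n - m)

def in_mls(p1):
    return all(f(p1, n1) >= f(p1, m) for m in range(n + 1))

lo, hi = n1 / (n + 1), (n1 + 1) / (n + 1)
for i in range(1, 1000):                 # scan a grid of interior p1 values
    p1 = i / 1000
    assert in_mls(p1) == (lo <= p1 <= hi), p1
print("interval confirmed:", lo, hi)
```

The grid agrees exactly with the closed-form interval, which is the k = 2 case of proposition 1.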
Note that this set contains the type and shrinks down to it as the number of samples goes to infinity. Beside the connection with Dirichlet priors mentioned in section 1, the MLS in this case can be obtained through Bayesian estimation of a proportion with a quadratic cost function and a beta(α, β) prior distribution. It is the set of estimators corresponding to the prior parameters (α, β) satisfying α + β = 1 (see Hogg & Craig, 1995, p. 368). The MLSs for k = 3 are illustrated in Figure 1 for two different values of n. The MLSs are convex cells with linear boundaries. They have at most k × (k − 1) boundaries, one corresponding to each neighboring type. In order to select an estimate from the MLS, one could choose the pmf with maximum Shannon entropy. This choice will be motivated further in section 3. We use it here to illustrate properties of the MLS. For example, if the counts (n_1, . . . , n_k) are made of 0s and 1s only, then the pmf selected is the uniform distribution over {1, . . . , k}, since it is of maximum entropy over all pmfs over {1, . . . , k} and it is included in the MLS, as one can check from equation 2.6. In contrast, if there is one value, say the first one, that gets all the counts, then the selected estimate is, for n > 0,

p*_1 = n / (n + k − 1),  and  p*_l = 1 / (n + k − 1),  ∀ 1 < l ≤ k.  (2.8)
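Proposition 1 reduces MLS membership to at most k(k − 1) inequalities, which makes such claims easy to check numerically. A small sketch (the counts are invented; the candidates are the type, the add-1 rule, and the selected estimate of equation 2.8):

```python
import numpy as np

def in_mls(p, counts):
    """Membership test of proposition 1: n_j p_i <= (n_i + 1) p_j for all i != j."""
    p, counts = np.asarray(p, float), np.asarray(counts)
    k = counts.size
    return all(counts[j] * p[i] <= (counts[i] + 1) * p[j] + 1e-12
               for i in range(k) for j in range(k) if i != j)

counts = np.array([4, 0, 0])                # one value gets all the counts; n = 4, k = 3
n, k = counts.sum(), counts.size

type_pmf = counts / n
laplace = (counts + 1) / (n + k)
max_ent = np.full(k, 1.0 / (n + k - 1))     # equation 2.8
max_ent[0] = n / (n + k - 1)

assert in_mls(type_pmf, counts)             # the type is always in the MLS
assert in_mls(laplace, counts)              # so is the add-1 (Laplace) estimate
assert in_mls(max_ent, counts)              # and the selected estimate 2.8
assert not in_mls([1/3, 1/3, 1/3], counts)  # but the uniform pmf is not
```

Note that max_ent saturates inequality 2.6 for the pairs involving the first symbol, which is exactly why it is the boundary (maximum entropy) member of the set.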
Figure 1: Illustration of the maximum likelihood sets for all the possible types for alphabet size k = 3. (A) n = 3 samples. (B) n = 10 samples. Each “cell” is an MLS containing exactly one type marked with a cross.
If n < k, then note that p*_1 ≤ 0.5, which stands in sharp contrast with the estimate p̂_1 = 1 given by the type. Equation 2.8 is a direct consequence of property 3.4.

2.3 Properties of the Maximum Likelihood Set. We now present some insightful and useful properties of the MLS.

Proposition 2. Let p̂ = (n_1/n, . . . , n_k/n) be a type. The elements p = (p_1, . . . , p_k) of the MLS M(p̂) defined by p̂ satisfy the following:

p̂ ≪ p, i.e., n_i > 0 ⇒ p_i > 0,  ∀ 1 ≤ i ≤ k,  (2.9)

n_i < n_j ⇒ p_i ≤ p_j,  ∀ 1 ≤ i, j ≤ k,  (2.10)

(n/(n + k)) p̂_i ≤ p_i ≤ p̂_i + 1/n,  ∀ 1 ≤ i ≤ k,  (2.11)

||p − p̂||_1 = Σ_{i=1}^k |p_i − p̂_i| ≤ 2(k − 1)/n,  (2.12)

p̂ ∈ M(p̂),  (2.13)
but no other type with denominator n is an element of M(p̂). If x_1, . . . , x_n are independent samples with common pmf q ∈ P^k, then the MLS defined by their type p̂ is such that

sup_{p ∈ M(p̂)} ||p − q||_1 → 0  as n → ∞  with probability 1.  (2.14)
Proposition 2 is essentially a corollary of proposition 1; details of the proof are in the appendix. Properties 2.9 and 2.10 are desirable for any estimate of the pmf generating x_1, …, x_n. Properties 2.11 and 2.12 show how the elements of the MLS may deviate from the underlying type. Property 2.14 shows that, for fixed k, as n gets large, all the elements of the MLS get close to the pmf generating the samples. It is easy to see, by comparing equations 2.11 and 1.9, that the MLS contains the Bayesian estimates for 0 ≤ β ≤ 1.

3 Selecting an Element from the Maximum Likelihood Set

Every pmf in the MLS satisfies a number of properties, as outlined above, that one would consider desirable in an estimate of the pmf generating the samples x_1, …, x_n, and we advocate M(p̂) as an admissible set from which a particular pmf may be selected using secondary criteria. One such criterion is outlined next.
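Properties 2.11 and 2.12 are easy to probe numerically. The following sketch (our own construction, not code from the article; `in_mls` encodes condition 2.6) samples pmfs uniformly from the simplex and checks that every sample that lands in M(p̂) obeys both bounds:

```python
import random

def in_mls(counts, p):
    """Condition (2.6): (n_i + 1) p_j >= n_j p_i for every pair i, j."""
    return all((ni + 1) * pj >= nj * pi
               for ni, pi in zip(counts, p)
               for nj, pj in zip(counts, p))

random.seed(0)
k, n = 4, 12
counts = [5, 4, 2, 1]
p_hat = [c / n for c in counts]

hits = 0
for _ in range(100000):
    w = [random.expovariate(1.0) for _ in range(k)]   # uniform draw on the simplex
    s = sum(w)
    p = [wi / s for wi in w]
    if in_mls(counts, p):
        hits += 1
        # property 2.11: (n/(n+k)) p_hat_i <= p_i <= p_hat_i + 1/n
        assert all(n / (n + k) * ph - 1e-9 <= pi <= ph + 1 / n + 1e-9
                   for ph, pi in zip(p_hat, p))
        # property 2.12: ||p - p_hat||_1 <= 2(k-1)/n
        assert sum(abs(pi - ph) for pi, ph in zip(p, p_hat)) <= 2 * (k - 1) / n + 1e-9

print(hits > 0)   # a small fraction of the samples lands in this MLS
```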
Proposition 3. Let p̂ = (n_1/n, …, n_k/n) be a type and M(p̂) its associated MLS. Let q = (q_1, …, q_k) be a pmf such that […]

If n_j > 0, let

q̂ = (n_1/n, …, (n_i + 1)/n, …, (n_j − 1)/n, …, n_k/n). (A.1)

By definition, f(p, p̂) ≥ f(p, q̂), and hence

[n! / (n_1! ⋯ n_i! ⋯ n_j! ⋯ n_k!)] Π_l p_l^{n_l} ≥ [n! / (n_1! ⋯ (n_i + 1)! ⋯ (n_j − 1)! ⋯ n_k!)] Π_{l ≠ i, j} p_l^{n_l} · p_i^{n_i + 1} p_j^{n_j − 1},

i.e., p_j / n_j ≥ p_i / (n_i + 1). Property 2.6 follows. If n_j = 0, then equation 2.6 follows trivially.

Next, we establish that if p satisfies equation 2.6, then p ∈ M(p̂). Toward this end, again, let

q̂ = (ñ_1/n, …, ñ_k/n) (A.2)

be the empirical pmf associated with any other set of counts (ñ_1, …, ñ_k) for an n-length sample. We construct a sequence of pmfs q̂^(0), …, q̂^(n) such that

q̂^(0) = q̂, f(p, q̂^(0)) ≤ f(p, q̂^(1)) ≤ ⋯ ≤ f(p, q̂^(n)), and q̂^(n) = p̂. (A.3)
In particular, we begin with q̂^(0) defined by the counts

(n_1^(0), …, n_k^(0)) = (ñ_1, …, ñ_k), (A.4)

and, for m = 1, …, n:

• If q̂^(m−1) = p̂, then we set q̂^(m) = q̂^(m−1).

• Otherwise, choose i and j such that n_i^(m−1) > n_i and n_j^(m−1) < n_j, and define q̂^(m) by the counts

n_i^(m) = n_i^(m−1) − 1, n_j^(m) = n_j^(m−1) + 1, and n_l^(m) = n_l^(m−1) for all other l. (A.5)

Note that a suitable pair i, j is guaranteed to exist whenever q̂^(m−1) ≠ p̂. It is clear that for m = 1, …, n, if q̂^(m−1) ≠ p̂, then by construction,

‖q̂^(m) − p̂‖₁ = ‖q̂^(m−1) − p̂‖₁ − 2/n = ⋯ = ‖q̂^(0) − p̂‖₁ − 2m/n.

Since ‖q̂ − p̂‖₁ ≤ 2, it follows that q̂^(n) = p̂. Finally, note that for m = 1, …, n, if q̂^(m−1) ≠ p̂,

f(p, q̂^(m)) / f(p, q̂^(m−1)) = [n! / (n_1^(m)! ⋯ n_k^(m)!)] · [n_1^(m−1)! ⋯ n_k^(m−1)! / n!] · Π_{l=1}^k p_l^{n_l^(m) − n_l^(m−1)}
= (n_i^(m−1) / (n_j^(m−1) + 1)) · (p_j / p_i)
≥ ((n_i + 1) / n_j) · (p_j / p_i)
≥ 1,

where the first inequality holds by construction, since n_i^(m−1) > n_i and n_j^(m−1) < n_j, and the second inequality holds due to equation 2.6.

Proof of Proposition 2. Suppose there is an index 1 ≤ j ≤ k such that n_j > 0 and p_j = 0. Substituting into equation 2.6 forces p_i = 0 for all 1 ≤ i ≤ k, i ≠ j; together with p_j = 0, this would make p identically zero, which is impossible for a pmf. This proves equation 2.9. Equation 2.10 is also a consequence of equation 2.6, as the reader can check.
We remark that equation 2.6 holds trivially for indexes i = j as well. Dividing equation 2.6 by n gives (p̂_i + 1/n) p_j ≥ p̂_j p_i for all i, j; summing, we obtain, for any subset A ⊂ {1, …, k},

Σ_{i∈A} Σ_{j=1}^k p̂_j p_i ≤ Σ_{i∈A} Σ_{j=1}^k (p̂_i + 1/n) p_j and (A.6)

Σ_{j∈A} Σ_{i=1}^k p̂_j p_i ≤ Σ_{j∈A} Σ_{i=1}^k (p̂_i + 1/n) p_j, (A.7)

from which we obtain, ∀ A ⊂ {1, …, k},

(n/(n + k)) p̂(A) ≤ p(A) ≤ p̂(A) + #A/n, (A.8)

where #A is the number of elements in A. Setting A = {i} gives equation 2.11. Now, from Cover and Thomas (1991, p. 300),

‖p − p̂‖₁ = 2(p(A) − p̂(A)), where A = {1 ≤ i ≤ k : p_i > p̂_i}. (A.9)
Using equation A.8, and noting that A ≠ {1, …, k} (one cannot have p_i > p̂_i for every i), so that #A ≤ k − 1, we obtain

‖p − p̂‖₁ ≤ 2 · #A / n ≤ 2(k − 1)/n. (A.10)
Using equation 2.6, one can directly check that p̂ ∈ M(p̂). If another type in P_n^k were also an element of M(p̂), then p̂ would have a neighbor that is an element of M(p̂), following the argument in the (⇐) part of the proof of proposition 1. Call this neighbor q̂. It is such that q̂_i = (n_i + 1)/n and q̂_j = (n_j − 1)/n for some indexes 1 ≤ i, j ≤ k with n_i < n and n_j > 0. Now, as an element of M(p̂), it satisfies

(n_i + 1) q̂_j ≥ n_j q̂_i. (A.11)

But this is equivalent to saying that n_i ≤ −1, which is impossible. Finally,

sup_{p ∈ M(p̂)} ‖p − q‖₁ ≤ 2(k − 1)/n + ‖p̂ − q‖₁, (A.12)

using the triangle inequality together with the bound of equation 2.12. Equation 2.14 then follows from the fact that the type converges to the true distribution in ℓ₁ norm with probability 1.
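Property 2.13, together with the claim that no other type with denominator n belongs to M(p̂), can be verified exhaustively for a small example (a sketch; the enumeration and the `in_mls` helper are ours, not from the article):

```python
from fractions import Fraction
from itertools import product

def in_mls(counts, p):
    """Condition (2.6): (n_i + 1) p_j >= n_j p_i for every pair i, j."""
    return all((ni + 1) * pj >= nj * pi
               for ni, pi in zip(counts, p)
               for nj, pj in zip(counts, p))

n, k = 6, 3
counts = (3, 2, 1)
types = [t for t in product(range(n + 1), repeat=k) if sum(t) == n]
inside = [t for t in types if in_mls(counts, [Fraction(c, n) for c in t])]

assert inside == [(3, 2, 1)]   # only the type itself lies in its own MLS
print(len(types))              # 28 types with denominator n = 6 over k = 3 symbols
```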
Proof of Proposition 4. Assume, to the contrary, that p_i* < p_j* for some i ≠ j with n_i = n_j and q_i ≥ q_j. Define a pmf p** by

p_l** = p_j* for l = i,  p_l** = p_i* for l = j,  p_l** = p_l* for l ≠ i, j. (A.13)

In other words, construct p** by "switching" the ith and the jth entries of p*. Since p* ∈ M(p̂), p* satisfies equation 2.6. But n_i = n_j then implies that, by construction, p** also satisfies equation 2.6. Thus, p** ∈ M(p̂). Next, note that

D(p*‖q) − D(p**‖q) = Σ_{l=1}^k p_l* log (p_l*/q_l) − Σ_{l=1}^k p_l** log (p_l**/q_l)
= p_i* log (p_i*/q_i) + p_j* log (p_j*/q_j) − p_j* log (p_j*/q_i) − p_i* log (p_i*/q_j)
= p_i* log (q_j/q_i) − p_j* log (q_j/q_i)
= (p_i* − p_j*) log (q_j/q_i) ≥ 0,

which contradicts proposition 3, since p* is the unique minimizer of D(p‖q) in M(p̂).
Acknowledgments

We thank Ali Yazgan for his valuable assistance in the use of the CFSQP package and in conducting most of the empirical studies in section 4.1. This research was partially supported by the National Science Foundation via grants ITR-0225656 and IIS-9982329, ARO grant DAAD19-02-1-0337, and general funds from the Center for Imaging Science at the Johns Hopkins University. Finally, we are grateful to the anonymous referees, who gave several insightful suggestions toward improving this article.

References

Bazaraa, M. S., Sherali, H. D., & Shetty, C. (1993). Nonlinear programming. New York: Wiley. Berger, A. L., Della Pietra, S. A., & Della Pietra, V. J. (1996). A maximum entropy approach to natural language processing. Computational Linguistics, 22, 39–71.
Chen, S. F., & Goodman, J. (1996). An empirical study of smoothing techniques for language modeling. In Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics (pp. 310–318). Santa Cruz, CA: Association for Computational Linguistics. Chen, S. F., & Rosenfeld, R. (1999). A gaussian prior for smoothing maximum entropy models (Tech. Rep.). Pittsburgh, PA: Carnegie Mellon University. Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley. Csiszar, I. (1975). I-divergence geometry of probability distributions and minimization problems. Annals of Probability, 3, 146–158. Fletcher, R. (1981). Practical methods of optimization (Vol. 2). New York: Wiley. Good, I. J. (1953). The population frequencies of species and the estimation of population parameters. Biometrika, 40, 237–264. Ha, L. Q., Sicilia, E., Ming, J., & Smith, F. J. (2002). Extension of Zipf’s law to words and phrases. In International Conference on Computational Linguistics (COLING’2002) (pp. 315–320). Taipei, Taiwan. Hogg, R. V., & Craig, A. T. (1995). Introduction to mathematical statistics. Upper Saddle River, NJ: Prentice Hall. Jaynes, E. T. (1994). Probability theory: The logic of science. Cambridge: Cambridge University Press. Jelinek, F. (1998). Statistical methods for speech recognition. Cambridge, MA: MIT Press. Kazama, J., & Tsujii, J. (2003). Evaluation and extension of maximum entropy models with inequality constraints. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing (pp. 137–144). Sapporo, Japan. Khudanpur, S. (1995). A method of ME estimation with relaxed constraints. In Proceedings of the Johns Hopkins University Language Modeling Workshop (pp. 1–17). Baltimore: Center for Language and Speech Processing, Johns Hopkins University. Lawrence, C. T., Zhou, J. L., & Tits, A. L. (1997). 
User’s guide for CFSQP version 2.5: A C code for solving (large scale) constrained nonlinear (minimax) optimization problems, generating iterates satisfying all inequality constraints (Tech. Rep. No. TR-94-16r1). College Park: Institute for Systems Research, University of Maryland. Li, W. (1999). References on Zipf’s law. Available online: http://linkage.rockefeller.edu/wli/zipf/. Lidstone, G. (1920). Note on the general case of the Bayes-Laplace formula for inductive or posterior probabilities. Transactions of the Faculty of Actuaries, 8, 182–192. McAllester, D., & Schapire, R. E. (2000). On the convergence rate of Good-Turing estimators. In Proc. 13th Annual Conference on Computational Learning Theory (pp. 1–6). San Francisco: Morgan Kaufmann. Nemenman, I., Shafee, F., & Bialek, W. (2002). Entropy and inference, revisited. In T. G. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems, 14. Cambridge, MA: MIT Press. Olkin, I., & Sobel, M. (1979). Admissible and minimax estimation for the multinomial distribution and k independent binomial distributions. Annals of Mathematical Statistics, 7, 284–290. Orlitsky, A., Santhanam, N. P., & Zhang, J. (2003). Always good Turing: Asymptotically optimal probability estimation. Science, 302, 427–431.
Poschel, T., Ebeling, W., Froemmel, C., & Ramirez, R. (2003). Correction algorithm for finite sample statistics. European Physical Journal E, 12, 531–541. Ristad, E. S. (1995). A natural law of succession (Tech. Rep. No. CS-TR-495-95). Princeton, NJ: Department of Computer Science, Princeton University. Steinhaus, H. (1957). The problem of estimation. Annals of Mathematical Statistics, 28, 633–648.
Received February 17, 2004; accepted January 5, 2005.
LETTER
Communicated by Jonathan Victor
Estimating Entropy Rates with Bayesian Confidence Intervals Matthew B. Kennel
[email protected] Institute for Nonlinear Science, University of California, San Diego, La Jolla, CA 92093-0402, U.S.A.
Jonathon Shlens
[email protected] Systems Neurobiology Laboratory, Salk Institute for Biological Studies, La Jolla, CA 92037, and, Institute for Nonlinear Science, University of California, San Diego, La Jolla, CA 92093-0402, U.S.A.
Henry D. I. Abarbanel
[email protected] Department of Physics and Marine Physical Laboratory, Scripps Institution of Oceanography, University of California, San Diego, La Jolla, CA 92093-0402, U.S.A.
E. J. Chichilnisky
[email protected] Systems Neurobiology Laboratory, Salk Institute for Biological Studies, La Jolla, CA 92037, U.S.A.
The entropy rate quantifies the amount of uncertainty or disorder produced by any dynamical system. In a spiking neuron, this uncertainty translates into the amount of information potentially encoded, and it is thus the subject of intense theoretical and experimental investigation. Estimating this quantity in observed, experimental data is difficult and requires a judicious selection of probabilistic models, balancing between two opposing biases. We use a model weighting principle originally developed for lossless data compression, following the minimum description length principle. This weighting yields a direct estimator of the entropy rate, which, compared to existing methods, exhibits significantly less bias and converges faster in simulation. With Monte Carlo techniques, we estimate a Bayesian confidence interval for the entropy rate. In related work, we apply these ideas to estimate the information rates between sensory stimuli and neural responses in experimental data (Shlens, Kennel, Abarbanel, & Chichilnisky, in preparation).

Neural Computation 17, 1531–1576 (2005) © 2005 Massachusetts Institute of Technology
1 Introduction

What do neural signals from sensory systems transmit to the brain? An understanding of how neural systems make sense of the natural environment requires characterizing the language neurons use to communicate (Borst & Theunissen, 1999; Perkel & Bullock, 1968; Rieke, Warland, de Ruyter van Steveninck, & Bialek, 1997). As a step in this process, Shannon’s information theory (Cover & Thomas, 1991; Shannon, 1948) provides a means of quantifying the amount of information represented without specific assumptions about what information is important to the animal or how it is represented.

Neural systems operate in a time-dependent, fluctuating sensory environment, full of spatial and temporal correlations (Field, 1987; Ruderman & Bialek, 1994; Simoncelli & Olshausen, 2001). Given these correlations, we ask how much novel information per second the neural response provides about the sensory world. This is the average mutual information rate, or the largest amount of new information per second that the brain could use to update its knowledge about the sensory world. Information rates of neural spike trains bound the performance of any candidate model of sensory representation (Bialek, Rieke, de Ruyter van Steveninck, & Warland, 1991; Borst & Theunissen, 1999; Buracas, Zador, DeWeese, & Albright, 1998; Strong, Koberle, de Ruyter van Steveninck, & Bialek, 1998; Warland, Reinagel, & Meister, 1997), without regard to how that information is encoded.

Most methods for estimating the information rate from experimental data proceed by estimating the entropy rate as an intermediate step (but see Kraskov, Stogbauer, & Grassberger, 2004; Victor, 2002). The entropy rate of a dynamical or stochastic system is the rate at which new uncertainty is revealed per unit of time (Lind & Marcus, 1996) and plays a central role in the theory of information transmission (Shannon, 1948).
Shannon’s theory showed that any transmission channel must have a capacity above this quantity in order to reproduce a signal without error. The entropy rate also gives the best possible compressed transmission rate for any lossless encoding of typical signals. In dynamical systems theory, the entropy rate (in particular, the Kolmogorov-Sinai entropy) is the principal quantification of the existence and amount of chaos (Gilmore & Lefranc, 2002; Hilborn, 2000; Ott, 2002). Estimating the entropy rate from observed data like spike trains can be surprisingly difficult in practice. The classical definitions of entropy rate do not lead easily to a reliable and accurate estimator, given only an observed data set of finite size. Entropy rate estimators nearly always assume some
Estimating Entropy Rates
1533
underlying probabilistic model for the observed data, although this aspect is often not explicitly recognized. A reliable rate estimator requires addressing two issues:

• Accurately estimating entropies from observed data

• Selecting the appropriate model for time-correlated dynamics
Although the first issue has been vigorously investigated, the second has not. Even with sophisticated methods for entropy estimation (Costa & Hero, 2004; Miller & Madow, 1954; Nemenman, Shafee, & Bialek, 2002; Paninski, 2003; Roulston, 1999; Strong et al., 1998; Treves & Panzeri, 1995; Victor, 2000, 2002; Victor & Purpura, 1997), the final estimate of the rate (as opposed to estimating block entropies) can be dominated by a human-derived choice of some underlying modeling parameter (Schurmann & Grassberger, 1996; Strong et al., 1998). These heuristics are subjective and make constructing a confidence interval for the rate difficult. By explicitly addressing the model selection problem, we have designed an estimator for entropy rate that requires no heuristic decisions, contains no important free parameters, and uniquely provides a Bayesian confidence interval about its estimate. We follow a statistical principle inspired by data compression (Rissanen, 1989) to weight a mixture of models appropriate for a finite time series (Kennel & Mees, 2002; London, Schreibman, Hausser, Larkum, & Segev, 2002; Willems, Shtarkov, & Tjalkens, 1995). These “context-tree” models may be used to estimate many quantities, but here we concentrate on Bayesian estimators of entropy (Nemenman et al., 2002; Wolpert & Wolf, 1995), which are combined to yield a direct estimator of entropy rate. We demonstrate through simulation that these estimates are consistent, exhibit comparatively low bias on finite data sets, outperform common alternative procedures, and provide confidence intervals on the estimated quantities. We calculate confidence intervals about the estimate using a numerical Monte Carlo method (Gamerman, 1997). Although motivated by problems in neural coding, the algorithm presented here is a general estimator of entropy rate for any observed sequence of discrete data. 
In a related work, we extend these techniques to estimate the mutual information rate from experimentally observed neural spike trains (Shlens, Kennel, Abarbanel, & Chichilnisky, in preparation). This letter proceeds by reviewing the classical and Bayesian estimators of entropy, as well as the common issues in extracting an entropy rate from entropies. Motivation for the context-tree modeling method for a time series and its justification follows. Next, we show the application of the model to entropy rate estimation and empirical results. Finally, we discuss potential extensions and limitations of this method of estimating entropy rates, as well as current applications in neural data analysis.
2 Entropy and Entropy Rate

2.1 Classical and Bayesian Estimators of Entropy. We begin by outlining a Bayesian procedure for estimating the entropy and its associated confidence interval. The Shannon entropy of a discrete probability distribution P = {p_i}, i ∈ [1, A],

H(P) = H({p_i}) = − Σ_{i=1}^A p_i log p_i, (2.1)
quantifies the uncertainty represented by P, or the average number of yes-no questions needed to determine the identity i of a random draw from P (Cover & Thomas, 1991; MacKay, 2003; Shannon, 1948).1 Although equation 2.1 provides a definition of entropy, estimating this quantity (and P) well from finite data sets of observations presents significant statistical and conceptual challenges, which are far greater when estimating the entropy rate.

Assume that one observes N independent draws of a discrete random variable from some unknown underlying distribution with alphabet A. The count vector c = [c_1, …, c_A] accumulates the observed occurrences of each symbol. The naive entropy estimator assigns the observed frequencies to the probabilities, p_j = c_j/N:

Ĥ_naive(c) = Σ_{j=1}^A − (c_j/N) log₂ (c_j/N). (2.2)
This estimator yields the correct answer as N/A → ∞, but in many practical cases it is significantly biased. The first-order correction (Miller & Madow, 1954; Roulston, 1999; Victor, 2000),

Ĥ_MM(c) = Σ_{j=1}^A − (c_j/N) log₂ (c_j/N) + ((A − 1)/(2N)) log₂ e, (2.3)
still retains significant bias when N ≈ A or N ≪ A (Paninski, 2003). An alternative approach is a Bayesian estimator of entropy (Wolpert & Wolf, 1995). Consider a hypothetical probability distribution θ, which is a candidate for P. Each component θ_j is a guess for the true probability parameter p_j and, accordingly, must satisfy Σ_{j=1}^A θ_j = 1. Consider the probability
1 All information-theoretic quantities in this letter use base 2 logarithms to provide units of bits.
of drawing one symbol j assuming the underlying distribution is θ. By definition, this is θ_j. The likelihood of drawing the particular set of empirically observed counts c is the standard multinomial,

P(c|θ) = (1/Z) Π_{j=1}^A (θ_j)^{c_j}, (2.4)

where Z is a normalization. A Bayesian entropy estimate averages the Shannon entropy of θ, H(θ) = − Σ_j θ_j log θ_j, over the relative likelihood that θ might actually represent the truth given the observed counts, P(θ|c):

Ĥ_Bayes(c) = ∫ H(θ) P(θ|c) dθ. (2.5)
By Bayes’ rule, we take P(θ|c) ∝ P(c|θ)P(θ) and recast the estimation as

Ĥ_Bayes(c) = ∫ H(θ) P(c|θ) P(θ) dθ
= ∫ H δ[H − H(θ)] P(c|θ) P(θ) dθ dH
= ∫ H P(H|c) dH. (2.6)
A disadvantage of a Bayesian approach is that one needs a somewhat arbitrary prior distribution, P(θ), on the parameters of the distribution θ. We discuss the common choice in appendix A. Nemenman et al. (2002) investigated the properties of Hˆ Bayes in the modest to small N/A limit. In this regime Hˆ Bayes is dominated by the particular prior P(θ) chosen, not the observations, meaning that the estimate does not reflect properties derived from actual data. They suggest an interesting meta-prior as a correction. In our application, this is not a concern because N/A is typically large when we estimate Hˆ Bayes , and hence we do not apply this correction. There is a conceptual point to consider here. Which estimate should be used if all observations occur in a single bin? An argument can be made that in this circumstance, all of these estimators ( Hˆ Bayes , Hˆ MM ) should be bypassed in favor of an estimate of zero, for example, define Hˆ zero = 0 for deterministic data, Hˆ zero = Hˆ Bayes or Hˆ MM otherwise. The idea is that if the underlying physics reliably produces the same value, then there is probably some underlying mechanistic principle preventing anything else. For instance, in neural data absent of spikes (or completely silent), a better estimate might be to use Hˆ zero rather than the small but positive values that Hˆ MM or Hˆ Bayes would yield. An alternative perspective is that although all
observations have been the same value, the future and underlying truth are not necessarily so; thus, Ĥ_Bayes or Ĥ_MM would be appropriate. Choosing which estimator is appropriate is an externally directed prior on the expected structure of observations. In empirical testing, using Ĥ_zero improves bias on some physical systems we have considered. For the results we present later, the effect is very small.

2.2 Bayesian Confidence Intervals for Ĥ_Bayes. To make inferences from real data, we need a range of results that appear compatible with the data rather than a single-point estimate of the entropy. Because all estimates are subject to statistical fluctuation, comparing any two quantities requires weighing their difference against reasonable statistical fluctuations. We need an "error bar" on our estimate, but from experiment we have only a single data set. Again, we choose the Bayesian perspective and ask what the likelihood is of an estimated entropy given the observed data, P(H|c). A variety of underlying distributions could have been sampled to generate the observed data. Each such distribution has a greater or lesser compatibility with the data (relative likelihood), and we weight their entropies accordingly. The width of the posterior distribution P(H|c) around its mean, Ĥ_Bayes, is a Bayesian measure of the uncertainty of the estimate. Wolpert and Wolf (1995) computed Ĥ_Bayes and the variance of P(Ĥ_Bayes|c) analytically using the same prior as ours. We want to go beyond the variance and find confidence intervals. We call the central portion of P(H|c) the Bayesian confidence interval of our estimate. (The Bayesian form is sometimes called a credible interval, distinguishing it from frequentist confidence intervals, which have a different definition.) Specifically, we find quantiles of P(H|c): those locations where the cumulative distribution achieves some value 0 < α < 1.
With C(H) = ∫_{−∞}^{H} P(H′|c) dH′, a quantile H_α is defined as the value where C(H_α) = α. Then, for example, a 90% confidence interval is [H_0.05, H_0.95]. Roughly, we might say the observed data have been produced by a distribution whose entropy falls within the interval with 90% likelihood.2 Unlike the variance, the quantiles for entropy are not known analytically, but they can be estimated using a Markov chain Monte Carlo (MCMC) technique (Hastings, 1970; Metropolis, Rosenbluth, Rosenbluth, Teller, & Teller, 1953). The MCMC algorithm produces a succession of vectors θ_k, k = 1, …, N_MC, which sample P(θ|c). By taking the entropy of each θ_k, we obtain N_MC individual samples, H_k = H(θ_k), from the distribution P(H|c). We calculate Ĥ_Bayes to be the mean of this empirical distribution, approximating equation 2.6, and the Bayesian confidence interval to be the
2 Note that H ˆ Bayes is the mean, distinct from the median, H0.5 , and the mode, the maximum likelihood estimate.
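For the conjugate multinomial-Dirichlet case, the posterior P(θ|c) can be sampled directly (i.i.d.) instead of by MCMC, which yields the same quantile construction. A minimal sketch, assuming a symmetric Dirichlet prior (an assumption on our part; the letter's exact prior is specified in its appendix A):

```python
import random
from math import log2

def sample_h_posterior(counts, n_mc=199, alpha=1.0, rng=random):
    """Draw n_mc samples of H(theta), theta ~ Dirichlet(counts + alpha):
    the posterior under a symmetric Dirichlet(alpha) prior (assumed here)."""
    hs = []
    for _ in range(n_mc):
        g = [rng.gammavariate(c + alpha, 1.0) for c in counts]  # gamma trick
        z = sum(g)
        hs.append(-sum(x / z * log2(x / z) for x in g if x > 0))
    return sorted(hs)

random.seed(1)
hs = sample_h_posterior([5, 5])    # coin-flip counts, as in Table 1
h_bayes = sum(hs) / len(hs)        # Monte Carlo version of equation 2.5
lo, hi = hs[9], hs[189]            # 10th and 190th smallest: ~[H_0.05, H_0.95]
print(round(h_bayes, 2), round(lo, 2), round(hi, 2))
```

With 199 samples, the 5% and 95% quantiles can be read off directly from the sorted list, as described in the text.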
Table 1: Examples of Entropy Estimates (Bits) ± Estimate of Standard Deviation.

             (5,5)            (50,50)          (1,9)          (1,99)
Ĥ_naive      1.000 ± 0.000    1.000 ± 0.000    0.47 ± 0.27    0.081 ± 0.065
Ĥ_MM         1.070 ± 0.000    1.007 ± 0.000    0.54 ± 0.27    0.088 ± 0.065
Ĥ_Bayes      0.937 ± 0.082    0.993 ± 0.010    0.51 ± 0.24    0.105 ± 0.068

Note: The top row shows count vectors c = [# of heads, # of tails] for a binary alphabet.
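The Ĥ_naive and Ĥ_MM rows of Table 1 can be reproduced in a few lines (a sketch of equations 2.2 and 2.3, not the authors' code; values agree with the table up to rounding):

```python
from math import e, log2

def h_naive(counts):
    """Plug-in estimator of equation 2.2 (bits)."""
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

def h_mm(counts):
    """Miller-Madow corrected estimator of equation 2.3."""
    a, n = len(counts), sum(counts)
    return h_naive(counts) + (a - 1) / (2 * n) * log2(e)

for c in ([5, 5], [50, 50], [1, 9], [1, 99]):
    print(c, round(h_naive(c), 3), round(h_mm(c), 3))
```

Note in particular that h_mm([5, 5]) exceeds 1 bit, the pathology for equal counts discussed next.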
quantiles [H_0.05, H_0.95] of the empirical distribution. The number of samples, N_MC, is a user-controlled free parameter chosen solely to balance Monte Carlo fluctuation against computation time. We typically use N_MC = 199 so that the 5% and 95% confidence interval values can be read off directly as the 10th and 190th smallest values in a sorted list of H_k. We outline the specific prior, the analytical formula for Ĥ_Bayes, and a concrete implementation of the MCMC algorithm in appendix A.

2.3 Comparing Ĥ_naive, Ĥ_MM, and Ĥ_Bayes. We demonstrate the behavior of Ĥ_naive, Ĥ_MM, and Ĥ_Bayes on varying hypothetical count frequencies to show the properties of the estimators. Suppose we count the probability of flipping a tail with a biased coin. Table 1 shows the results. If equal counts are observed, say (5, 5) or (50, 50), the naive estimator gives an entropy of exactly one. The classical bias-corrected version Ĥ_MM gives a value larger than one, which is impossible for the entropy of any binary distribution. This is because the bias correction is calculated in a "frequentist" philosophy: one considers the observed frequencies to reflect a distribution approximately close to the truth, and one then pretends one can simulate new finite sets of observations from these probabilities. It is well known that simulating finite data from a distribution gives, on average, a spikier empirical distribution than the truth, leading to lower entropy with a naive estimator. Thus, the naive estimator is deemed to be biased down, and the upward correction is applied to give Ĥ_MM. The same philosophy can be applied to estimate a standard deviation (Roulston, 1999). Unfortunately, for equal counts, this estimate is identically zero, which is nonsensical as well. Ĥ_Bayes does not show these pathologies: the posterior distribution of likely values has support only in the valid region, and the width of the distribution is always sensible.
For nonequal counts (e.g., (1, 9) and (1, 99)), the estimators behave more similarly. Table 1 demonstrates how Ĥ_Bayes may be both larger and smaller than Ĥ_MM in various circumstances. (See Wolpert & Wolf, 1995, for more demonstrations of the properties of Ĥ_Bayes.)

2.4 Entropy Rate. The entropy rate, as opposed to the entropy, characterizes the uncertainty in a dynamical system. It is a particular asymptotic limit of the entropy of a time series. Unfortunately, as we see below, there exist
major empirical quantitative problems in trying to successfully estimate the entropy rate from observed time series. Resolving these problems is the subject of our investigation. We proceed from discrete distributions to time series of discrete symbols. This stream of integers might be a spike train, with the integer being the number of spikes in a small time interval (MacKay & McCulloch, 1952; Strong et al., 1998); a discretized interspike interval (Rapp, Vining, Cohen, Albano, & Jimenez-Montano, 1994); an arbitrary symbolic dynamical system (Lind & Marcus, 1996); or even a computer file on a hard drive (Cover & Thomas, 1991). The resulting time series of integers, R = {r1 , r2 , r3 , . . .}, where the subscript indexes time, is termed a symbol stream with alphabet size A. The underlying process that created the time series is called a symbol source.3 In a symbolic dynamical system, a characteristic of the underlying symbol source is the uncertainty of the next symbol (Cover & Thomas, 1991; Lind & Marcus, 1996). In other words, the uncertainty of the next symbol is an intensive property of a dynamical system. We briefly explore the significance of this property in a symbolic system. With an independent and identically distributed (i.i.d.) symbol source, the uncertainty of the next symbol in R is simply the entropy of the discrete distribution, H({ri }). In contrast to the i.i.d. case, many real dynamical systems exhibit serial correlations due to internal state variables and time dependence. Consider the refractory dynamics of a spiking neuron, or inertia in a mechanical system. Knowledge of recently observed symbols or states alters the estimate of the next observation. To capture these dynamics, consider the distribution of words, length-D blocks of successive symbols from the source, defining the block entropy HD ≡ H({ri+1 , . . . , ri+D }), which quantifies the average uncertainty in observing any pattern of D consecutive symbols. 
Block entropy is normally extensive in the thermodynamic sense, increasing linearly with D for sufficiently large D (e.g., the heat capacity of a solid increases with the amount of matter in question).4 The quantity that characterizes the underlying source is, however, the new uncertainty per additional symbol. Dividing HD by D gives an intensive quantity, which reflects a characteristic of the underlying system (e.g., the specific heat is a property of the substance). HD /D will approach the asymptotic limit from above, as increasingly long words reveal more potential interactions.
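This approach of H_D/D to its limit from above is easy to see on a toy example (our own, not from the letter): a binary Markov source that flips its symbol with probability 0.2, whose true rate is the binary entropy of 0.2, about 0.722 bits per symbol:

```python
import random
from collections import Counter
from math import log2

def block_entropy(symbols, d):
    """Naive block entropy H_D over sliding length-d words (bits)."""
    words = Counter(tuple(symbols[i:i + d]) for i in range(len(symbols) - d + 1))
    n = sum(words.values())
    return -sum(c / n * log2(c / n) for c in words.values())

# Binary Markov source: flip the current symbol with probability 0.2.
# True entropy rate: h = -(0.2 log2 0.2 + 0.8 log2 0.8) ~ 0.722 bits/symbol.
random.seed(0)
r, s = [], 0
for _ in range(100000):
    if random.random() < 0.2:
        s = 1 - s
    r.append(s)

for d in (1, 2, 4, 8):
    print(d, round(block_entropy(r, d) / d, 3))   # decreases toward ~0.722 from above
```

With 10^5 symbols, the estimates at these small D are essentially unbiased; the undersampling bias discussed in section 2.5 appears only at much larger D.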
3 By assumption, the symbol stream and underlying source are stationary and ergodic.

4 The statistical mechanics community is showing increasing interest in complex systems, often defined as having a substantial subextensive term (Bialek et al., 2001), often at the critical point between complete order and conventional chaos.
The entropy rate is defined by three equivalent asymptotic limits (Cover & Thomas, 1991):

h ≡ lim_{D→∞} H_D/D (2.7)
  ≡ lim_{D→∞} (H_{D+1} − H_D) (2.8)
  ≡ lim_{D→∞} H(r_{i+1} | r_i, r_{i−1}, …, r_{i−D}), (2.9)
where h has units of bits per symbol or, possibly, bits per second. The limit D → ∞ ensures that we account for all possible temporal correlations or historical dependence. In the following section, we highlight existing methods for entropy rate estimation that focus on definitions 2.7 and 2.8. Subsequently, we return to definition 2.9 to derive a new estimator of the entropy rate.

2.5 Estimating Entropy Rates with Block Entropy. Even with a sophisticated estimator of entropy, it is not trivial to calculate an entropy rate. Definitions 2.7 and 2.8 suggest strategies for estimating the entropy rate by first making an estimate Ĥ_D (Schurmann & Grassberger, 1996) for a range of D. The difficulty with these approaches is that two competing biases make the final estimation step of the rate a qualitative judgment with minimal rejection criteria (Treves & Panzeri, 1995), which forgoes any attempt at calculating confidence intervals on the estimated quantities. One result of this situation is that bias can significantly contaminate the estimate, and variances are underestimated (Miller & Madow, 1954; Nemenman et al., 2002; Paninski, 2003; Roulston, 1999; Treves & Panzeri, 1995). We illustrate this situation in simulation.

Consider estimating the entropy rate by calculating Ĥ_D at varying depths D. Estimating the entropy rate amounts to selecting a particular word length D* and calculating ĥ_block = Ĥ_D*/D*. Given infinite data, selecting the appropriate word length is trivial, as the asymptotic behavior lim_{D→∞} H_D/D guarantees an accurate estimate of ĥ for large D, where all temporal dependencies in the symbol stream have been accounted for.5 However, for finite data, selecting an appropriate D*, the value at which Ĥ_D*/D* is deemed to be the best estimate of the rate, is not trivial, because a second but (typically) opposing bias complicates the procedure. Discrete estimators of entropy are often dominated by negative bias at large enough D due to undersampling.
Thus, the observed distribution is spikier and appears to have lower entropy than the truth. Estimating ĥ by selecting a word length must contend with these two competing biases in finite data sets:

5 The same qualitative picture arises by plotting Ĥ_{D+1} − Ĥ_D. Empirically, this estimate converges faster to the entropy rate (in D) but is prone to large errors due to statistical fluctuations (Schurmann & Grassberger, 1996).
1540
M. Kennel, J. Shlens, H. Abarbanel, and E. Chichilnisky
1. Negative bias: At large word lengths, undersampling in naive probability estimators typically produces downward bias due to Jensen's inequality (Paninski, 2003; Treves & Panzeri, 1995).

2. Positive bias: At small word lengths, finite D in H_D/D produces upward bias due to insufficiently modeled temporal dependencies (Cover & Thomas, 1991).

Given this situation, the common assumption is that there exists an intermediate region in D, between the realms of the two biases, that provides a plateau. The plateau suggests a converged region in which one can select a word length to successfully estimate ĥ_block. Unfortunately, as seen in Figure 1, the distinctiveness, or even existence, of a plateau depends highly on the number of data and the temporal complexity of the underlying symbol source. In experimental data from an unknown symbol source, the location of the plateau, if any, is unknown. In this approach, the rate estimate is dominated by the selection of the region of word lengths used, which is determined by a qualitative judgment of the slopes in Figure 1.

Another similar criterion for selecting an appropriate intermediate range of word lengths is to follow the direct method (de Ruyter van Steveninck, Lewen, Strong, Koberle, & Bialek, 1997; Reinagel & Reid, 2000; Strong et al., 1998). The convergence of block entropies to the entropy can be decomposed into extensive and subextensive terms,

h = H_D/D + f(D)/D,
where f(D) is defined as a monotonic function of word length that grows at less than a linear rate. For large enough D, the leading subextensive term is assumed to be constant. In practice, the data analyst plots H_D/D versus 1/D and assumes there is some region of the plot with a linear slope such that:

1. D is large enough so that f(D) ≈ k.

2. D is small enough so that Ĥ_D is not contaminated by sampling bias.

Strong et al. (1998) give some weak bounds based on coincidence counting to help with the second criterion, but they are often not tight enough to be empirically effective. As before, this entropy rate estimate requires a choice based on a subjective assessment of the slope of block entropy plots. Rather than rely on human-directed estimation, we offer an approach that, in effect, automatically selects the word lengths appropriate for finite data sets. We place this problem in a larger framework, akin to probabilistic model selection.
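The two competing block-entropy estimates of the rate, Ĥ_D/D and Ĥ_{D+1} − Ĥ_D, can be sketched as follows. This is a minimal illustration, not the authors' code: a naive maximum-likelihood entropy estimator stands in for the Ĥ_MM used in Figure 1, and the Markov source is a simple hypothetical one.

```python
import math
import random
from collections import Counter

def markov_symbols(n, p_flip=0.1, seed=0):
    """Binary first-order Markov chain: flip the previous symbol w.p. p_flip."""
    rng = random.Random(seed)
    s, out = 0, []
    for _ in range(n):
        if rng.random() < p_flip:
            s = 1 - s
        out.append(s)
    return out

def H_naive(symbols, D):
    """Naive (maximum-likelihood) block entropy H_D in bits."""
    counts = Counter(tuple(symbols[i:i + D]) for i in range(len(symbols) - D + 1))
    n = sum(counts.values())
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

symbols = markov_symbols(100_000)
H = {D: H_naive(symbols, D) for D in range(1, 11)}
for D in range(1, 10):
    # Two competing estimates of h: H_D/D converges slowly from above,
    # while H_{D+1} - H_D converges faster but fluctuates more.
    print(D, round(H[D] / D, 4), round(H[D + 1] - H[D], 4))
```

For this source the true rate is the flip entropy H(0.1) ≈ 0.469 bits per symbol, which Ĥ_{D+1} − Ĥ_D approaches almost immediately, while Ĥ_D/D descends toward it only slowly in D.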
Estimating Entropy Rates
1541
Figure 1: Estimating the entropy rate using ĥ(D) = Ĥ_D/D or Ĥ_{D+1} − Ĥ_D for a simple two-symbol Markov process and a symbolized logistic map. The horizontal dashed line is the true entropy rate calculated analytically (see section 5). Ĥ_D is calculated using Ĥ_MM with O(10^6) and O(10^5) symbols in the top and bottom rows, respectively. Estimating ĥ with ĥ(D) becomes difficult as the plateau disappears with fewer symbols and greater temporal complexity (e.g., logistic map) in the underlying information source.
2.6 Model Selection in Entropy Rate Estimation. Selecting a plateau in the block entropies amounts to selecting a word length D∗ at which we believe we have accounted for all temporal dependencies in the underlying dynamics. Implicit is a probabilistic model with A^{D∗} parameters for the distribution of words that accounts for the temporal dependencies. In other words, successive (nonoverlapping) words of length D∗ are assumed to be statistically independent. Block entropies of order D∗ model the spike train as independent draws from a distribution of alphabet size A^{D∗}.

This relationship suggests an equivalent topology for retaining frequency counts for all words: a suffix tree, as diagrammed in Figure 2. This data structure is often used in practice for accumulating frequency distributions to estimate block entropies. Each node in the suffix tree retains the counts for a particular word, and all the nodes at a depth D retain P̂(r_{i+1}, . . . , r_{i+D}). Selecting a plateau prunes a suffix tree at a single depth D∗, selecting the structure necessary to model the data (Johnson, Gruner, Baggerly, & Seshagiri, 2001).
Figure 2: A suffix tree of maximum depth 3 modeling a binary symbol stream. Each node retains counts of the number of observations of the word corresponding to the node. The root node λ counts the number of symbols observed in the stream. Children of a node prepend one additional symbol earlier in time. The counts at all nodes for D = 3, when normalized, provide P̂(r_{i+1}, . . . , r_{i+3}).
Given a fixed amount of data, can we select a tree topology with a more principled, less qualitative criterion? In the following sections, we make this probabilistic model more explicit for entropy rate estimation using definition 2.9. We show techniques to select an appropriate tree topology for a finite data set.
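The suffix-tree bookkeeping of Figure 2 can be sketched with a dictionary keyed by words. This is a hypothetical minimal implementation; a production version would use an explicit tree with child pointers.

```python
from collections import Counter

def word_counts(symbols, max_depth):
    """Counts for every word (suffix-tree node) up to max_depth.
    The empty word () plays the role of the root node lambda."""
    counts = Counter()
    n = len(symbols)
    for i in range(n):
        for D in range(0, max_depth + 1):
            if i + D <= n:
                counts[tuple(symbols[i:i + D])] += 1
    return counts

stream = [0, 1, 1, 0, 1, 1, 0, 1]
c = word_counts(stream, 3)
# The root counts the symbols in the stream; deeper nodes count words.
print(c[()], c[(0, 1)], c[(0, 1, 1)])  # prints: 8 3 2
```

Normalizing the counts at depth D recovers the word distribution P̂(r_{i+1}, . . . , r_{i+D}) used by the block-entropy estimators.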
3 Modeling Discrete Time Series

3.1 Markovian Modeling. Our approach to estimating the entropy rate directly is to focus on the conditional entropy formulation h ≡ lim_{D→∞} H(r_{i+1}|r_i, . . . , r_{i−D}). This is a time-series model-building approach, as we make explicit estimates of the distribution of the next symbol. A classical example in the continuous case is an autoregressive model whose future is a function of previous values up to a fixed order. We confine our work, though, to discrete processes. Consider a first-order Markov chain with states σ ∈ Σ, each of which has some transition probability distribution P(r|σ) for emitting some symbol
r ∈ A and a deterministic transition to a new state σ′. This Markov chain is a stationary symbol source, and its entropy rate is

h = Σ_σ H[P(r|σ)] µ(σ),    (3.1)
with H[·] denoting the Shannon entropy (see equation 2.1) and µ(σ) the stationary probability of state σ.6 This formula is exact for a Markov chain (Cover & Thomas, 1991). In our problem, we must estimate the set of states and transition probabilities from data. A Markov model–based entropy rate estimator applies a deterministic projection from the semi-infinite, conditioned past onto a finite set of states Σ. We can equate the set of finite states Σ ≡ R_D, where R_D ≡ {r_i, . . . , r_{i−D}} is the finite conditioning history of depth D. If a finite history of recent symbols uniquely determines the distribution of the next symbol, then this is a Markov model, and we take equation 3.1 as the entropy rate h for the system. To estimate the entropy, we need to first select the right set of states and then estimate the transition and stationary distributions. We may then calculate the entropy rate directly from this model estimated from the observations.7

This is a model of finite-time conditional dependence, which is distinct from the timescale of absolute correlation. To understand the difference, imagine a first-order Markov chain that has a large probability of staying in the same state. Equivalently, consider a classical autoregressive gaussian process with a single lag. The autocorrelation in the resulting time series of these processes could be quite high at significant time delays, but the conditional correlation, how much history is necessary to make probabilistic predictions, remains just one time step.

3.2 Context Trees. The Markov formulation does not by itself make the problem of the selection of word length go away. In a naive view, there are still up to A^D conditioning states with D-order conditional dependence. One could imagine using a statistical model selection criterion to select some intermediate optimal D∗, balancing statistics and model generality. We adopt a more flexible model, however.
Instead of just a single depth D∗ , giving all states with histories of length D∗ , we adopt a variable-depth tree. At any time, the history of recent symbols can be projected down to the longest matching suffix in the tree, called a terminal node. Our model is an elaboration of a suffix tree, called a context tree, where each node also
6 Technically, this assumes the Markov chain is sufficiently mixing, with one strongly connected component. Mixing means that the influence of dynamics long in the past has dissipated. Strongly connected means that eventually all states, even in a multistable system, are visited.

7 Directly means that no block entropies Ĥ_D need be calculated as an intermediate step. Rather, we calculate ĥ in one step using equation 3.1.
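Equation 3.1 can be illustrated on a two-state chain where the conditioning state is simply the previous symbol. This is an illustrative sketch with made-up transition probabilities, not a source from the paper.

```python
import math

def entropy(p):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(q * math.log2(q) for q in p if q > 0)

# Two conditioning states (the previous symbol), each with an emission
# distribution P(r|sigma); the new state is the symbol just emitted.
P = {0: [0.9, 0.1],   # after a 0: emit 0 w.p. 0.9
     1: [0.2, 0.8]}   # after a 1: emit 0 w.p. 0.2

# Stationary distribution mu solves mu = mu M; closed form for 2 states.
p01, p10 = P[0][1], P[1][0]
mu = {0: p10 / (p01 + p10), 1: p01 / (p01 + p10)}

# Equation 3.1: h = sum over states of H[P(r|sigma)] mu(sigma).
h = sum(entropy(P[s]) * mu[s] for s in P)
print(round(h, 4))  # prints 0.5533
```

Note that only one symbol of history is needed here even though the autocorrelation of the emitted stream decays slowly, illustrating the distinction between conditional dependence and absolute correlation drawn above.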
Figure 3: A context tree for a binary symbol stream. Each terminal node retains a distribution of future symbols that occurred after the word corresponding to the node. This distribution is an explicit representation of emitting the additional symbol, which, along with a buffer of recently emitted symbols, determines the transition to a new state.
stores probability distributions for emitting one future symbol.8 Figure 3 diagrams how these terminal nodes are akin to the Markov conditioning states. The root node λ corresponds to the empty string, and children of any node correspond to strings with one additional symbol prepended as an earlier symbol in time. The stochastic process underlying a context tree is called a tree machine, and its entropy rate is also given by equation 3.1 (see Kennel & Mees, 2002).

In summary, the conditional definition of entropy rate (definition 2.9) implies a Markov model, where the conditioning word length is the order of the Markov process. The generalization of a Markov model to a context tree provides a more flexible framework, where the conditioning word length need not be a single value (e.g., D∗) but is variable, specific to each conditioning state. The complexity of the model is specified not by the word length but rather by the tree topology. Compared to a fixed-order Markov model, a context tree can better adapt to the dependencies of an underlying source while maintaining statistical precision. Two remaining questions must be addressed:
- How do we select the appropriate tree topology (addressed in sections 3.3 through 3.6)?
- How do we estimate the quantities in equation 3.1 (addressed in section 4)?
8 Kennel and Mees (2002) detail specific computational issues and optimizations in implementing a context tree.
3.3 Modeling with the Minimum Description Length Principle. Selecting an appropriate tree topology is a statistical inference problem (Duda, Hart, & Stork, 2000). If we allow greater depths and longer conditioning sequences, we can represent more complex dynamics, but in return we must estimate more free parameters. This model complexity trade-off is the same phenomenon previously seen as opposing biases in the conventional estimation via block entropies. We resolve this problem by following the minimum description length (MDL) principle. We detour briefly to discuss how MDL addresses model selection through data compression ideas. We then return to symbol streams to calculate a modified log likelihood, or code length, for any observed data. We use these code lengths to select not just a single tree topology but a weighting over all possible tree topologies, which balances these model complexity issues according to Bayesian theory. We may weight any statistic computable over nodes, or trees, derived from the underlying data. In section 4, we apply this weighting to estimate the entropy rate.

We briefly review Rissanen's MDL theory (Rissanen, 1989) as a model selection principle (see also Balasubramanian, 1997). Model selection is treated as a data compression problem and can be viewed as a particular form of Bayesian estimation. Given a transmitter (encoder), a lossless communication channel, and a receiver (decoder) that know only the overall model class (but not any specific model), what is the smallest number of bits that we need to losslessly code and decode the observed data? The MDL principle asserts that the model that requires the smallest number of bits, termed the description length, is the most desirable one. The description length for expressing any data set is the sum of the code lengths to encode the model, L(model), and the residuals, L(data|model). Larger numbers of model parameters or decreased predictive performance increase the respective code lengths.
Models that minimize the sum L(data|model) + L(model) appropriately balance the predictive performance L(data|model) against the complexity L(model). Statistical estimation in MDL selects a single best model and examines functionals on this model.

Instead of estimating by selecting a model, we may also estimate by weighting (Solomonoff, 1964). Rather than selecting the one model with the lowest code length (the MDL principle), we may average over models, weighting good models more than bad ones. This is very similar to Bayesian averaging, weighting by the "posterior distribution." Given code lengths L estimated from our models and data, we weight any quantity Q by P(model|data) ∝ 2^{−L(data|model)} to get an estimator for Q,

Q̂_w(data) = Σ_models Q(model) 2^{−L(data|model)} / Σ_models 2^{−L(data|model)}.    (3.2)
This can often provide superior performance compared to choosing a single model, especially when no single model is clearly better than the others.
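The weighting of equation 3.2 is a two-line computation. The sketch below uses hypothetical models and code lengths; code lengths are shifted by their minimum before exponentiating so the weights do not underflow.

```python
def weighted_estimate(Q, L):
    """Equation 3.2: weight per-model statistics Q[m] by 2**(-L[m])."""
    L0 = min(L)                       # shift for numerical stability
    w = [2.0 ** (L0 - l) for l in L]  # weights proportional to 2**(-L)
    return sum(q * wi for q, wi in zip(Q, w)) / sum(w)

# Hypothetical example: three models with code lengths in bits; the third
# model is 20 bits worse and so contributes essentially nothing.
Q = [0.50, 0.55, 0.80]
L = [100.0, 101.0, 120.0]
est = weighted_estimate(Q, L)
print(est)
```

Because the weights fall off exponentially in code length, a model that dominates by even a few bits quickly dominates the average, reproducing the behavior described in the text.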
When one model dominates, the weighted estimator is very quickly dominated by the single best model and gives nearly identical results to the MDL principle.

3.4 Calculating Code Lengths of Symbol Streams. The first step in applying the MDL formalism to judge the appropriate model complexity (i.e., tree topology) is to calculate the code length, the number of bits necessary to transmit a symbol stream losslessly. We assume, temporarily, that we have a sequence of symbols {r_1, . . . , r_N} with no additional temporal structure. That is, these symbols are drawn independently from a fixed probability distribution θ over A bins. Section 2.5 described conventional estimators of entropy for this problem. Now, our task is to estimate a fair minimal code length for these data: How many bits would it take to transmit these N symbols down a channel and reconstruct them exactly at a hypothetical receiver? If θ were known, then it would take NH(θ) bits (also the negative log likelihood). However, θ is unknown, and the extra parametric complexity must be accounted for properly. Generally, it is not permissible to substitute an arbitrary entropy estimator Ĥ and call NĤ a code length.

Rissanen (1989) proposed the stochastic complexity L = − log ∫ P(c|θ) P(θ) dθ as an extended log likelihood (or code length) that properly accounts for this model complexity. The difference of two such stochastic complexities, Rissanen's test statistic for choosing the better model, is precisely the logarithm of the Bayes factor used in Bayesian hypothesis testing and model selection (Lehmann & Casella, 1998). Frequently these integrals cannot be done analytically without making large-N approximations. For small N, these approximations may result in pathological behavior such as nonmonotonicity or negative code lengths. In our application, such problems would be fatal, as smooth behavior is required down to even N = 1.
A simple and effective prescription for obtaining a fair code length and avoiding such pathologies is to adapt the predictive techniques used in sequential data compression (Cover & Thomas, 1991). At time step k + 1, make a probability estimate θ̂(s_{k+1}) conditioned only on the previously observed data s_1, . . . , s_k. The total code length of an entire symbol stream is Σ_j − log2 θ̂(s_j). The complexity cost of coding the parameters is included implicitly, because the early data are coded with poorer estimates than the later data.

One caveat to sequential prediction is that we must provide a small but finite probability for potential symbols not yet observed. Thus, we cannot predict probability zero for any symbol value, because if it should occur, it would give an infinite code length. We use the Krichevsky-Trofimov (KT) estimator (Krichevsky & Trofimov, 1981), which adds to all possible symbols a small "ballast" of weight β > 0. Given count vector c, this estimator predicts a probability for symbol j ∈ [1, A],

θ̂_{KT,j}(c) = (c_j + β) / Σ_k (c_k + β).    (3.3)
The free parameter β > 0 prevents estimating θ̂ = 0, which would give an infinite code length should that symbol actually occur.9 We apply the KT estimator sequentially, using the counts of symbols c^i seen through time i, that is, c^i_j = Σ_{l=1}^{i} δ(r_l = j). We may sequentially code the stream with a net predictive code length,

L_KT(c, β) = Σ_{i=1}^{N} − log2 θ̂_KT(r_i | c^{i−1}),    (3.4)
having seen all symbols, c = c^N. It is critical that the estimator for symbol r_i uses c^{i−1} and not c^i, so that a hypothetical receiver could causally reconstruct the identical θ̂ before seeing r_i and, given the compressed bits, decode the actual r_i. For this particular estimator (equation 3.3), we may conveniently compute the code length (equation 3.4) using only the final counts and not the sequence:

L_KT(c, β) = log2 [Γ(N + Aβ)/Γ(Aβ)] − Σ_{j=1}^{A} log2 [Γ(c_j + β)/Γ(β)].    (3.5)
It is particularly convenient that for the KT estimator, the sequential predictive formula can be collapsed so that the actual sequence order is unimportant. It turns out that the stochastic complexity integral of the multinomial (equation 2.4) with the Dirichlet prior can be evaluated exactly (Wolpert & Wolf, 1995), and, moreover, the answer is identical to equation 3.5. This three-way coincidence is specific to KT and not generically true for other estimators. In general, we favor sequential predictive computation as the least risky option over analytical approximation to stochastic complexity or Bayesian integrals.

For the KT estimator, selecting a priori β = 1/A tends to perform well empirically (Nemenman et al., 2002; Schurmann & Grassberger, 1996). The distribution of Shannon entropy under the Dirichlet prior happens to be widest with β = 1/A as well. In the final section, we outline an optimization approach for β to eliminate this free parameter if desired.

3.5 Context Tree Weighting. Of course, real symbolic sequences are not independent and identically distributed. Our larger model class is that of a
9 We intentionally use the same β as in equation A.1, because the KT estimate is also the Bayesian estimate of θ with the same Dirichlet prior, θ̂_KT = ∫ θ P(c|θ) P_Dir(θ) dθ. Note that Ĥ_Bayes is not the same as the estimator obtained by plugging θ̂_KT into the Shannon entropy formula. As θ̂_KT always smooths the empirical distribution toward being more uniform, plugging it in would result in a larger value for entropy than the naive estimator, just like Ĥ_MM. The Bayesian estimate Ĥ_Bayes may be larger than, smaller than, or equal to Ĥ_naive or Ĥ_MM, depending on the observations.
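A sketch of the KT computation: the sequential code length of equation 3.4 and the closed form of equation 3.5 agree exactly, since the running product of KT predictions collapses to ratios of gamma functions (log-gamma is used for the Γ ratios).

```python
import math

LOG2 = math.log(2.0)

def kt_sequential(symbols, A, beta):
    """Equation 3.4: code each symbol with the KT estimate (equation 3.3)
    built from the counts of previously seen symbols only."""
    counts = [0] * A
    L = 0.0
    for r in symbols:
        p = (counts[r] + beta) / (sum(counts) + A * beta)
        L -= math.log2(p)
        counts[r] += 1
    return L

def kt_closed_form(counts, beta):
    """Equation 3.5: the same code length from the final counts alone."""
    A, N = len(counts), sum(counts)
    L = (math.lgamma(N + A * beta) - math.lgamma(A * beta)) / LOG2
    for c in counts:
        L -= (math.lgamma(c + beta) - math.lgamma(beta)) / LOG2
    return L

s = [0, 1, 1, 0, 1, 0, 0, 1, 1, 1]
L_seq = kt_sequential(s, A=2, beta=0.5)
L_cf = kt_closed_form([s.count(0), s.count(1)], beta=0.5)
print(L_seq, L_cf)  # identical to floating-point precision
```

Permuting the symbols leaves both quantities unchanged, which is the order-independence noted in the text.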
context tree, as previously discussed. The issue now is which nodes of the tree (i.e., its topology) one should choose as the best, balancing statistics against modeling ability. We choose the optimal topology using equation 3.5 as a subcomponent of the calculation.

Consider for a moment the simplest model selection problem: compare coding using no history (i.e., the root node only) versus conditioning on one past binary symbol (i.e., descending one level of the tree). Which is the better model: P(r_i), or P(r_i|r_{i−1})? We calculate the code length L(n) to code the conditional observations for each node. The MDL model selection criterion is to compare L(parent) to Σ_c L(c), the sum of the children's code lengths. If the first is smaller, then P(r_i) is the better model; otherwise, the more complex model is better. Imagine that the process actually were independent. The distribution of c(parent) would look similar to those of the children, c(c). Usually L(parent) would be smaller than Σ_c L(c), because only one parameter (implicitly) needs to be encoded instead of two. If, however, the distributions at the children were significantly distinguishable from the parent, then the smaller code length would go to the more complex model, and the children would be preferred.

Instead of a hard choice, we may weight our relative trust in the two models as proportional to 2^{−L} (Solomonoff, 1964) and form a weighted code length at the node that captures this weighting. This process is continued recursively: at each node we consider the weighting between the model that stops "here" versus the composite of all models that descend from this subtree. This is the key insight of context tree weighting (CTW) (Willems et al., 1995). Considering binary trees of maximum depth D, CTW is like Bayesian weighting over the 2^{2^D} possible topologies, but it is efficiently implementable in time linear in the size of the input data.
The original invention of CTW was as a universal sequential source-coding (data compression) algorithm whose performance may be easily bounded by analytical proof. Here, we are interested in statistical estimation, and explicit sequentiality is not required. We now outline the steps in a batch estimation of the weighted context tree and its code length:

1. Form a context tree for the entire data set, retaining at each node n the conditional future counts c(n).

2. Compute and store, for every n, its local code length estimate L_e = L_KT(c(n), β), with equation 3.5.

3. Compute recursively the weighted code length at each node. With the coding probability distributions defined as P_e = 2^{−L_e} and P_w = 2^{−L_w}, and the child nodes of n denoted c, the fundamental formula of context tree weighting is (Willems et al., 1995)

P_w = (1/2) P_e + (1/2) Π_c P_w(c).    (3.6)
Defining, for clarity, L_c to be the sum of L_w over extant children c, L_c = Σ_c L_w(c), we have

L_w(n) = − log2 [(2^{−L_e} + 2^{−L_c}) / 2]    (3.7)
= 1 + min(L_e, L_c) − log2(1 + 2^{−|L_e − L_c|}).

In practice, this involves a depth-first descent of the context tree, computing L_w for all children before doing so for the present node. If there are no children, or the number of observations at n is but one (i.e., Σ_a c_a(n) = 1), then instead define L_w = L_e and stop the recursion. (With only one observation, any child would have an identical code length.)

4. At the root node λ, L_w(λ) is the CTW code length for the sequence.

For statistical modeling applications, representing the local code lengths L_e, L_w with standard finite-precision floating point is sufficient, contrary to comments in London et al. (2002) indicating a need for arbitrary-precision arithmetic to store coding distributions. Note that even for A > 2, the factors of 1/2 in equation 3.6 remain as is, reflecting the assumed prior distribution on context trees: a tree terminating at a given node has an equal prior probability as all subtrees that descend from that node. As discussed briefly in Willems et al. (1995), one could also choose to weight the current node and descendants by factors γ and 1 − γ for 0 < γ < 1.

3.6 Weighting General Statistics. The model selection principle behind CTW may be adapted to more general statistical estimation tasks than source coding. In particular, if we are able to evaluate some statistic Q(n) as a local function of the particular context and the counts observed there, we can find the weighted estimator Q_w that weights the local Q values with the same weighting as used in CTW. To compute Q_w:

1. Produce the context tree from the observed sequence and compute the per-node code lengths as described in section 3.5.

2. For every node, compute the local weighting factor W(n), the relative weight of the current node versus the subtree of its possible children:

W(n) = 2^{−L_e} / (2^{−L_e} + 2^{−L_c}).    (3.8)
If there are no children or the current node has only one observation, W(n) = 1.
3. Evaluate the local statistic Q(n) for each node from the counts, and compute the weighted statistic Q_w,

Q_w(n) = W(n) Q(n) + (1 − W(n)) Σ_c Q_w(c).    (3.9)

Like L_w, this requires a recursive depth-first search.

4. The final weighted statistic for the whole tree is Q_w(λ), the value at the root node.

The weighting can also be understood conceptually as an average over all possible nodes in all possible trees:

Q_w = Σ_n W(n) Q(n),    (3.10)
with W(n) here denoting the net product of the (1 − W) and W weighting factors descending from the root node to n. The operation of standard CTW is the special case Q = θ̂_KT δ(n = history) of this general statistical weighting.

4 Estimating the Entropy Rate with Context Tree Weighting

We now present an algorithm for estimating the entropy rate following the Markov formulation (see equation 3.1) and using the weighting over all tree topologies. We diagram and outline all of these results in parallel in appendix D. As an aside, we could estimate the entropy rate using the code length directly, ĥ_CTW = L_w(λ)/N; however, this estimator is always biased from above, because it obeys the redundancy inherent in any compression scheme (Kennel & Mees, 2002; London et al., 2002). Instead, we emphasize that we use the code lengths (and compression technique) only to define the appropriate weighting over Markov models. This is similar in spirit to string-matching entropy estimators (Amigo, Szczepanski, Wajnryb, & Sanchez-Vives, 2004; Kontoyiannis, Algoet, Suhov, & Wyner, 1998; Lempel & Ziv, 1976; Wyner, Ziv, & Wyner, 1998), which use the same key internal quantities as the Lempel-Ziv class of compression algorithms (Ziv & Lempel, 1977), but without the additional overhead necessary to produce a literal compressed representation.

4.1 Estimating ĥ with the Weighted Context Tree. We first fill an unbounded context tree with all observed symbols R and find the code lengths as in section 3.5. Recall that at each node, we accumulate the counts c = [c_1, . . . , c_A] of all future symbols (where we have suppressed n for notational convenience) that occurred after the string corresponding to the node.
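Sections 3.5 and 3.6 can be condensed into a short recursion. The sketch below is a minimal illustration for a binary alphabet with a fixed maximum depth, reusing the closed-form KT code length of equation 3.5; a real implementation would build the tree incrementally and handle an unbounded depth, as in Kennel and Mees (2002).

```python
import math

def kt_code_length(counts, beta=0.5):
    """Closed-form KT code length in bits (equation 3.5)."""
    A, N = len(counts), sum(counts)
    L = (math.lgamma(N + A * beta) - math.lgamma(A * beta)) / math.log(2)
    for c in counts:
        L -= (math.lgamma(c + beta) - math.lgamma(beta)) / math.log(2)
    return L

def context_counts(symbols, depth):
    """counts[ctx][r]: times symbol r followed context ctx (a tuple,
    ordered oldest to newest); children prepend an earlier symbol."""
    counts = {}
    for i in range(depth, len(symbols)):
        for d in range(depth + 1):
            ctx = tuple(symbols[i - d:i])
            counts.setdefault(ctx, [0, 0])[symbols[i]] += 1
    return counts

def weighted_length(ctx, counts, depth):
    """Equations 3.6/3.7: mix stopping at this node (L_e) against
    descending to its children (L_c = sum of the child L_w values)."""
    c = counts.get(ctx, [0, 0])
    L_e = kt_code_length(c)
    if len(ctx) == depth or sum(c) <= 1:
        return L_e
    L_c = sum(weighted_length((s,) + ctx, counts, depth) for s in (0, 1))
    return 1 + min(L_e, L_c) - math.log2(1 + 2.0 ** (-abs(L_e - L_c)))

symbols = [0, 1] * 50   # perfectly predictable given one symbol of context
counts = context_counts(symbols, depth=2)
L_root = kt_code_length(counts[()])          # memoryless coding at the root
L_ctw = weighted_length((), counts, depth=2)
print(L_root, L_ctw)
```

For this deterministic alternating stream, the weighted code length is a small number of bits, far below the roughly one bit per symbol paid by the memoryless root model, because the depth-1 conditional distributions are (nearly) deterministic.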
Every node corresponds to a potential Markov conditioning state σ, with

P̂(r|n) = c_r / Σ_{i=1}^{A} c_i,    (4.1)

µ̂(n) = N(n)/N = (Σ_{i=1}^{A} c_i) / N,    (4.2)
as the estimated transition and stationary probabilities, where N is the total number of symbols. The estimated transition probability P̂(r|n) has an alphabet of size A and an occupancy ratio N(n)/A ≫ 1 for heavily weighted nodes. This is in stark contrast to the explosion of A^D words in block entropies. Treating each node as a Markov state via equation 3.1, we estimate the entropy rate by using the local function

Q(n) = Ĥ[c|n] µ̂(n),    (4.3)
where Ĥ is any estimator of the entropy from observed counts. Following section 2, we select Q(n) = Ĥ_Bayes(c|n) µ̂(n) and substitute it into the general weighting procedure of section 3.6, yielding our direct rate estimate ĥ_Bayes as the value of Q_w at the root. We could also use Ĥ_MM instead of Ĥ_Bayes to get a rate estimate ĥ_MM, which also performs well. It does not lead as easily to a confidence interval, however, as we now discuss for ĥ_Bayes.

4.2 Bayesian Confidence Intervals for ĥ_Bayes. The previous section shows how to get a point estimate of the entropy rate. Now we want to find the distribution of entropy rates that seem likely from the data. For clarity, this algorithm is also summarized in appendix D. We simulate tree topologies according to their posterior likelihood and, at each of their nodes, simulate one possible Ĥ_Bayes as in section 2.2. Symbolically, we perform a Monte Carlo simulation of the integrand of the abstractly represented

ĥ_Bayes = ∫ h(Θ, µ) P(Θ|T, C) P(T) dΘ dT,    (4.4)
where T represents the topology of a particular tree, Θ the union of parameter vectors at all terminal nodes of that tree, C the observed future counts at those nodes, and h(Θ, µ) the Shannon entropy rate operator on Θ and the occupation probabilities µ of those nodes. The result will be samples from the integrand of equation 4.4. Its expectation value is an estimate of the entropy rate ĥ_Bayes, and the error bars, the estimated width of reasonable rates given sampling variation, follow from the quantiles of the empirical distribution.

We randomly draw tree topologies with relative probability equal to the implicit CTW weighting over tree topologies. Consider recursive descent
down from the root node λ. At each node n, draw a random variate ξ ∈ [0, 1). If ξ is less than the local weighting factor W(n), then stop here: this is a terminal node for the model. Otherwise, recurse down all extant branches until all paths have terminated. This yields a topology (a set of terminal nodes) of a good model, drawn with probability proportional to its weighting as implied by the CTW formulas. At each terminal node of this subtree, draw a single sample of P(H|c), using the MCMC procedure and the counts for that particular node. Call it H∗(n). Combine these draws to calculate one estimated sample of the entropy rate:

h∗_i = Σ_n H∗(n) µ̂(n).    (4.5)
The sum here is over only those terminal nodes selected stochastically for this single estimate. We repeat this procedure N_MC times, every time drawing a new topology and set of H∗(n) values and using equation 4.5 to calculate h∗_i for each such draw. This gives samples drawn from the underlying P(h|data). Their mean is the final rate estimate,
ĥ_Bayes = (1/N_MC) Σ_{i=1}^{N_MC} h∗_i.    (4.6)
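The Monte Carlo summary of equations 4.5 and 4.6, together with the quantile-based interval described next, can be sketched generically. Hypothetical h∗ samples stand in here for draws from the tree-topology posterior of equation 4.5.

```python
import random
import statistics

def percentile_interval(samples, level=0.90):
    """Empirical confidence interval from sorted Monte Carlo samples,
    e.g., the 10th and 190th of 199 sorted samples for a 90% interval."""
    s = sorted(samples)
    alpha = (1.0 - level) / 2.0
    lo_idx = int(alpha * (len(s) + 1)) - 1          # 10th element -> index 9
    hi_idx = int((1.0 - alpha) * (len(s) + 1)) - 1  # 190th element -> index 189
    return s[lo_idx], s[hi_idx]

# Hypothetical h*_i draws; in the full algorithm these come from sampled
# tree topologies and per-node entropy draws (equation 4.5).
rng = random.Random(1)
h_star = [0.56 + rng.gauss(0.0, 0.01) for _ in range(199)]

h_bayes = statistics.fmean(h_star)   # equation 4.6
lo, hi = percentile_interval(h_star)
print(h_bayes, lo, hi)
```

With 199 samples and level = 0.90, the indices reduce to the 10th and 190th sorted elements, matching the prescription in the text.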
The distribution of the samples provides a confidence interval for the estimate, accounting for natural statistical variation. For instance, to estimate a 90% confidence interval, sort 199 samples of h∗_i; the 10th and 190th elements of the sorted set approximate the true confidence interval. For computational simplicity, we use µ̂(n) = N_n/N, although a more correct µ is fully determined by the node probabilities. Appendix B discusses this issue in detail.

4.3 Asymptotic Consistency of ĥ_Bayes. As a data compression method, the arbitrary-depth CTW method has been proven to achieve the entropy rate for stationary and ergodic sources, meaning that lim_{N→∞} ĥ_CTW → h for all but infinitesimally improbable strings of length N (Willems, 1998). In statistical terms, this means that ĥ_CTW is an asymptotically consistent estimator of the entropy rate. We here give an argument suggesting the same is true of ĥ_Bayes. Both ĥ_CTW and ĥ_Bayes can be expressed as a global sum of the form 3.10, so we may write
ĥ_CTW − ĥ_Bayes = Σ_n W_n µ̂(n) [ L_KT(n)/N_n − Ĥ_Bayes(n) ].
Using the results in Wolpert and Wolf (1995) and equation 3.5, we computed the large-N asymptotic expansion of the difference of the local entropy estimators:

lim_{N_n→∞} (L_KT/N_n − Ĥ_Bayes) = C_1 (log N_n)/N_n + C_2(θ, A, β) (1/N_n) + · · ·
with C_1 = β(A − 2) + 1/2 and C_2(θ, A, β) a complicated formula depending only on the local θ_n and the parameters. As µ̂(n) = N_n/N ≤ 1, this gives
|ĥ_CTW − ĥ_Bayes| ≤ Σ_n W_n [ C_1 (log N)/N + |C_2|/N + · · · ]
                ≤ C_1 (log N)/N + O(1/N),    (4.7)
since Σ_n W_n = 1. As ĥ_CTW is asymptotically consistent and ĥ_Bayes → ĥ_CTW, ĥ_Bayes is asymptotically consistent. In practice, it appears that ĥ_CTW has more bias. Rissanen's theory (Rissanen, 1989) shows that because ĥ_CTW is an actual compressed code length per symbol, it, unlike ĥ_Bayes, cannot converge to the truth any faster than ĥ_CTW → h + O(log N / N).

5 Results

We compare the performance of our estimator ĥ_Bayes to several other estimators of entropy rate: the direct method ĥ_direct (Strong et al., 1998), two string-matching-based estimators ĥ_LZ and ĥ_SM, and a strict context-tree-weighting-based estimator ĥ_CTW (Kennel & Mees, 2002; London et al., 2002). ĥ_CTW and ĥ_direct have been discussed previously. ĥ_LZ is the Lempel-Ziv complexity measure based on a parsing of the input string with respect to an adaptive dictionary (Amigo et al., 2004; Lempel & Ziv, 1976). ĥ_SM averages the lengths of the longest matching strings in a previous fixed-size buffer (Kontoyiannis et al., 1998). See appendix C for a complete description of these algorithms.

5.1 Testing Convergence and Bias. The first example is a source that is a simple three-state hidden Markov model. Its transition matrix is
$$M = \begin{pmatrix} 0 & 1/3 & 2/3 \\ 1/5 & 4/5 & 0 \\ 1/10 & 0 & 9/10 \end{pmatrix}, \tag{5.1}$$
emitting a 0 if the first nonzero transition on each line is taken and a 1 if the second is taken. The entropy rate can be calculated analytically employing
1554
M. Kennel, J. Shlens, H. Abarbanel, and E. Chichilnisky
Figure 4: Convergence of several entropy rate estimates, hˆ Bayes , hˆ LZ , hˆ CTW , hˆ SM , on a simple, binary Markov process. The horizontal dashed line is the true entropy rate. hˆ Bayes converges quickest with the smallest overall bias and competitive variance. Data shown are ensembles over draws from the source, with error bars giving sample standard deviations.
the stationary distribution of the chain; $h \approx 0.5623$. Figure 4 shows the impressive performance of $\hat h_{\text{Bayes}}$ over the straight code-length estimator $\hat h_{\text{CTW}}$ and the string-matching methods.

The next example comes from more complex logistic map dynamics. The continuous-space dynamical system is $x_{n+1} = 1 - a x_n^2$. For $a = 1.7$, the system is in a generic chaotic regime giving dynamics in $x \in [-1, 1]$. The cutoff $x = 0$ yields
$$r_n = \begin{cases} 1, & x_n \ge 0 \\ 0, & x_n < 0, \end{cases}$$
a generating partition, and thus a binary symbolization preserving all the dynamical information from the continuous time series in the symbol stream, with a known entropy rate (Kennel & Buhl, 2003). Although the continuous equation of motion is simple, the symbolic dynamics are not, with significant serial dependence among the binary symbols. In fact, it is not a Markov chain
Figure 5: Convergence of several entropy estimates, hˆ Bayes , hˆ LZ , hˆ CTW , hˆ SM , on a symbolized logistic map. The horizontal dashed line is the true entropy rate. Again, hˆ Bayes converges quickest with the smallest overall bias and competitive variance. Data shown are ensembles over draws from the source, with error bars giving sample standard deviations.
of any finite order, and it shows deep dependence, making this a challenging problem. Figure 5 shows the results. Once again, $\hat h_{\text{Bayes}}$ provides a consistently more accurate estimate. Notice that whereas in the previous Markov chain example $\hat h_{\text{LZ}}$ was rather good and $\hat h_{\text{SM}}$ rather poor, here their performances are reversed. In our experience, this is frequently the case. In various examples where we already knew the exact value, one of $\hat h_{\text{LZ}}$ and $\hat h_{\text{SM}}$ came close to the truth and the other was quite erroneous, but we could never predict ahead of time which one would be better, making them difficult to use for analyzing experimental data where the truth is unknown. Finally, Figure 6 shows a comparison of $\hat h_{\text{Bayes}}$ to the direct method $\hat h_{\text{direct}}$ for the two sources. The plateau disappears for small $N$ or more complex dynamics, making estimation of $\hat h_{\text{direct}}$ unreliable.

5.2 Testing Error Bars. We now examine the quality of the confidence intervals in the estimation of $\hat h_{\text{Bayes}}$, that is, the quantiles of the $h^*$ distribution, equation 4.5. The test procedure is as follows: we produce $N_S$ time
Figure 6: Comparison of hˆ Bayes (circle, solid line) to hˆ direct (dash-dot line) for (A) a simple Markov process and (B) a symbolized logistic map. The horizontal dashed line is the true entropy rate. We implemented a simple extrapolation algorithm for calculating hˆ direct ; however, a human-directed approach could yield a more accurate estimate. (C) Block entropy for Markov process, and (D) for logistic map N = O(106 ). The plateau effect in the block entropies Hˆ D disappears for the more complex logistic dynamics or decreased N (number of symbols), making the estimate of hˆ direct less reliable.
series from the source and find $\hat h_{\text{Bayes}}(i)$ for each of them, $i = 1, \ldots, N_S$. We find the central 90% quantile of this distribution and take it as the real "error bar": the dispersion of the statistic under repeated samplings from the source. For each time series, there is an individual confidence interval estimated by the central 90% quantile of the distribution of $h^*$. Ideally, the
Figure 7: Testing the Bayesian confidence interval on 99 data sets of length N = 5000 from a pseudo-logistic source (gray circles). Average of hˆ Bayes and average of individually estimated confidence intervals (left-most triangle, black). Average of hˆ Bayes and 90% percentile on actual ensemble (left-most square, gray). The horizontal dashed line is the true entropy rate. The confidence interval is well calibrated: the ensemble average of estimators is very close to the exact value, and the average size of their confidence intervals (each estimated from a single data set) is also nearly equal to the variation in hˆ Bayes under new time series from the source.
ensemble-average size of the individual confidence intervals should match the variation of the statistic under actual new samples from the source. Our two sources with known entropy rates are the logistic map (as previously discussed) and a pseudo-logistic map, that is, a Markov chain estimated from a sample of logistic map symbols using a single tree model found by the method in Kennel and Mees (2002). Comparing these similar sources is instructive. Figure 7 shows the results on the logistic-like Markov chain. Here, at $N = 5000$, as for most $N$, the estimator performs exceptionally well, with almost no bias in the ensemble, and it estimates the sampling variation accurately. Across the wide range of $N$, the truth lies within the 90% intervals for about 90% of the samples from the source, implying the
Figure 8: Testing the Bayesian confidence interval on 99 data sets of length N = 5000 from a true, symbolized logistic source. This figure follows Figure 7. Here, the confidence interval is not well calibrated, and the actual variation under sampling from the source is larger than the size estimated by Bayesian posterior. Note that these data are for a value of N that shows a peculiarly large relative discrepancy in Figure 10.
error bars are calibrated well. The general weighted tree model appears to approximate this Markov chain quite well. Figure 8 shows the equivalent result for the challenging case of small $N$ in the real logistic map. Here, the ensemble average is slightly higher than the truth, but the variation in actual samples from the ensemble is larger than the variation estimated by the Bayesian posterior. In other words, for a small number of samples from this more complex source, the error bar is not well calibrated and engulfs the truth for only roughly 60% of the samples from the source. Our estimator performs well in an absolute sense despite this mismatch. This is also evident in Figures 9 and 10, which show similar summaries of the true variation and the estimated variation across a range of $N$. Paninski (2003) examined variance estimators for block entropy but found they could severely underestimate the true variance in the undersampled limit. By contrast, we obtain empirically more accurate error bars for the rate, staying in a good regime by dint of estimating the entropy rate directly with the model selection criteria.
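The calibration test just described can be sketched end to end on a toy problem. Here a plug-in entropy estimator with a bootstrap confidence interval stands in for $\hat h_{\text{Bayes}}$ and its posterior quantiles, and the source is i.i.d. Bernoulli rather than a tree source; all function names and parameter choices are ours:

```python
import math
import random

def plugin_entropy(bits):
    # plug-in (maximum-likelihood) entropy of a binary sample, bits/symbol
    p = sum(bits) / len(bits)
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def bootstrap_ci(bits, rng, nboot=199):
    # central 90% interval: sort 199 bootstrap replicates, take 10th and 190th
    reps = sorted(
        plugin_entropy([rng.choice(bits) for _ in bits]) for _ in range(nboot)
    )
    return reps[9], reps[189]

def calibration_test(p=0.3, n=200, n_series=50, seed=0):
    # compare the average individual CI width against the actual spread of
    # the estimator over fresh series drawn from the source
    rng = random.Random(seed)
    estimates, widths = [], []
    for _ in range(n_series):
        bits = [1 if rng.random() < p else 0 for _ in range(n)]
        estimates.append(plugin_entropy(bits))
        lo, hi = bootstrap_ci(bits, rng)
        widths.append(hi - lo)
    estimates.sort()
    ensemble_width = estimates[int(0.95 * n_series)] - estimates[int(0.05 * n_series)]
    return ensemble_width, sum(widths) / len(widths)
```

For a well-calibrated estimator, the two returned widths should be comparable; a single-dataset CI much narrower than the ensemble spread is the miscalibration seen in Figure 8.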
Figure 9: Convergence of the Bayesian confidence interval on the pseudologistic source: 90% percentile of hˆ Bayes distribution under sampling from source (gray squares) and average from posterior estimation (black triangles). The horizontal dashed line is the true entropy rate. For all N, the average confidence interval engulfs the exact value, and with increasing N, the estimated error bar size matches the actual size accurately.
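For reference, the true symbolized logistic source used in these tests can be generated in a few lines. A sketch; the initial condition and transient length are our arbitrary choices, not specified in the text:

```python
def logistic_symbols(n, a=1.7, x0=0.1, transient=1000):
    # iterate x_{k+1} = 1 - a * x_k^2 and emit the generating-partition
    # symbol r_k = 1 if x_k >= 0 else 0
    x = x0
    for _ in range(transient):   # discard a transient toward the attractor
        x = 1.0 - a * x * x
    out = []
    for _ in range(n):
        out.append(1 if x >= 0.0 else 0)
        x = 1.0 - a * x * x
    return out
```

For $a = 1.7$ the orbit stays in $[1-a, 1] = [-0.7, 1]$, so both symbols occur with substantial frequency, consistent with the rate of roughly 0.55 bits per symbol quoted earlier.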
Controlling for the discrepancy seen in Figure 10 is necessary for experimental data in which the truth is unknown. What we think is happening here is that for some samples from the source, the tree weighting detects statistically significant deep contexts, but for other data sets (more similar to the Markov approximation), the typical depth is comparatively lower. As a result, a finite bias remains, and the variation in the entropy rate on sampling from the true source is larger than expected under the weighted mixture of Markov models. We feel that no model-based statistical estimation procedure can ever be immune to this effect. As a philosophical principle, it is impossible for an estimator to guess, outside its model assumptions, what might happen in other, unseen data. Whatever it can use to guess (a model class, estimated parameters, and one observed data set) is already accounted for in its Bayesian posterior distribution. This mismatch can be greater with small amounts of data and complex information sources. We thus ask how we can discern from single trials whether we are within this regime. We suspect the result shown here may represent a rather difficult
Figure 10: Convergence of the Bayesian confidence interval on the true symbolized logistic map. Symbols are the same as in Figure 9. The estimated confidence interval size is smaller than the true size until N is large.
case: we believe that the symbolic dynamics of a zero-noise deterministic chaotic system are harder to estimate (on account of more deep, significant contexts) than what would typically occur in neural data. Furthermore, we highlighted a specific value of $N$ where the mismatch is particularly large.

5.3 Diagnostics. If structural mismatch is sometimes unavoidable, are there signs when it is lurking? If so, the statistical user might imagine that there remains some positive bias and that the truth might lie in the unlikely tails of the posterior (outside the Bayesian confidence interval) more often than nominal. Imagine the situation where the underlying source has a complex structure but the length of data is limited. For arbitrarily short data sets, the method will not choose models with a large number of parameters (i.e., a large word length), because that usually leads to overfitting. As more data are seen, more complex models become justifiable, and these typically exhibit a lower entropy rate. If adding still more data does not seem to increase the effective complexity further, then one could conclude that the estimated tree at that point is a good, convergent model. If we were estimating only a single tree topology, it would be easy to judge convergence of complexity and topology: the parameter count is the number of terminal nodes times $A - 1$.
Figure 11: Convergence of diagnostics over the number of symbols ($N$) for the pseudo-logistic source (squares) and the true symbolized logistic map (circles). (Top) $\tilde D$, expectation of the average effective depth over samples from the source. (Bottom) $\tilde k$, expectation of the estimated number of parameters (following the Rissanen ansatz) over samples from the source.
For the weighted trees, we offer two statistics that quantify the effective depth and the complexity. We define the effective depth, $\tilde D$, using the general formula 3.9 with $Q(n) = D(n)\hat\mu(n)$, where $D(n)$ is the depth of node $n$. $\tilde D$ is the effective, average conditioning depth or word length used for prediction. Rissanen's theory (Rissanen, 1989) provides an alternate, more interesting complexity measure. Asymptotically in $N$, the description length scales as
$$L = -\log \hat P + \frac{k}{2}\log N + O(k), \tag{5.2}$$
where $\hat P$ is the likelihood at the maximum likelihood solution and $k$ is the effective number of free parameters. We identify $L$ with the overall net CTW code length $L_w(\lambda)$, and $-\log \hat P \approx N h$. Since $h$ is unknown, we substitute one of our low-bias estimators (e.g., $\hat h_{\text{Bayes}}$) and invert to solve for $\tilde k$:
$$\tilde k = 2\,(L_w(\lambda) - N\hat h)/\log N. \tag{5.3}$$
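Given the quantities produced by a fitted tree, both diagnostics reduce to a few lines. A sketch with hypothetical inputs (terminal-node depths and counts for $\tilde D$; total code length, $\hat h$, and $N$ for $\tilde k$), taking logarithms base 2 throughout:

```python
import math

def effective_depth(depths, counts):
    # D~ : formula 3.9 with Q(n) = D(n) * mu_hat(n), where mu_hat(n) = N_n / N,
    # i.e., the count-weighted average conditioning depth
    total = sum(counts)
    return sum(d * c for d, c in zip(depths, counts)) / total

def k_tilde(L_w, h_hat, N):
    # equation 5.3: k~ = 2 (L_w - N * h_hat) / log2 N
    return 2.0 * (L_w - N * h_hat) / math.log2(N)
```

As a check, a code length of $N\hat h + (k/2)\log_2 N$ bits inverts back to exactly $k$ effective parameters.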
Figure 11 shows ensemble averages of $\tilde D$ and $\tilde k$ for the two sources. For smaller $N$, they track very well, but for middle ranges of $N$, the statistics for
the pseudo-logistic source have converged, but the ones for the true logistic data continue to increase, especially $\tilde k$, as expected. For the logistic-like Markov source, $\tilde k$ asymptotes at about 10, precisely the number of terminal nodes in the Markov model used as a source. Given but a single data set, one would not observe as smooth a rise, of course, but even still, the lack of convergence in $\tilde D$ or $\tilde k$ with $N$ is suggestive of structural mismatch. The prescription from the diagnostics is to collect more data, and if that is not possible, to consider the possibility that the estimated confidence intervals might not engulf the truth because of modeling-induced bias.

6 Discussion

6.1 Extensions. There remain several opportunities for improvement and future investigation. Estimating mutual information rates of driven input-output systems is, of course, a key problem in neuroscience. We address this in subsequent work by conditioning on the stimulus, similar to Strong et al. (1998), although conditioning on the stimulus can be done in any way that is feasible. If the stimulus history were on a small, discrete set, then it could be included in the context tree history along with past observations. Another large issue revolves around selecting the appropriate embedding or representation for a dynamical system (Kennel & Buhl, 2003), in particular for a spike train (MacKay & McCulloch, 1952; Rapp et al., 1994; Victor, 2002). There is no reason to suppose a priori that a fixed-width spike count discretization, though commonly used, is the best, as compared to, for example, some kind of discretized interspike interval representation. After selecting the appropriate embedding, the issue of the proper choice of alphabet size $A$ and Dirichlet parameter $\beta$ remains. In this work, we assumed that $A$ represented the nominal alphabet size, that is, the total cardinality of all possible symbols of the space in question. In simple cases, this is known by construction.
For example, the total cardinality is known a priori in a spike train discretized so that no more than $A$ spikes can ever occur in a bin. In this case, outside physical knowledge constrains the alphabet to size $A$. There are cases, however, where the true $A$ is excessively large or potentially unknown. The cardinality of observed symbols $m$ might be substantially less than $A$ (Nemenman et al., 2002; Paninski, 2003). A concrete example would be a large multichannel recording of $k$ distinct neurons. With only a spiking or nonspiking representation, the $A$ defined in a general outer-product sense would be $2^k$, but far fewer combinations might ever be observed in practice. We believe that more sophisticated representations and approaches are likely to be necessary in this case, with the code-length-based ideas proving useful for weighting among representations as well as histories. There are some very simple two-part coding approaches that may suffice if the nominal $A$ is not excessively large. Conceptually, the transmission sequence consists of sending the alphabet identity information at each
node before the symbols are encoded in the reduced alphabet. Assuming a uniform prior on $m$, it takes $\log_2 A$ bits to specify $m$ and then $\log_2 \binom{A}{m}$ bits to specify which of them occur, so that the code length may be expressed as $\log_2 A + \log_2 \binom{A}{m} + L_{KT}(m)$, with $L_{KT}(m)$ the KT code length over the $m$ actually observed symbols ($L_{KT}(1) = 0$). In the context tree, note that the $A$ in this expression is actually the value of $m$ for each node's parent, because the set of symbols used at a child is a subset of the parent's. For the root, of course, it is the true nominal $A$.

The text compression community has dealt with the unknown-$A$ issue frequently and has devised numerous solutions. A typical approach is to mix regular conditional estimators (over the nonzero alphabet) with some extra "escape" probability to allow for the situation that a new symbol may have occurred. The escape and ordinary probabilities are defined conditionally for sequential estimation in a variety of ways. Shtarkov, Tjalkens, and Willems (1995) demonstrated a zero-memory estimator whose leading asymptotic term is proportional to $m$ and not $A$. Nemenman (2002) and Orlitsky, Santhanam, and Zhang (2004) described estimators for the potentially unbounded alphabet case. All of these considerations would go into the zero-memory local code length estimator, modifying equations 3.3 and 3.5. These options may require actual sequential computation for the local code length and may not have a clean batch expression like equation 3.5.

For moderate $A$, another option is optimization of the net code length over a varying $\beta$. There needs to be a penalty for varying $\beta$, $L_\beta = -\log_2 P(\beta)$, over some assumed prior on $\beta$ (e.g., $P(\beta) = A^2 \beta e^{-A\beta}$, an arbitrary choice that has a maximum at $\beta = 1/A$). The total code length, $L_T(\beta) = L_{CTW}(\beta) + L_\beta$, is computed over $\beta$, and following the MDL criterion, the minimizing $\beta^*$ is located and $\hat h_{\text{Bayes}}$ evaluated there.
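The two-part code length for announcing the reduced alphabet is straightforward to compute. A sketch; the KT code length over the $m$ observed symbols is assumed to be supplied by the caller:

```python
import math

def two_part_code_length(A, m, L_KT_m):
    # log2(A) bits to announce m, log2 C(A, m) bits to say which m of the
    # A symbols occur, then the KT code length over those m symbols
    # (L_KT_m is 0.0 when m == 1)
    return math.log2(A) + math.log2(math.comb(A, m)) + L_KT_m
```

When every symbol is observed ($m = A$), the combinatorial term vanishes and only the $\log_2 A$ announcement overhead remains.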
The corresponding Bayesian weighting, an alternative to minimizing, is (Solomonoff, 1964)
$$\hat h = \frac{\int \hat h(\beta)\, 2^{-L_T(\beta)}\, d\beta}{\int 2^{-L_T(\beta)}\, d\beta}. \tag{6.1}$$
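On a discrete grid of $\beta$ values, equation 6.1 can be approximated directly. A sketch assuming precomputed arrays of total code lengths $L_T(\beta_i)$ and estimates $\hat h(\beta_i)$ on an equally spaced grid (so the $d\beta$ spacing cancels in the ratio); subtracting $\min L_T$ keeps the $2^{-L_T}$ weights from underflowing when code lengths are thousands of bits:

```python
def beta_weighted_entropy(L_T, h_hat):
    # discrete version of equation 6.1: weights w_i proportional to
    # 2^(-L_T_i), shifted by min(L_T) for numerical safety
    L0 = min(L_T)
    w = [2.0 ** (-(L - L0)) for L in L_T]
    Z = sum(w)
    return sum(wi * hi for wi, hi in zip(w, h_hat)) / Z
```

The result is always a convex combination of the grid estimates, so it lies between the smallest and largest $\hat h(\beta_i)$.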
This differs conceptually from the use of alphabet-adaptive zero-memory estimators, because here a single $\beta$ is chosen for all nodes, whereas the other estimators would adapt to the actual alphabet observed at each node of the tree, which induces some additional overhead. Of course, local optimization over $\beta$ is possible at each node, though here, unlike the global optimization case, careful attention to discretization issues for the free parameter would be necessary (Rissanen, 1989). Figure 12 shows the effect of varying $\beta$ on the discretized logistic map ($A = 2$). $\hat h_{\text{CTW}} = L_{CTW}/N$ has a modest minimum around $\beta \approx 0.5$. The effect on $\hat h_{\text{Bayes}}$ is smaller still, with a wide plateau; thus, for this data set with a small alphabet, this extension has little benefit beyond the heuristic $\beta = 1/A$. See Nemenman et al. (2002) and Nemenman, Bialek, and de Ruyter van Steveninck (2004)
Figure 12: Truth (dashed line), hˆ CTW (circles, upper curve), hˆ Bayes (diamonds, medium curve), hˆ Bayes utilizing Hˆ zero (squares, lower curve) over a range of β on 5000 symbols from the logistic map as previously described. The fact that the diamond curve comes closest to truth around β = 1/2 versus squares is coincidence; for other samples from the source and for other data set sizes, the situation is often reversed. The fact that hˆ Bayes depends less on β when utilizing Hˆ zero is a general phenomenon.
for a discussion of substantially larger $A$ and an interesting estimator that also averages over $\beta$, in a different way from equation 6.1.

6.2 Limitations. It is important to recognize beforehand when the method may give suboptimal results. Since we make explicit time-series models of the observed data, we might encounter difficulties when the model scheme cannot efficiently predict the data using the short-term context tree history. The method is based on universal compression, so that eventually, with sufficient data, it will find conditional dependencies, but the amount of data necessary may be excessively large. The CTW weighting prefers smaller and shallower trees (fewer parameters) over deep ones. If the important dependencies are very deep, they may not contribute effectively to a proper estimate. The result would be a positive bias, because the estimated tree is too shallow.
Consider a spike train discretized with an extremely small bin size (e.g., 1 ns), such that the number of symbols between individual spikes is large. The CTW tree would put most model weighting near the root, in effect viewing the spike train as a nearly Poisson process. This would occur because the context tree would not discern enough statistically significant dependence to overcome the complexity penalty. One might diagnose this if, at coarser bin sizes, the estimated entropy rate diverged from the Poisson assumption. Another difficult situation is multiscale dynamics, where a slow variable modulates the dynamics of a fast variable. Without additional hints, the tree would be sensitive only to the fast dynamics. One approach would be to define some discretized slow variable, appropriately averaged over the original data, and allow it to occur at some level or levels in the tree as context. If it induced a lower net code length, it would be deemed preferable by MDL considerations. More exotic dynamical systems with high complexity may present difficulties to the present method, as well as to all other general-purpose entropy rate estimation methods. For example, symbol sources with a substantial power-law subextensive term $f(D)$ would exhibit this pathology (Bialek, Nemenman, & Tishby, 2001). In these cases, deep contexts may have substantial conditional influence on the future, and only a small subset of the underlying state space may have been observed. This behavior may be nearly indistinguishable in experimental practice from ordinary nonstationarity in a time series. Such dynamics can have either zero or positive Shannon (Kolmogorov-Sinai) entropy rate, but the present estimator would likely overestimate the rate in both circumstances. A canonical example is the symbolic dynamics of the logistic map at the parameter boundary between periodicity and chaos, where the true entropy rate is zero, but it may take a large data set to notice this.
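The entropy rate quoted for the perturbed process below is just the binary entropy of the flip probability: flipping the bits of a zero-entropy sequence independently with probability $p$ yields rate $H_2(p)$, and $H_2(0.02) \approx 0.1414$, consistent with the caption of Figure 13. A quick check:

```python
import math

def binary_entropy(p):
    # H2(p) = -p log2 p - (1 - p) log2 (1 - p), bits per symbol
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)
```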
Perturbing the output of that process by flipping bits at random (in our case, with probability $p = 0.02$) gives a new process with the entropy rate of the flipping (Poisson-like) process but far higher complexity than a point Poisson process. Figure 13 shows the results of our estimator and of the direct method (Strong et al., 1998) using the block entropy estimator of Grassberger (2003). For ordinary extensive chaos with $h > 0$, one typically expects to find a region of linear behavior in $\hat H_D$ versus $1/D$ and to extrapolate to $1/D \to 0$, but this does not happen at the boundary of chaos. On both of these data sets, there are two apparently good scaling regions, leading to different entropy estimates, and the Ma bound does not provide an objective way to choose one over the other (though $N/2^{\hat H_D}$ does stay reasonably large as $D \to \infty$ for the no-noise logistic map and does not for the noisified set). Our estimator, $\hat h_{\text{Bayes}}$, automatically and correctly distinguishes the zero-entropy case from the positive-entropy case. For the latter, there is a noticeable positive bias, as expected on this difficult process, which has a very long effective memory.

On the positive side, recent theoretical results have shown that a particular context tree compression scheme has an asymptotic rate of convergence
Figure 13: (Upper) Logistic map at accumulation point, N = 100, 000. True h = 0. Hˆ D /D versus 1/D (circles), Ma lower bound (dotted line), hˆ Bayes (dash-dotted line). The direct method shows two apparently good scaling regions—one for larger D, which extrapolates to h ≈ 0.05, and the other to zero. (Lower) Logistic map at accumulation point, perturbed by Poisson process with p = 0.02. True h ≈ 0.1414 (solid line). Hˆ D /D (diamond), hˆ Bayes (dash-dotted line). Again, there are apparently two good scaling regions: the upper leads to close to the true entropy and the lower, extrapolating to zero. The Ma bounds (dotted lines) do not help constrain the proper scaling region for the direct method. hˆ Bayes gets in the right vicinity, though in the lower plot, there is a positive bias as expected.
in compression rate, which is as good as any other method for almost all strings (Martin, Seroussi, & Weinberger, 2004). While the actual rate of convergence depends on the character of the source, this result does suggest that one cannot do any better than context trees for compression as a class. J. Ziv (personal communication, 2004) conjectured the same would be found for most practical context tree methods such as CTW. It is not clear whether this result also applies to entropy rate estimation, but it gives confidence that the modeling approach is likely to be generally successful. We note that modeling, as in estimating probabilities of finite strings, is a distinct task from entropy estimation, and as entropy is a nonlinear function of the underlying distribution, different models might be good for dynamics estimation versus entropy estimation. Accurately modeling probabilities of
Estimating Entropy Rates
1567
words of variable length is a harder task. The present method does in fact attempt to make models, but note that minimizing the average compression rate (the implied loss function for choosing the tree weightings) is not as difficult as bounding the prediction error on specific strings; since the target of average compression rate in universal compression is bounded by the entropy rate, the conceptual difference may not be so large here. Recall that in the classical case, the entropy estimate depends only on the histogram of occupancies (Nemenman et al., 2004; Paninski, 2003), and the specific labels attached to these may be shuffled without changing the entropy estimate. In our method, once a tree is estimated, the specific labels attached to the contexts designating their history may also be shuffled without changing the entropy rate estimate.

6.3 Conclusions. We have discussed how to calculate the entropy rate of any observed, symbolic dynamical system. We highlighted difficulties with estimating entropy rates using traditional estimation techniques and presented a new approach, following the rationale of selecting a word length $D$ more rigorously. Specifically, we examined how to generate a more appropriate model for a symbol stream by using a Markovian assumption and following the minimum description length principle to select the appropriate model structure and complexity. This probabilistic model provides a framework for sampling the Bayesian posterior distribution, $P(\hat h_{\text{Bayes}} \mid \text{data})$, estimating the range of possible models, and their entropy rates, that could reasonably have given rise to our data. We thus can provide Bayesian confidence intervals on the entropy rate of a discrete (symbolic) time series.10 In simulation, our estimator outperforms other forms of entropy rate estimation, such as algorithms based on string matching and strict CTW compression, showing much lower bias and competitive variance.
For well-converged situations, the Bayesian confidence interval appears nearly exact, matching the true posterior distribution of the potential entropy rates. In the neural coding literature, substantial effort has focused on estimating block entropies well in the undersampled regime (Nemenman et al., 2004, 2002; Paninski, 2003; Strong et al., 1998; Treves & Panzeri, 1995; Victor, 2002; Victor & Purpura, 1997); however, little work has approached rate estimation from the perspective of model selection. Although several papers have applied ideas from compression to estimate the entropy rate (Amigo et al., 2004; Kontoyiannis et al., 1998; Lempel & Ziv, 1976), these string-matching estimators have followed the spirit of coincidence detection (Ma, 1981) and equipartition principles from coding theory, not explicit model
10. Fortran 95 source code as well as a MATLAB interface for both Bayesian entropy rate and information rate estimation is available online at http://www.snl.salk.edu/~shlens/info-theory.html.
selection. One previous work (London et al., 2002) did employ the CTW compression ratio itself, $\hat h_{\text{CTW}}$, as an estimator for analyzing neural data. In related work, we extend these methods to estimate the information rate, with a Bayesian confidence interval, in real neural data. By extracting a probabilistic model for the spike train conditioned on stimuli, we can sample the posterior distribution of average and specific mutual information rates associated with a neuron (Butts, 2003; DeWeese & Meister, 1999). The reliability of our estimator, coupled with Bayesian confidence intervals, increases our confidence in the validity of estimates of information rates in neural data and offers a method to compare these quantities across temporal resolutions, stimulus distributions, and correlations in neural responses (Brenner, Strong, Koberle, Bialek, & de Ruyter van Steveninck, 2000; Lewen, Bialek, & de Ruyter van Steveninck, 2001; Liu, Tzonev, Rebrik, & Miller, 2001; Reinagel & Reid, 2000; Rieke, Bodnar, & Bialek, 1995; Schneidman, Bialek, & Berry, 2003).

Appendix A: Markov Chain Monte Carlo for Bayesian Entropy Estimation

In this appendix we outline a numerical method to obtain samples from the integrand of equation 2.6. We select for $P(\theta)$ the Dirichlet prior,
$$P_{\text{Dir}}(\theta) = \frac{1}{Z}\,\delta\!\left(\sum_j \theta_j - 1\right) \prod_j \theta_j^{\beta-1}, \tag{A.1}$$
which smoothly parameterizes between the prior uniform on $\theta$ ($\beta = 1$) and the maximum-likelihood situation ($\beta = 0$). Except for the limiting cases of $\beta$, this prior does not have an intuitive motivation, but it does allow analytical computation of some integrals, explaining its historical use in classical Bayesian statistics as a "conjugate prior" for the multinomial distribution (see equation 2.4). Wolpert and Wolf (1995) analytically computed the expectation in equation 2.6,
$$\hat H_{\text{Bayes}} = \sum_i \frac{c_i + \beta}{N + A\beta}\,\bigl[\psi(N + A\beta + 1) - \psi(c_i + \beta + 1)\bigr], \tag{A.2}$$
with $\psi(z) = \frac{d}{dz}\log_2 \Gamma(z)$. We are interested in the distribution of the integrand, the posterior distribution of entropy, having observed a set of counts. Markov chain Monte Carlo is a numerical algorithm to draw samples from a probability distribution, particularly useful for estimating
$$\langle F \rangle = \int F(\theta)\,P(\theta)\,d\theta \tag{A.3}$$
for some function $F$ and probability distribution $P$. We refer the reader to a standard reference for MCMC (Gamerman, 1997) and focus on our specific implementation. In our problem,
$$P(\theta) = P(c \mid \theta)\,P_{\text{Dir}}(\theta), \qquad F(\theta) = H(\theta) = -\sum_j \theta_j \log \theta_j,$$
where we have avoided complicated normalizations. A key advantage is that MCMC works even if $P$ is not normalized. In many Bayesian problems, a numerator of $P$ is easy to compute, but the normalization may be analytically intractable. The inputs are $c$ and $\beta$, and the output is a succession of $H_k$, which samples from the posterior distribution $P(H \mid c)$. Define $N = \sum_j c_j$.

1. Set $\theta_0 = c/N$. Set $i = 1$. Set $k = 1$.
2. Randomly draw a candidate $\theta'$. For $j = 1, \ldots, A-1$ and $\sigma_j = \frac{1}{N}\left[\min(c_j, N - c_j) + \frac{1}{2}\right]^{1/2}$, set $\theta'_j = \theta_{i-1,j} + \sigma_j\,\mathcal{N}(0, 1)$, with $\mathcal{N}(0, 1)$ a gaussian random variate of zero mean and unit variance. Set $\theta'_A = 1 - \sum_{j=1}^{A-1} \theta'_j$.
3. If any element of $\theta'$ is $< 0$ or $> 1$, reject it a priori.
4. Otherwise, draw a uniform random variate $\xi \in [0, 1)$ and test the likelihood ratio $R = P(\theta')/P(\theta_{i-1})$. If $\xi \le \min(R, 1)$, accept $\theta'$; otherwise, reject it.
5. If $\theta'$ is accepted, set $\theta_i = \theta'$; otherwise $\theta_i = \theta_{i-1}$.
6. If $(i \bmod 100) = 0$, record $H_k = H(\theta_i)$ and increment $k$.
7. Increment $i$ by 1. Go to step 2 unless $N_{MC}$ values of $H_k$ have been recorded.

A key requirement for proper MCMC estimation is that the Markov chain of successive $\theta_i$ be sufficiently well mixing that a reasonable-length finite sample can well approximate the true integral. The parameter value of 100 (found to be good for $A = 2$ or $A = 3$) in the penultimate step is arbitrary and should be increased if the mixing is insufficient. The central freedom is the distribution of the candidate perturbations in step 2. For our problem, we have found empirically that this specific implementation produces good results for evaluating integrals and estimating quantiles in the Bayesian posterior distribution.
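The steps above can be transcribed directly. This is our sketch (not the authors' Fortran code), using a log-domain acceptance test in place of the ratio $R$ for numerical safety, and assuming all counts are positive so the starting point $c/N$ has finite posterior density:

```python
import math
import random

def entropy_bits(theta):
    # H(theta) = -sum_j theta_j log2 theta_j
    return -sum(t * math.log2(t) for t in theta if t > 0.0)

def log_target(theta, counts, beta):
    # log of the unnormalized posterior P(c | theta) * P_Dir(theta):
    # sum_j (c_j + beta - 1) log theta_j, -inf off the open simplex
    if any(t <= 0.0 or t >= 1.0 for t in theta):
        return -math.inf
    return sum((c + beta - 1.0) * math.log(t) for c, t in zip(counts, theta))

def sample_posterior_entropy(counts, beta=0.5, n_samples=200, thin=100, seed=1):
    # random-walk Metropolis following steps 1-7 in the text
    rng = random.Random(seed)
    N = float(sum(counts))
    A = len(counts)
    # step 2's proposal widths: sigma_j = sqrt(min(c_j, N - c_j) + 1/2) / N
    sigma = [math.sqrt(min(c, N - c) + 0.5) / N for c in counts[:A - 1]]
    theta = [c / N for c in counts]              # step 1: start at c/N
    samples = []
    i = 0
    while len(samples) < n_samples:
        i += 1
        prop = [theta[j] + sigma[j] * rng.gauss(0.0, 1.0) for j in range(A - 1)]
        prop.append(1.0 - sum(prop))             # stay on the simplex
        # steps 3-4: off-simplex candidates get -inf and are rejected;
        # the Metropolis test is done in logs to avoid overflow
        logR = log_target(prop, counts, beta) - log_target(theta, counts, beta)
        if math.log(max(rng.random(), 1e-300)) <= min(logR, 0.0):
            theta = prop                         # step 5: accept
        if i % thin == 0:
            samples.append(entropy_bits(theta))  # step 6: record every `thin`
    return samples
```

For a binary alphabet, the recorded entropies necessarily lie in $[0, 1]$ bits, and their quantiles give the Bayesian confidence interval of section 4.2.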
1570
M. Kennel, J. Shlens, H. Abarbanel, and E. Chichilnisky
Appendix B: Computing µ from Node Probabilities

Equation 4.5 is a rough, first-order approximation of a more accurate estimate. Suppose we have estimated a tree topology with given terminal nodes, and at each node we have the emission probabilities θ. Knowing the active contexts and symbol emission probabilities provides the transition probabilities as well. Theoretically, the stationary density µ is completely determined from those two quantities by the Chapman-Kolmogorov equation of detailed balance (Gamerman, 1997). If the set of terminal nodes and allowable transitions happens to define a first-order Markov chain, then the stationary density can be computed as an eigenvector problem (Cover & Thomas, 1991). This is not guaranteed: an arbitrary tree is not necessarily a first-order Markov chain, and computing its stationary µ can be more difficult. One solution is to employ the algorithm of Kennel & Mees (2002) to temporarily expand the tree topology until it is a first-order chain; at that point, the stationary µ is computable by the eigenvector method. Another solution is to iterate the detailed balance equations for the tree machine. Starting with µ(n) = N(n)/N, the tree can evolve density to density until µ converges numerically to a stationary distribution. When a transition is not strictly first order, the relative probabilities of appropriate past states are estimated using the previous iterate's µ. Both of these solutions make the computation of h*_i significantly more intricate. We have implemented these more accurate methods for finding µ, but in our empirical investigation, the effect on the estimate, in the ensemble average, appears to be minimal. Thus, for computational simplicity, we usually use µ(n) = N(n)/N.

Appendix C: String Matching Entropy Estimation

This estimator is used in section 5. With time index i and integer n, define Λ_i^n as the length of the shortest substring starting at s_i that does not appear anywhere as a contiguous substring of the previous n symbols s_{i−n}, . . . , s_{i−1}. Consider sequentially parsing a string from first to last. At some cursor location i, compute Λ_i^i, the length of the "phrase"; advance the cursor by that amount, and repeat until the entire string is parsed into M phrases. An entropy rate estimate for stationary and ergodic sequences is (Lempel & Ziv, 1976)

ĥ_LZ = (M/N) log₂ N,
measuring entropy in bits. Amigo et al. (2004) examined the performance of this estimator and found it compared favorably to the method of Strong et al.
(1998) for small data sets and was substantially superior to the better-known data compression method in Ziv & Lempel (1978). The string matching estimator of Kontoyiannis et al. (1998) is similar to the Lempel-Ziv estimator, but averages the string match length at every location instead of skipping by the length of the found phrase:

ĥ_SM = [ (1/n) ∑_{i=1}^{n} Λ_i^n ]^{−1} log₂ n.
To implement this with N observed symbols, we first remove a small number ℓ of symbols off the end and then split the remainder into two halves. String matching begins with the first element of the second half, (N − ℓ)/2 + 1, and examines the previous n = (N − ℓ)/2 characters. The length-ℓ excess padding is necessary to allow string matches from the end locations of the match buffer; ℓ is presumed to be a few times longer than the expected match length,

ℓ ≈ log n / h.
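Both appendix C estimators follow directly from the definition of Λ_i^n (a minimal Python sketch; the function names and the default padding heuristic ℓ = 4 log₂ N are our assumptions):

```python
import math

def match_length(s, i, n):
    """Lambda_i^n: length of the shortest substring starting at s[i] that
    does not occur as a contiguous substring of the previous n symbols."""
    window = s[i - n:i]
    L = 1
    while i + L <= len(s) and s[i:i + L] in window:
        L += 1
    return L

def h_lz(s):
    """Lempel-Ziv estimate (bits/symbol): parse s into M phrases, each the
    shortest substring unseen in the whole prefix, then (M/N) log2 N."""
    N, i, M = len(s), 0, 0
    while i < N:
        i += match_length(s, i, i)   # phrase length Lambda_i^i
        M += 1
    return M * math.log2(N) / N

def h_sm(s, pad=None):
    """String-matching estimate: average Lambda_i^n over every position of
    the second half, keeping a length-pad buffer at the end for overruns."""
    N = len(s)
    if pad is None:
        pad = 4 * int(math.log2(N))  # assumed heuristic for the padding
    n = (N - pad) // 2
    mean_lam = sum(match_length(s, i, n) for i in range(n, 2 * n)) / n
    return math.log2(n) / mean_lam
```

For a fair-coin binary string both estimates approach 1 bit per symbol, while a constant string gives values near zero.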
Appendix D: Overview of Method

Figure 14 provides a block diagram of the entire procedure for estimating the entropy rate of any discrete (symbolic) time series. The upper gray box details how to estimate an appropriate weighting of probabilistic models, while the lower gray box outlines how to estimate the range of likely entropy rates from this model. The curved boxes represent calculations, and squared boxes represent procedural aspects.

1. β is an arbitrary parameter for the Dirichlet prior. Selecting β = 1/A, where A is the alphabet size, works well in practice, but see section 3.4 for a complete discussion.
2. Although section 3.2 discusses how to build a context tree, many optimizations exist to ease their computational tractability (Kennel & Mees, 2002).
3. Recursively descend the tree, and calculate equation 3.5 at each node.
4. Beginning at the tree leaves, recursively ascend the tree and sum the code lengths of all children c at each node: L_c = ∑_c L_w(c).
5. Recursively descend the tree to calculate equation 3.8 at each node.
6. Draw a single sample ξ from a uniform probability distribution on [0, 1].
7. Descend from the current node to the next level, and select a single child to examine. All remaining nodes are placed on the stack for future processing.
Figure 14: Schematic block diagram of the entire procedure for estimating the range of entropy rates associated with a discrete time series.
8. The number of node observations is the sum of the counts of future symbols, N(n) = ∑_{i=1}^{A} c_i(n).
9. The stationary probability distribution µ̂(n) follows equation 4.2 (but see appendix B).
10. See section 4.2 as well as appendix A on how to generate a single Markov chain Monte Carlo sample from P(θ|c).
11. Use equations 2.4 and 2.5 to estimate the corresponding entropy.
12. Following equation 4.5, the contribution of this node to the entropy rate is Q_w(n) = H*(n)µ̂(n).
13. Sum up Q_w from all nodes to estimate the entropy rate from the weighted probabilistic model. The weighting W(n) is implicit in the probabilistic arrival at node n.
14. The final output of this procedure is a single sample from the continuous distribution P(h|data), denoted by a small line in the graph. We repeat the estimation procedure N_MC times to generate multiple samples from the underlying distribution.

Acknowledgments

This work was supported by NSF IGERT and La Jolla Interfaces in the Sciences (J.S.), and by a Sloan Research Fellowship and NIH grant EY13150 (E.J.C.). We thank Pam Reinagel for valuable discussion and support, and colleagues at the Institute for Nonlinear Science (UCSD) and the Systems Neurobiology Laboratory (Salk Institute) for tremendous feedback and experimental assistance.

References

Amigo, J., Szczepanski, J., Wajnryb, E., & Sanchez-Vives, M. (2004). Estimating the entropy rate of spike trains via Lempel-Ziv complexity. Neural Computation, 16, 717–736.
Balasubramanian, V. (1997). Statistical inference, Occam's razor, and statistical mechanics on the space of probability distributions. Neural Computation, 9, 349–368.
Bialek, W., Nemenman, I., & Tishby, N. (2001). Predictability, complexity, and learning. Neural Computation, 13, 2409–2463.
Bialek, W., Rieke, F., de Ruyter van Steveninck, R., & Warland, D. (1991). Reading a neural code. Science, 252, 1854–1857.
Borst, A., & Theunissen, F. (1999). Information theory and neural coding. Nature Neuroscience, 2, 947–957.
Brenner, N., Strong, S., Koberle, R., Bialek, W., & de Ruyter van Steveninck, R. (2000). Synergy in a neural code. Neural Computation, 12, 1531–1552.
Buracas, G., Zador, A., DeWeese, M., & Albright, T. (1998). Efficient discrimination of temporal patterns by motion-sensitive neurons in primate visual cortex. Neuron, 20, 959–969.
Butts, D. (2003). How much information is associated with a particular stimulus? Network, 14, 177–187.
Costa, J., & Hero, A. (2004). Geodesic entropic graphs for dimension and entropy estimation in manifold learning. IEEE Transactions on Signal Processing, 25, 2210–2221.
Cover, T., & Thomas, J. (1991). Elements of information theory. New York: Wiley.
de Ruyter van Steveninck, R., Lewen, G., Strong, S., Koberle, R., & Bialek, W. (1997). Reproducibility and variability in neural spike trains. Science, 275, 1805–1808.
DeWeese, M., & Meister, M. (1999). How to measure the information gained from one symbol. Network, 10, 325–340.
Duda, R., Hart, P., & Stork, D. (2000). Pattern classification (2nd ed.). New York: Wiley.
Field, D. (1987). Relations between the statistics of natural images and the response properties of cortical cells. Journal of the Optical Society of America A, 4, 2379–2394.
Gamerman, D. (1997). Markov chain Monte Carlo: Stochastic simulation of Bayesian inference. New York: CRC Press.
Gilmore, R., & Lefranc, M. (2002). The topology of chaos: Alice in stretch and squeeze land. New York: Wiley.
Grassberger, P. (2003). Entropy estimates from insufficient samplings. E-print, physics/0307138.
Hastings, W. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57, 97–109.
Hilborn, R. (2000). Chaos and nonlinear dynamics: An introduction for scientists and engineers. New York: Oxford University Press.
Johnson, D., Gruner, C., Baggerly, K., & Seshagiri, C. (2001). Information-theoretic analysis of neural coding. Journal of Computational Neuroscience, 10, 47–69.
Kennel, M., & Buhl, M. (2003). Estimating good discrete partitions from observed data: Symbolic false nearest neighbors. Physical Review Letters, 91, 084102.
Kennel, M., & Mees, A. (2002). Context-tree modeling of observed symbolic dynamics. Physical Review E, 66, 056209.
Kontoyiannis, I., Algoet, P., Suhov, Y., & Wyner, A. (1998). Nonparametric entropy estimation for stationary processes and random fields with applications to English text. IEEE Transactions on Information Theory, 44, 1319–1327.
Kraskov, A., Stogbauer, H., & Grassberger, P. (2004). Estimating mutual information. Physical Review E, 69, 066138.
Krichevsky, R. E., & Trofimov, V. K. (1981). The performance of universal coding. IEEE Transactions on Information Theory, 27, 199–207.
Lehmann, E., & Casella, G. (1998). Theory of point estimation. New York: Springer.
Lempel, A., & Ziv, J. (1976). On the complexity of finite sequences. IEEE Transactions on Information Theory, 22, 75–78.
Lewen, G., Bialek, W., & de Ruyter van Steveninck, R. (2001). Neural coding of naturalistic motion stimuli. Network: Computation in Neural Systems, 12, 317–329.
Lind, D., & Marcus, B. (1996). Symbolic dynamics and coding. Cambridge: Cambridge University Press.
Liu, R., Tzonev, S., Rebrik, S., & Miller, K. (2001). Variability and information in a neural code of the cat lateral geniculate nucleus. Journal of Neurophysiology, 86, 2789–2806.
London, M., Schreibman, A., Hausser, M., Larkum, M., & Segev, I. (2002). The information efficacy of a synapse. Nature Neuroscience, 5, 332–340.
Ma, S. (1981). Calculation of entropy from data of motion. Journal of Statistical Physics, 26, 221–240.
MacKay, D. (2003). Information theory, inference and learning algorithms. Cambridge: Cambridge University Press.
MacKay, D., & McCulloch, W. (1952). The limiting information capacity of a neuronal link. Bulletin of Mathematical Biophysics, 14, 127–135.
Martin, A., Seroussi, G., & Weinberger, M. (2004). Linear time universal coding and time reversal of tree sources via FSM closure. IEEE Transactions on Information Theory, 50, 1442–1468.
Metropolis, N., Rosenbluth, A., Rosenbluth, M., Teller, A., & Teller, E. (1953). Equation of state calculations by fast computing machines. Journal of Chemical Physics, 21, 1087–1092.
Miller, G., & Madow, W. (1954). On the maximum likelihood estimate of the Shannon-Wiener measure of information. Air Force Cambridge Research Center Technical Report, 75, 54.
Nemenman, I. (2002). Inference of entropies of discrete random variables with unknown cardinalities. E-print, physics/0207009. Available online: www.arxiv.org.
Nemenman, I., Bialek, W., & de Ruyter van Steveninck, R. (2004). Entropy and information in neural spike trains: Progress on the sampling problem. Physical Review E, 69.
Nemenman, I., Shafee, F., & Bialek, W. (2002). Entropy and inference, revisited. In T. G. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems, 14. Cambridge, MA: MIT Press.
Orlitsky, A., Santhanam, N., & Zhang, J. (2004). Universal compression of memoryless sources over unknown alphabets. IEEE Transactions on Information Theory, 50, 1469–1481.
Ott, E. (2002). Chaos in dynamical systems. Cambridge: Cambridge University Press.
Paninski, L. (2003). Estimation of entropy and mutual information. Neural Computation, 15, 1191–1254.
Perkel, D., & Bullock, T. (1968). Neural coding. Neurosci. Res. Prog. Sum., 3, 405–527.
Rapp, P., Vining, E., Cohen, N., Albano, A., & Jimenez-Montano, M. (1994). The algorithmic complexity of neural spike trains increases during focal seizures. Journal of Neuroscience, 14, 4731–4739.
Reinagel, P., & Reid, R. (2000). Temporal coding of visual information in the thalamus. Journal of Neuroscience, 20, 5392–5400.
Rieke, F., Bodnar, D., & Bialek, W. (1995). Naturalistic stimuli increase the rate and efficiency of information transmission by primary auditory afferents. Proceedings of the Royal Society of London B: Biological Sciences, 262, 259–265.
Rieke, F., Warland, D., de Ruyter van Steveninck, R., & Bialek, W. (1997). Spikes: Exploring the neural code. Cambridge, MA: MIT Press.
Rissanen, J. (1989). Stochastic complexity in statistical inquiry. Singapore: World Scientific.
Roulston, M. (1999). Estimating the errors on measured entropy and mutual information. Physica D, 125, 285–294.
Ruderman, D., & Bialek, W. (1994). Statistics of natural images: Scaling in the woods. Physical Review Letters, 73, 814–817.
Schneidman, E., Bialek, W., & Berry, M. (2003). Synergy, redundancy, and independence in population codes. Journal of Neuroscience, 23, 11539–11553.
Schurmann, T., & Grassberger, P. (1996). Entropy estimation of symbol sequences. Chaos, 6, 414–427.
Shannon, C. (1948). A mathematical theory of communication. Bell System Technical Journal, 27, 379–423.
Shlens, J., Kennel, M., Abarbanel, H., & Chichilnisky, E. (In preparation). Estimating information rates with Bayesian confidence intervals in neural spike trains. University of California, San Diego, and Salk Institute.
Shtarkov, Y., Tjalkens, T., & Willems, F. (1995). Multialphabet weighting: Universal coding of context tree sources. Problems of Information Transmission, 33, 17–28.
Simoncelli, E., & Olshausen, B. (2001). Natural image statistics and neural representation. Annual Review of Neuroscience, 24, 1193–1215.
Solomonoff, R. (1964). A formal theory of inductive inference. Part I. Information and Control, 7, 1–22.
Strong, S., Koberle, R., de Ruyter van Steveninck, R., & Bialek, W. (1998). Entropy and information in neural spike trains. Physical Review Letters, 80, 197–200.
Treves, A., & Panzeri, S. (1995). The upward bias in measures of information derived from limited data samples. Neural Computation, 7, 399–407.
Victor, J. (2000). Asymptotic bias in information estimates and the exponential (Bell) polynomials. Neural Computation, 12, 2797–2804.
Victor, J. (2002). Binless strategies for estimation of information from neural data. Physical Review E, 66, 051903.
Victor, J., & Purpura, K. (1997). Metric-space analysis of spike trains: Theory, algorithms, and application. Network, 8, 127–164.
Warland, D., Reinagel, P., & Meister, M. (1997). Decoding visual information from a population of retinal ganglion cells. Journal of Neurophysiology, 78, 2336–2350.
Willems, F. (1998). Context tree weighting: Extensions. IEEE Transactions on Information Theory, 44, 792–798.
Willems, F., Shtarkov, Y., & Tjalkens, T. (1995). The context-tree weighting method: Basic properties. IEEE Transactions on Information Theory, 41, 653–664.
Wolpert, D., & Wolf, D. (1995). Estimating functions of probability distributions from a finite set of samples. Physical Review E, 52, 6841–6854.
Wyner, A. D., Ziv, J., & Wyner, A. J. (1998). On the role of pattern matching in information theory. IEEE Transactions on Information Theory, 44, 2045–2056.
Ziv, J., & Lempel, A. (1977). A universal algorithm for sequential data compression. IEEE Transactions on Information Theory, 23, 337–343.
Ziv, J., & Lempel, A. (1978). Compression of individual sequences via variable-rate coding. IEEE Transactions on Information Theory, 24, 530–536.
Received June 23, 2004; accepted January 5, 2005.
LETTER
Communicated by Paul Tiesinga
Spike Timing Precision and Neural Error Correction: Local Behavior Michael Stiber
[email protected] Computing & Software Systems, University of Washington, Bothell, WA, 98011-8246 U.S.A.
The effects of spike timing precision and dynamical behavior on error correction in spiking neurons were investigated. Stationary discharges— phase locked, quasiperiodic, or chaotic—were induced in a simulated neuron by presenting pacemaker presynaptic spike trains across a model of a prototypical inhibitory synapse. Reduced timing precision was modeled by jittering presynaptic spike times. Aftereffects of errors—in this communication, missed presynaptic spikes—were determined by comparing postsynaptic spike times between simulations identical except for the presence or absence of errors. Results show that the effects of an error vary greatly depending on the ongoing dynamical behavior. In the case of phase lockings, a high degree of presynaptic spike timing precision can provide significantly faster error recovery. For nonlocked behaviors, isolated missed spikes can have little or no discernible aftereffects (or even serve to paradoxically reduce uncertainty in postsynaptic spike timing), regardless of presynaptic imprecision. This suggests two possible categories of error correction: high-precision locking with rapid recovery and low-precision nonlocked with error immunity. 1 Introduction This work is concerned with the effects of spike timing precision and dynamical behavior on the recovery of ongoing neural discharges after a perturbation, or an error. The concepts of “precision” and “error” are, of course, intimately linked. Determining how information is transmitted among neurons is an essential issue in computational neuroscience, with a spectrum of possibilities, ranging from overall average firing rates to timing of individual spikes (Dan, Alonso, Usrey, & Reid, 1998; Lestienne, 2001). Central to the question of where a presynaptic neuron’s code lies in this spectrum is the precision of spike timing necessary to elicit a particular response from a postsynaptic neuron. 
This follows logically, as two different presynaptic discharges that produce the same postsynaptic response have achieved the same functional result, and thus can be said to have carried the same message. Neural Computation 17, 1577–1601 (2005)
© 2005 Massachusetts Institute of Technology
This article addresses this issue directly by comparing recovery from errors in a spike train for two levels of spike timing precision. The goal is to test the feasibility of the idea that extra spike timing precision (in the sense that it is more than the minimum necessary to transmit a given amount of information) can be used for error correction. The errors that are used are missed spikes, based on the observed apparent unreliability of synaptic transmission (Katz, 1966; Allen & Stevens, 1994). Examination of the effects of noise and jitter in spike trains on neural behavior goes back decades (Segundo, Moore, Stensaas, & Bullock, 1963; Segundo, Perkel, Wyman, Hegstad, & Moore, 1968; Bryant, Marcos, & Segundo, 1973; Segundo, Tolkunov, & Wolfe, 1976; Kohn, da Rocha, & Segundo, 1981; Kohn & Segundo, 1983; Segundo, Vibert, Pakdaman, Stiber, & Díez Martínez, 1994). Briefly, the larger the number of input synapses (the smaller the effect of each presynaptic spike) and the smaller the correlation among them, the more their aggregate postsynaptic effect was like a DC (i.e., constant current) bias. Greater correlations and fewer, stronger synapses operated more like single, strong synapses, producing reliably repeatable responses in the postsynaptic cell. The effect of increasing noise on ongoing dynamical behavior in pacemakers was observed as steadily narrowing the input rate domains for phase-locked behaviors, abolishing those with narrower domains before those with broader ones. Consequently, periodic behaviors with broader domains in parameter space were more stable in the presence of noise. Later work showed this effect in rat spinal interneurons (Beierholm, Nielsen, Ryge, Alstrøm, & Kiehn, 2001), and presented similar results for 1:1 alternation (the nonpacemaker counterpart of phase locking) with quasiperiodic driving (Tiesinga, 2002).
Analysis of networks showed that coupling large numbers of cells could lead to lower output variability in the face of input jitter (Maršálek, Koch, & Maunsell, 1997; Tiesinga & Sejnowski, 2001). Additionally, periodic phase-locked dynamics were noted to be optimal for reduction of output jitter (the exception to this being near bifurcation boundaries, where input jitter could occasionally abolish locking) in network models (Tiesinga & Sejnowski, 2001). This letter investigates how timing precision in a presynaptic train interacts with the destruction of information caused by a single neuron's internal dynamics. It extends previous work by evaluating both periodic and aperiodic (quasiperiodic and chaotic) postsynaptic dynamics, as well as the effects of proximity to bifurcations bounding periodic behaviors. Additionally, rather than simply examining the amount of perturbation produced by presynaptic errors, it analyzes the time course of recovery. The ongoing presynaptic spike trains used here were stationary ones, with and without jitter, that induced a range of dynamical behaviors in the postsynaptic cell. Error recovery was measured by the time course of the return of the postsynaptic cell to its stationary behavior. This was determined by comparing pairs of simulations that were identical except for the presence of errors, so
that after an error, the spike train of the postsynaptic cell could be compared to that of one that did not see the error. Pairs of simulations were repeated for inputs with and without jitter, and error recovery was then compared in those two cases.
2 Methods

A well-characterized physiological model (see the appendix) of the crayfish slowly adapting stretch receptor organ (SAO) was used. In the living preparation, the SAO produces spike trains with a rate broadly proportionate to muscle stretch; at constant stretch, it produces pacemaker spike trains. It includes the recognized prototype of a moderately powerful GABAergic inhibitory synapse, and thus has been used as a living model to explore synaptic coding across inhibitory synapses in general. The model has been used previously to explain the dynamics behind paradoxical accelerations observed in the living preparation (Segundo, 1979; Buño, Fuentes, & Barrio, 1987) in response to pacemaker inhibitory postsynaptic potentials (PSPs) (Segundo, Altshuler, Stiber, & Garfinkel, 1991a, 1991b), as well as the SAO's responses to nonstationary spike trains (Segundo, Stiber, Altshuler, & Vibert, 1994; Segundo, Vibert, Stiber, & Hanneton, 1995; Segundo, Stiber, Vibert, & Hanneton, 1995; Stiber, Ieong, & Segundo, 1997; Segundo, Vibert, & Stiber, 1998). Presynaptic spike train timing generation and analysis were performed in MATLAB. Simulation was performed using custom C code and the ODEPACK differential equation integration library (Hindmarsh, 1983). Both pre- and postsynaptic spike trains were assimilated to point processes (Cox & Isham, 1980), and all analysis was based on interevent intervals. In the absence of presynaptic input, the SAO and its model behave as pacemakers, producing action potentials with an almost invariant interspike interval, N, its natural interval. In the presence of pacemaker presynaptic trains, both exhibit a variety of dynamical behaviors, including phase lockings, quasiperiodicities, and chaos, with characteristic dependencies on the presynaptic rate (Segundo et al., 1991a, 1991b; Stiber & Segundo, 1993; Stiber et al., 1996). The current work began by building on previous works' survey of the dynamical behaviors exhibited by the model.
Approximately 10,000 simulations were performed, varying both presynaptic rate (N/I , 1/I being the presynaptic rate for each simulation) and inhibitory postsynaptic potential (IPSP) strength (P) within biologically plausible limits. This was used to produce an Arnol’d map, or two-dimensional bifurcation diagram, which categorized each simulation’s dynamical behavior within the two-dimensional (N/I, P) plane. Three values of P were then chosen for more detailed examination. These were dubbed weak, moderate, and strong IPSPs, based on the ratio of locked (periodic) to nonlocked (aperiodic) behaviors they
engendered (≤20%, ≈50%, and ≈100%, respectively). For the current investigation, a moderate value of IPSP strength was chosen. A range of normalized presynaptic rates 0.2 ≤ N/I ≤ 2.0 was then explored using one-dimensional bifurcation diagrams. Here, such diagrams were constructed by plotting postsynaptic spike phase (the cross interval from a postsynaptic spike back to the most recent presynaptic spike) versus N/I. These allowed easy determination of the distribution and organization of qualitatively different dynamical behaviors along the presynaptic rate scale (types of behaviors, behavior ranges, and bifurcation locations). Each bifurcation diagram involved a set of 90 to 200 simulations, with diagrams produced for pacemaker and jittered pacemaker presynaptic spike trains. Values of jitter (w; see below) ranging from 0.01I to 0.1I (±1%–±10%) were examined; their bifurcation diagrams were similar overall. Results for ±1% jitter are presented here. This was viewed as a stricter test of the hypothesis (that high-precision inputs would result in faster recovery from errors than low-precision ones, and thus the extra precision could be viewed as a type of redundant, error-correcting code) than larger values of w, as larger values would certainly result in the model's state being perturbed farther from the stationary attractors (and thus it would be a much shorter distance, and therefore faster, for the model to return to its attractor for high-precision input). Work on evaluating in detail the effects of varying levels of jitter on error recovery is ongoing (Stiber & Holderman, 2004b). Four sets of 90 to 100 simulations used to produce bifurcation diagrams (high precision, low precision, and each with errors; see below) then served as a database describing error responses of the model.
This communication focuses on the details of the model’s recovery after an error and the effects of presynaptic spike timing precision on that recovery, and so a few illustrative behaviors were selected. The behaviors chosen are archetypical of the full range of dynamical behavior types exhibited by driven nonlinear oscillators: 1:1 locking (period 1), 1:2 locking (period 2), 2:3 locking (period 3, the longest periodic behavior for this model with significant extent along the N/I scale), two quasiperiodicities (low—N/I < 1.0—and high—N/I > 1.0—rate), and chaos. An additional behavior was chosen to illustrate the effects of proximity to a bifurcation. A more complete examination of the global behavior of error correction (e.g., two- and three-dimensional bifurcation structure) is under way (Stiber & Holderman, 2004a). For each example behavior, simulations involved generating two pairs of almost identical presynaptic spike trains, as schematized in Figure 1. For each set of simulations, a reference high-precision train, H, with spike times {s1 , s2 , . . . , sn }, was first generated. As this was a pacemaker train, all interspike intervals were identical (∀k, I = Ik = sk − sk−1 ). I was the independent variable, chosen to elicit the example dynamical behaviors from the model. An error was defined to be complete failure of synaptic transmission, and so a set of spike numbers, K = {k1 , k2 , . . . , km }, was generated randomly such that all errors were separated by at least 15 s (to allow the
Figure 1: Simulation methods. Pairs of high (H, He ) or low (L, Le ) precision spike trains were applied via the simulated inhibitory fiber (IF). Postsynaptic spike times (T ) were recorded for analysis.
neuron to return to its stationary behavior before each error). Presynaptic spikes at times H_K = {s_{k_1}, s_{k_2}, . . . , s_{k_m}} were eliminated from H to produce a high-precision erroneous train, H_e = H − H_K. A low-precision counterpart of H was generated by jittering each spike s_k ∈ H by a value u_k taken from a uniform distribution with range ±wI. In the current communication, a small value of w (w = 0.01, ±1% jitter) was used. The result of applying this jitter was a low-precision train L = {s_1 + u_1, s_2 + u_2, . . . , s_n + u_n}. A low-precision erroneous train, L_e = L − L_K, was then produced by eliminating the spikes numbered K at times L_K = {s_{k_1} + u_{k_1}, s_{k_2} + u_{k_2}, . . . , s_{k_m} + u_{k_m}}. Subsequent analysis involved pooling timing data for postsynaptic spikes near errors in these sets of four simulations. As a double check, multiple simulations were run for each set of simulation parameters (with different realizations of {u_k}), and analysis was performed on the pooled data from all; as might be expected, there was no difference in results for m errors taken from n simulations versus m errors taken from a single simulation. As Figure 2 illustrates, the first step of analysis was based on comparing the times of postsynaptic spikes T = {. . . , t_{i−1}, t_i, . . .} just before and after the error times for the erroneous and error-free cases. For each error k_j ∈ K, j = 1, 2, . . . , m, the set of c postsynaptic spikes just before and after it was selected (here, 5 spikes before and 25, 30, or 45 spikes after, depending on the rate of production of postsynaptic spikes and the recovery time; thus c was 30, 35, or 50). These spike times can be renamed relative to the error number j as {t_{j,1}, t_{j,2}, . . . , t_{j,c}} (the bottom spike train in Figure 2).
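The four-train construction just described can be sketched as follows (a minimal NumPy version; the function and variable names are ours, and the parameter values in the usage note are illustrative, not from the paper):

```python
import numpy as np

def make_trains(I, n_spikes, w, m, min_sep=15.0, seed=0):
    """Build the four presynaptic trains of the simulation protocol:
    H  : pacemaker train with constant interspike interval I
    He : H with m randomly chosen spikes deleted (errors >= min_sep s apart)
    L  : H with each spike jittered by u_k ~ Uniform(-w*I, +w*I)
    Le : L with the same m spike numbers deleted"""
    rng = np.random.default_rng(seed)
    H = I * np.arange(1, n_spikes + 1)                   # s_k = k * I
    L = H + rng.uniform(-w * I, w * I, size=n_spikes)    # jittered copy
    K = []                                               # error spike numbers
    while len(K) < m:
        k = int(rng.integers(0, n_spikes))
        if all(abs(H[k] - H[j]) >= min_sep for j in K):  # keep errors apart
            K.append(k)
    keep = np.ones(n_spikes, dtype=bool)
    keep[K] = False
    return H, H[keep], L, L[keep], sorted(K)
```

For example, I = 0.5 s, 400 spikes, w = 0.01, and m = 5 errors yields a 200 s pacemaker train whose deleted spikes are at least 15 s apart in both the high- and low-precision versions.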
Figure 2: Perturbation graph generation. The displacement of a postsynaptic spike after an error (bottom spike train) from the time of the corresponding spike in the situation where there was no error (top spike train) is plotted versus time since the error.
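The displacement computation illustrated in Figure 2 amounts to a cross-interval lookup from each spike of the erroneous run back to the error-free run (a minimal NumPy sketch; function and variable names are ours):

```python
import numpy as np

def perturbations(t_err, t_free):
    """For each postsynaptic spike time in the erroneous run, return the
    cross interval back to the most recent spike of the error-free run."""
    t_free = np.sort(np.asarray(t_free, dtype=float))
    t_err = np.asarray(t_err, dtype=float)
    # index of the last error-free spike at or before each erroneous spike
    idx = np.searchsorted(t_free, t_err, side='right') - 1
    if np.any(idx < 0):
        raise ValueError("erroneous spike precedes all error-free spikes")
    return t_err - t_free[idx]
```

Because the synapse is inhibitory, the erroneous run never has fewer spikes than the error-free run, so every lookup finds a reference spike.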
For each postsynaptic spike in the erroneous case, {t_{j,g}}, j = 1, 2, . . . , m, g = 1, 2, . . . , c, the cross interval back to the most recent postsynaptic spike in the error-free case (top spike train in the figure), ψ_{j,g}, was computed. As the simulations involve an inhibitory synapse, the number of postsynaptic spikes in the erroneous case was always greater than or equal to the number in the error-free case; this was the reason for using the former spikes as the reference for computing ψ_{j,g}. This was done for each pair of high- and low-precision input simulations to yield ψ^H_{j,g} and ψ^L_{j,g}, respectively. Thus, ψ^H_{j,g} and ψ^L_{j,g} measure the shift, or perturbation, of each postsynaptic spike from the time it would have occurred if there had been no error. The graph of (t_{j,g} − s_{k_j}, ψ^H_{j,g}) or (t_{j,g} − s_{k_j} − u_{k_j}, ψ^L_{j,g}), such as in Figure 6A, was termed a perturbation graph. In every case, the perturbation graph compares a simulation (either high or low precision) with errors to the identical (canonical) one without errors. Therefore, any scatter of points is due to either the simulation accuracy (which, for lockings at least, can be judged by the points before the error
Figure 3: Recovery plot construction. A pair of corresponding high- and low-precision perturbation graphs (top) are subtracted to produce perturbation differences. The recovery plot is the plot of the range of differences, Δ_g.
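In code, the recovery-plot quantities of Figure 3 reduce to a few array operations over the m × c perturbation matrices (a sketch; names are ours):

```python
import numpy as np

def recovery_ranges(psi_low, psi_high):
    """Given (m errors x c spikes) arrays of low- and high-precision
    perturbations, return Delta_g = max_j(delta_jg) - min_j(delta_jg)
    for each post-error spike index g."""
    delta = np.asarray(psi_low) - np.asarray(psi_high)   # delta_{j,g}
    return delta.max(axis=0) - delta.min(axis=0)
```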
time) or the presence of errors themselves. Jitter, in and of itself, did not produce any scatter because each presynaptic spike in both simulations occurred at identical times in identical simulations (neglecting errors). Perturbation graphs show only the effects of errors and possible interaction of errors with the presence of jitter. The high- and low-precision perturbations, ψ^H_{j,g} and ψ^L_{j,g}, were compared in recovery plots. Figure 3 illustrates how such plots were generated. For each pair of cross intervals, the difference, δ_{j,g} = ψ^L_{j,g} − ψ^H_{j,g}, was computed. Because of the variation in the individual error responses in the low-precision case (the scatter of points in the perturbation plot), δ_{j,g} typically took on a range of values for each g. Rather than examine the average difference, the decision was made to compute the range of differences, Δ_g = max_j(δ_{j,g}) − min_j(δ_{j,g}). A difference δ_{j,g} = 0 indicated that the high- and low-precision perturbations, for that particular error j and postsynaptic spike g, were identical. A range Δ_g = 0 indicated that for all errors, the perturbation of postsynaptic spike g was the same for high and low precision; that is, low-precision input did not have an effect on the perturbation. Thus, a range Δ_g > 0 indicated that some low-precision perturbations differed from high-precision ones. In other words, because these were simulations, none of the postsynaptic spikes were considered outliers. As an
alternative, the variance of δ_{j,g} (for each value of g) could have been used; in practice, it produced equivalent results. So a recovery plot such as in Figure 4 is a graph of Δ_g versus the mean offset e_g = t_{j,g} − s_{k_j}, for each value of g = 1, 2, . . . , c. It collapses the gth set of perturbations to a single point and compares high- and low-precision perturbations, showing the time course by which the worst low-precision error responses approached those for high precision. The primary motivations behind the use of recovery plots were easier analysis of the time course of recovery than with perturbation graphs and elimination of the need to compare pairs of perturbation plots by eye to evaluate the effects of low versus high precision. Assuming the erroneous simulation fully recovered before each error, the values of Δ_g for e_g < 0 can also serve to test whether low-precision responses were still significantly perturbed. Comparing Δ_g for e_g < 0 for different simulation parameters allows one to judge whether the simulation was recovered. To characterize the ongoing dynamical behavior exhibited by the SAO, cross intervals between the pre- and postsynaptic times, H and T, respectively, were used. Since H was a pacemaker process, the cross intervals φ_i = t_i − s_k from postsynaptic spikes back to most recent preceding presynaptic spikes were termed phases (Winfree, 1980; Glass & Mackey, 1988). The phase return map, plotting (φ_i, φ_{i+q}), trivially indicates p:q locking, since φ_i = φ_{i+q}. Quasiperiodic responses produce a one-dimensional curve without a maximum for q = 1, while chaotic behaviors have return maps with maxima or that are not one-dimensional curves (Abraham & Shaw, 1984; Rapp, Zimmerman, Albano, deGuzman, & Greenbaun, 1985; Ruelle, 1989). For brevity, return maps are omitted from section 3 for phase lockings.

3 Results

In this section, error recovery during locked, quasiperiodic, and chaotic responses is considered.
A final subsection examines the effect of proximity to a bifurcation boundary on error recovery.

3.1 Phase Lockings. In previous work on both the SAO and this model, phase-locked responses were found to be the most robust in the presence of nonstationary presynaptic spike trains, with 1:1 locking being the most robust among lockings (Segundo, Vibert, Stiber, & Hanneton, 1993; Segundo, Stiber, et al., 1994; Segundo, Vibert, et al., 1995; Segundo, Stiber, et al., 1995; Segundo et al., 1998; Stiber et al., 1997). This has also been noted in other preparations and simulations (Beierholm et al., 2001; Tiesinga & Sejnowski, 2001; Tiesinga, 2002). Thus, analysis began with these behaviors in the expectation that they would exhibit the greatest immunity to low-precision jitter and the fastest error recovery times (Stiber, 2003). Figure 4 presents typical results for 1:1 locking. For high-precision input, the response to every error was identical (see Figure 4A), with a
Figure 4: Results for center of 1:1 locking (I = 108.78 ms, N/I = 0.96). High-precision perturbation graph (A) shows identical responses to all errors; low precision (B) shows a range of responses. Recovery plot (C) shows long-duration recovery. m = 49 errors, c = 25 spikes per error.
maximum perturbation of approximately ψ^H_{j,g} = 5.25 ms. For low precision (see Figure 4B), the maximum perturbation was ψ^L_{j,g} ≈ 6 ms, with up to a 1.2 ms range for any single value of g. An exponential fit to the points in the recovery plot (see Figure 4C) in the range 0.4 ≤ e_g ≤ 0.9 yielded a recovery time constant of τ = 0.20 s (τ/I = 1.84), with a slower recovery proceeding thereafter. This is the rate at which the "worst" low-precision error responses approached that for high precision. Low-precision responses were still significantly perturbed 2 s after an error. Perturbation graphs are shown in Figures 5A and 5B and a recovery plot in Figure 5C for 1:2 locking. Note that the points occur in pairs, with small differences between the pairs of points leading to the appearance of oscillations in the recovery plot. These correspond to the pairs of postsynaptic spikes that occur between each pair of presynaptic spikes (two categories of cross intervals). As with 1:1 locking, all responses to errors for high-precision input were identical (see Figure 5A), the maximum perturbation in this case being 19 ms. Low-precision maximum perturbations (see Figure 5B) ranged from about 17 ms to 20 ms. The recovery time constant for 0.51 ≤ e_g ≤ 1.08 was τ = 0.27 s (τ/I = 1.19). Recovery slows considerably after e_g = 1.2 s, at about 10 times the pre-error difference range. As Figure 6A shows, 2:3 locking, unlike the previous lockings, exhibited some variability even in its high-precision response to repeated errors, qualitatively similar to low-precision simulations, though much smaller. Note the linear scale in Figures 6A and 6B. The maximum high-precision error perturbation was around 24 ms with a 1 ms spread; for low precision, this was 27 ms with a 5 ms spread (ignoring bimodality; see below). In both cases, there were triplets of clusters, corresponding to the three postsynaptic spikes occurring for each pair of presynaptic spikes (i.e., this was a period 3 behavior).
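The recovery time constants quoted here come from exponential fits to (e_g, Δ_g) points; one way to sketch such a fit (ours, via linear regression on log Δ_g) is:

```python
import numpy as np

def recovery_time_constant(e_g, delta_g, lo, hi):
    """Fit Delta_g ~ a * exp(-e_g / tau) over lo <= e_g <= hi and return tau (s)."""
    e_g, delta_g = np.asarray(e_g), np.asarray(delta_g)
    sel = (e_g >= lo) & (e_g <= hi) & (delta_g > 0)
    slope, _ = np.polyfit(e_g[sel], np.log(delta_g[sel]), 1)
    return -1.0 / slope
```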
Differences among the three categories of cross intervals explain what appear to be oscillations in the recovery plot (see Figure 6C), as with 1:2 locking. Clusters had a bimodal distribution of perturbations, with some small (around 0–2 ms) and some large. The recovery plot shows an initial peak at less than an order of magnitude more than the pre-error value and a slow recovery until around e_g = 2.5 s; there were additional lower peaks after 2.5 s (before the eventual return to pre-error values, as indicated by the points before the error). The range of differences before the error (e_g < 0) was on the order of 1 ms, compared to 2 to 5 µs for the 1:1 and 1:2 lockings.

3.2 Quasiperiodicities. A quasiperiodic behavior is one in which the phase of the response shifts with respect to the stimulus, eventually taking on any value within some range (Ermentrout & Rinzel, 1984). This is made apparent by the phase return map in Figure 7A, which is a one-dimensional curve covering the range 0 to I. Note that a phase of zero is equivalent to one of I, so the graph corresponds to a torus and the curve is not discontinuous.
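The phases and return maps used in this section can be computed directly from the two spike trains (a sketch; function names are ours, and both trains are assumed sorted):

```python
import numpy as np

def phases(T, S):
    """Phase of each postsynaptic spike in T relative to the most recent
    preceding presynaptic spike in S."""
    T, S = np.asarray(T), np.asarray(S)
    idx = np.searchsorted(S, T, side="right") - 1
    keep = idx >= 0                  # drop postsynaptic spikes before the first input
    return T[keep] - S[idx[keep]]

def return_map(phi, q=1):
    """Points (phi_i, phi_{i+q}); for p:q locking these lie on the diagonal."""
    return phi[:-q], phi[q:]
```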
Figure 5: Results for 1:2 locking (I = 227.01 ms, N/I = 0.46). Perturbation graphs for high- (A) and low- (B) precision input and the recovery plot (C) all carry the mark of the two categories of phases. m = 49 errors, c = 25 spikes per error.
Figure 6: Results for 2:3 locking (I = 168.43 ms, N/I = 0.62). Both high-precision (A) and low-precision (B) perturbation graphs show ranges of responses to different errors. The recovery plot (C) shows a reduced, long-duration error aftereffect. m = 49 errors, c = 25 spikes per error.
Figure 7: Error responses for quasiperiodic behavior near 2:3 locking (I = 174.04 ms, N/I = 0.60). The phase return map (A) is a one-dimensional curve. High-precision (B) and low-precision (C) perturbation plots. Note the linear y-axis for recovery plot (D). m = 49 errors, c = 25 spikes per error.
There was no apparent qualitative difference (and little quantitative difference) between high- (B) and low- (C) precision error responses; if anything, there was even a slight decrease in postsynaptic spike timing variability after an error. This is especially apparent in the recovery plot (D), with a maximum range of I before an error and a minimum range (ignoring the high-frequency "oscillations" every third spike, the "ghost" of the nearby 2:3 locking) of around 95 ms at e_g ≈ 1 s. Figure 8 presents results for a high-input-rate (N/I > 1) quasiperiodicity. As can be seen in the phase return map in Figure 8A, the model fired only within a relatively restricted range of phases, approximately 59 to 63 ms after a presynaptic spike. This is an example of a windowed behavior (Segundo, Altshuler, Stiber, & Garfinkel, 1991a), in which the postsynaptic cell is able to fire only within some narrow time window. Both high and low precision (see Figures 8B and 8C, respectively) had a spike timing variability of around 3 ms before an error, with a peak perturbation of 28 ms and a perturbation range of about 8 ms after an error. The range of differences (see Figure 8D) was less than doubled by an error, with an oscillatory recovery with a period a bit over 1 s and apparent complete recovery after 4 s.
Figure 8: Error responses for windowed quasiperiodic behavior (I = 63.67 ms, N/I = 1.64). The phase return map (A) is a one-dimensional curve. High-precision (B) and low-precision (C) perturbation plots. Note the linear y-axis for recovery plot (D). m = 49 errors, c = 25 spikes per error.
3.3 Chaos. Figure 9 shows the error responses for a high-input-rate chaotic behavior. The phase return map in Figure 9A is "braided," clearly not a one-dimensional curve. The neuron could fire between 0 and 56 ms (I) after a presynaptic spike, with a postsynaptic interval range of 139 to 155 ms and a mean of 148 ms. So this response was not windowed. As with the low-input-rate quasiperiodic behavior, there was little difference between high- and low-precision responses (see Figures 9B and 9C, respectively). In both cases, postsynaptic spike timing variability decreased after the error. There was perhaps a slight increase in the recovery plot (see Figure 9D) after an error, with a subsequent decrease and increase again, but this was not dramatic.

3.4 Bifurcation Boundaries. The earlier locking examples were chosen to lie within the middle of the range of presynaptic rates that elicited the behavior. For input rates near an extreme presynaptic rate, different results were found. This is illustrated in Figure 10 for an input near the low-rate bifurcation point bounding 1:1 locking. As in the middle of the range, high-precision inputs (see Figure 10A) elicited the same response
Figure 9: Chaotic behavior (I = 56.14 ms, N/I = 1.86). The phase return map (A) is not a one-dimensional curve. High-precision (B) and low-precision (C) perturbation plots. Note the linear y-axis for recovery plot (D). m = 49 errors, c = 25 spikes per error.
to every error, with a maximum perturbation of almost 60 ms. For low-precision inputs (see Figure 10B), maximum error perturbations were between 55 ms and 60 ms. Unlike the previous 1:1 example, the peak perturbation was reached slowly, after three postsynaptic spikes, and the recovery was more gradual (see Figure 10C), with a broad secondary peak around e_g = 1.5 s and a value still about two orders of magnitude greater than that before the error at e_g = 2.5 s.
4 Discussion

This article has examined the interaction of neuron dynamics and spike timing precision in error correction in a model of the living prototype of an inhibitory synapse and postsynaptic neuron. The dynamical behaviors included locked, quasiperiodic, and chaotic ones; additional issues, such as proximity to bifurcation points and windowing, were touched on. All analysis was based on spike times and interspike intervals rather than state-space analysis. This is especially significant, in our view, because that is exactly the message that neurons transmit to each other. Thus, variations seen here
Figure 10: Results for edge of 1:1 locking (I = 133.88 ms, N/I = 0.78). The high-precision perturbation graph (A) shows identical responses to all errors; the low-precision graph (B) shows a range of responses. Both differ from those farther from the bifurcation (see Figure 4). The recovery plot (C) shows greatly extended recovery. m = 49 errors, c = 25 spikes per error.
are at least potentially significant (as opposed to changes in internal state that might have no impact on spike timing). It is important to remember that in all cases, high-precision erroneous simulations were compared to high-precision error-free ones, and low-precision erroneous simulations were compared to low-precision error-free ones. Thus, when a recovery plot was generated, values larger than those just before an error showed that at least some low-precision responses were significantly further away from "full recovery" than the high-precision ones. Among all behaviors examined, phase lockings produced the most rapid recovery, with the lower-ratio lockings that occur in broader rate domains (1:1 and 1:2) recovering fastest. One reason for this was likely the gradient of the attractor basin surrounding such behaviors. During the acceptably exponential phase of their recoveries (roughly from 0.5 s to 1 s after an error), 1:1 and 1:2 locking had similar time constants (0.20 s and 0.27 s), despite the fact that their associated presynaptic rates differed by more than a factor of two. The conclusion is that, at least for that phase of error recovery, it was not the case that each presynaptic spike simply corrected the error by some amount proportional to the perturbation. Rather, the gradients of the two basins of attraction, at least at those distances from the attractors, were similar. Based on computed recovery time constants, the worst-case low-precision recoveries had significantly longer recovery times, on the order of seconds longer. For 2:3 locking and all nonlocked responses, the "excess perturbation" of low precision versus high precision (the peak of the recovery plot) was much less than for the lower-ratio locked behaviors. In fact, for 2:3 locking, it was roughly a factor of five above the pre-error level, while for all nonlocked behaviors, it was less than a factor of two.
There were two possible mechanisms underlying this: a narrow rate domain eliciting the particular behavior and desynchronization of erroneous and error-free responses by errors. The first matter, that of domain width, was especially relevant to 2:3 locking. As can be seen by comparing Figure 6C to Figures 4C and 5C, the peak was not too different from that of the lower-ratio lockings; the difference was the degree of perturbation before the error (more than two orders of magnitude higher in the case of 2:3 locking, roughly at the level of the added low-precision jitter). This was probably caused by the jitter in the low-precision simulations bumping the state out of the locking behavior (Beierholm et al., 2001; Tiesinga & Sejnowski, 2001), causing the two low-precision simulations to remain partially desynchronized. They were, at least part of the time, away from the synchronizing effect of the periodic attractor (see the discussion below of bifurcation boundary transitions). That was a minor desynchronization effect because there was a periodic attractor that tended to pull the erroneous and error-free simulations' states back together. Such was not the case for aperiodic behaviors. In those cases (quasiperiodicity and chaos), once the neuron had been perturbed by the first error, returning to the attractor did not bring its subsequent spikes back
to the same times at which they would have occurred in the absence of the error. This was the case for both high and low precision. For the quasiperiodic cases (see Figures 7 and 8), there were preferred firing intervals (where the phase return map was closest to the diagonal in both figures, and additionally due to windowing itself for the windowed quasiperiodicity). The closer the quasiperiodicity came to being periodic (the more dominant the preferred firing intervals were), the more it would act to resynchronize the erroneous simulation with the error-free one. For Figure 7, this was not very close at all (about 25 ms); the erroneous and error-free experiments were completely desynchronized by the first error, and as a result their "perturbations" in Figures 7B and 7C before the error could be as large as the full range of possible phase differences (I/2). Additionally, the high- and low-precision cases thereafter were also decorrelated, and so the range of their perturbation differences in Figure 7D was the sum of their maxima. However, in both the high- and low-precision cases, errors arrived at approximately the same time (within ±wI of each other), and so the errors paradoxically served as a partial resynchronizing input, as indicated by the decrease in Figure 7D after the error. Windowed quasiperiodicity, however, behaved more like a weak locking. As Figure 8 shows, the neuron was constrained to fire within a 4 ms window relative to an input spike when stationary (see Figure 8A), which was much less than the 25 to 30 ms perturbation induced by an error (see Figures 8B and 8C). As a result, the cell exhibited an oscillatory recovery, with high-precision input being somewhat more effective (see Figure 8D).
Chaos, by definition of its hallmark sensitive dependence on initial conditions, was the extreme case of these desynchronizing processes, with the jitter of low precision leading those cases to diverge rapidly from high precision and the presence of errors having a similar effect between erroneous and error-free experiments. As Figure 9D shows, error aftereffects were qualitatively similar to (though perhaps more extreme than) those for quasiperiodicity in Figure 7D, with the similarly timed errors affecting the high- and low-precision erroneous experiments in similar fashion. The final results, in Figure 10, address the issue of proximity to a bifurcation boundary. This is also relevant to the effects of domain width and similarity or dissimilarity of attractor basins and attractors on error recovery. In both the high- and low-precision cases, in the absence of errors, the neuron's state was on or near a period 1 attractor at all times (the jitter was not large enough to destroy locking, as was verified by examination of the phase return map for the low-precision case). Unlike the simulation in Figure 4, however, the perturbation produced by an error did not displace the state into the basin of another 1:1 locking. Instead, the perturbation moved the state across the bifurcation boundary into the basins of qualitatively different attractors, and its motion thereafter was very different from that of recovery exclusively through period 1 attractor basins.
This raises an important point regarding the nature of the perturbation produced by an error. Because the presynaptic intervals served as a control parameter, errors were not merely perturbations of the neuron's state within its state space, Σ ⊂ R^n. Instead, they were perturbations within the system's response space, I × Σ (Thom, 1975; Abraham & Shaw, 1984). The perturbation of an ongoing behavior near a bifurcation boundary or within a narrow domain (such as in Figures 6–10) displaced the neuron's state across domains of different behaviors. After the perturbation, the reestablished sequence of spike times drew the neuron through its response space and back across (possibly multiple) bifurcations into basins of attractors qualitatively like its "home" (stationary attractor). Its evolution, however, because of these dissimilar basins, did not necessarily lead it toward the region of Σ that contained this home. As a result, the perturbation and recovery plots could show multimodal, or oscillatory, shapes. Having established that even a small amount of spike timing jitter can have observable effects on the time course of error recovery (and potentially on the activity of further postsynaptic neurons), subsequent work is progressing along a number of fronts:
- A more complete model of the bifurcation behavior of error recovery (Stiber & Holderman, 2004a) is being developed. Rather than using some property of the stationary behavior of a dynamical system, one can examine how properties of transient neural responses depend on preexisting stationary behaviors.
- The role of the degree of spike timing accuracy in error recovery is being investigated (Stiber & Holderman, 2004b). The current results show that even small amounts of jitter can have a significant effect on error recovery. The amounts used here were in the range of 1 to 4 ms or so, and thus are relevant to a range of biological systems, based on observations (for multiple spike trains) of correlations and correlation-dependent activity in retinal ganglion cells and lateral geniculate cells, and are at the low end for some cortical cells (Bair, 1999). Preliminarily, order-of-magnitude jitter increases in general produce similar effects, though, as might be expected in a nonlinear system, linear increases do not appear to produce linear changes in recovery.
- The transient response of this model is being examined in terms of its internal state variables (Stiber & Pottorf, 2004), as a preliminary to developing a simplified model that captures the error responses and is more amenable to analytic approaches. Early results suggest that spike rate adaptation is the most significant contributor to the observed error responses.
A final matter is that of the application of Shannon’s information theory to coding by nonlinear dynamical neurons. Techniques from the field of information theory (Shannon, 1948; Cover & Thomas, 1991) are currently
of great interest in addressing the question of which spike trains can be said to carry identical information (Bialek, Rieke, de Ruyter van Steveninck, & Warland, 1991; Borst & Theunissen, 1999; Fuhrmann, Segev, Markram, & Tsodyks, 2002), and thus whether the timing of individual spikes or the average firing rate is relevant. Entropy calculations can be quantified in terms of bits of information per spike, a compelling statistic. The typical experimental approach is to present a stimulus multiple times and use the "average" response (where this averaging might be accomplished in varying ways) to compute entropy or mutual information. This is then used to infer characteristics of the neural code (e.g., rate versus time codes). It is important, however, to remember that information theory reveals nothing about the semantics of a code (Shannon & Weaver, 1949). Primarily, and in conjunction with knowledge of the noise in a communication channel, it addresses issues of information capacity: how close to optimally compressed a signal is. This part of Shannon's work, the noisy channel coding theorem, is usually used in neuroscience only in terms of estimating channel noise from multiple presentations of the same stimulus. However, another application of this theorem is the creation of codes that use more bits (symbols, or, in the context of neural coding, greater spike timing precision) than the required minimum to provide for error correction in the presence of channel noise. As these results clearly show, both the quantitative and qualitative results of a presynaptic error depend critically on the ongoing evolution of a neuron's internal state. The neuron implements a mapping f : I × Σ → O × Σ, from input and state to output and new state. This suggests that the computation performed by a neuron will inevitably result in a decrease in mutual information between its input and output, according to the data processing theorem (Shannon, 1948; Cover & Thomas, 1991).
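The data processing theorem invoked here can be illustrated with a small discrete example (ours, not from the article): for any Markov chain X → Y → Z, the mutual information I(X; Z) cannot exceed I(X; Y):

```python
import numpy as np

def mutual_info(pxy):
    """Mutual information (bits) of a joint distribution given as a 2-D array."""
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return float((pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])).sum())

# X -> Y -> Z: Y is a noisy copy of X; Z further degrades Y.
flip = np.array([[0.9, 0.1], [0.1, 0.9]])   # binary symmetric channel
p_x = np.array([0.5, 0.5])
p_xy = p_x[:, None] * flip                  # joint distribution of (X, Y)
p_xz = p_xy @ flip                          # joint distribution of (X, Z)
assert mutual_info(p_xz) <= mutual_info(p_xy)   # data processing inequality
```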
In contrast, Shannon's formal definition of an encoder or decoder is a stationary mapping f : A* → B* from input sequences of symbols taken from some set A to output sequences of symbols from some (possibly different) set B. In other words, Shannon's information theory applies when the encoding of an input symbol depends only on its context within the sequence of input symbols. For a nonlinear dynamical system, on the other hand, output depends on input and state, where state is a function not merely of input but also of intrinsic dynamics. Therefore, Shannon's information theory is conceptually incompatible with neural information processing. In practical terms, this implies that information-theoretic techniques can be used only to establish lower bounds on matters such as spike timing precision.
Appendix: SAO Model Equations

The SAO model equations used here are documented more completely in Edman, Gestrelius, and Grampp (1987) and Stiber et al. (1997); following is
Table 1: Constants Used in This Simulation, in Rough Order of Appearance in Appendix Equations.

Constant    Value            Units (MKS)
C_m         7.5 × 10^−9      Farads
I_bias      −2.5 × 10^−9     A
A           1.0 × 10^−7      m²
P_Na        6.0 × 10^−6      m/s
[Na+]_o     325.0            mM
[Na+]_i     10.0             mM
T           291.15           °K
P_K         2.0 × 10^−6      m/s
[K+]_o      5.0              mM
P_L,Na      5.0 × 10^−10     m/s
P_L,K       1.8 × 10^−8      m/s
P_L,Cl      1.1 × 10^−9      m/s
[Cl−]_o     650.0            mM
[Cl−]_i     46.0             mM
J_p,Na      6.0 × 10^−6      mol/(m² s)
K_m         13.4             mM
P_syn       6.0 × 10^−8      m/s
τ_+         0.25 × 10^−3     s
τ_−         0.5 × 10^−3      s
ν_m         0.0              dimensionless
ν_h         0.0              dimensionless
ν_l         0.0              dimensionless
ν_n         0.03             dimensionless
ν_r         0.5              dimensionless
z_m         −3.1             dimensionless
z_h         4.0              dimensionless
z_l         3.5              dimensionless
z_n         −2.6             dimensionless
z_r         4.0              dimensionless
V_m         −19.0 × 10^−3    V
V_h         −35.0 × 10^−3    V
V_l         −53.0 × 10^−3    V
V_n         −18.0 × 10^−3    V
V_r         −56.0 × 10^−3    V
τ_m         0.3 × 10^−3      s
τ_h         5.0 × 10^−3      s
τ_l         1700.0 × 10^−3   s
τ_n         6.0 × 10^−3      s
τ_r         2000.0 × 10^−3   s
a brief summary. It is a permeability-based, rather than conductance-based, model with the following main equations:

dV_m/dt = −(I_Na + I_K + I_L,Na + I_L,K + I_L,Cl + I_p + I_bias + I_syn)/C_m   (A.1)

I_Na = A P_Na m²hl (V_m F²/RT) ([Na+]_o − [Na+]_i exp(FV_m/RT)) / (1 − exp(FV_m/RT))   (A.2)

I_K = A P_K n²r (V_m F²/RT) ([K+]_o − [K+]_i exp(FV_m/RT)) / (1 − exp(FV_m/RT))   (A.3)

I_L,X = A P_L,X (V_m F²/RT) ([X]_o − [X]_i exp(FV_m/RT)) / (1 − exp(FV_m/RT))   (A.4)

I_p = A F J_p,Na / (1 + K_m/[Na+]_i)³   (A.5)

I_syn = A P_syn (V_m F²/RT) ([Cl−]_o − [Cl−]_i exp(−FV_m/RT)) / (1 − exp(−FV_m/RT)) × Σ_{s_k ≤ t} (exp((s_k − t)/τ_+) − exp((s_k − t)/τ_−))   (A.6)

Values for constants are given in Table 1. Besides active and leak Na+, K+, and Cl− currents, there is a bias current (I_bias), generated in response to muscle stretch, an active pump (I_p), and a synaptic current (I_syn). There are five gating variables, m, h, l, n, and r, with first-order kinetics described by equations A.7 to A.10 (where g ∈ {m, h, l, n, r}):

dg/dt = (g_∞ − g)/τ_g   (A.7)

g_∞ = ν_g + (1 − ν_g) / (1 + exp((z_g e/kT)(V_m − V_g)))   (A.8)

1/τ_g = Q_g/τ̄_g   (A.9)

Q_g = ((1 − δ_g)/δ_g)^δ_g exp((δ_g z_g e/kT)(V_m − V_g)) + ((1 − δ_g)/δ_g)^(δ_g−1) exp(((δ_g − 1) z_g e/kT)(V_m − V_g))   (A.10)
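As a numerical sketch (ours, not from the original), the GHK-type current of equation A.2 can be evaluated directly with the constants of Table 1; the gating factor below is an arbitrary placeholder:

```python
import numpy as np

F = 96485.33    # Faraday constant (C/mol)
R = 8.314       # gas constant (J/(mol K))

def ghk_current(P, gate, Vm, Co, Ci, T=291.15, A=1.0e-7):
    """Goldman-Hodgkin-Katz current (A) through membrane area A (m^2) with
    permeability P (m/s), gating factor `gate`, and concentrations Co, Ci
    in mM (numerically equal to mol/m^3)."""
    x = F * Vm / (R * T)
    return A * P * gate * (Vm * F**2 / (R * T)) * (Co - Ci * np.exp(x)) / (1.0 - np.exp(x))

# I_Na at Vm = -50 mV, with the gating product m^2*h*l set to 0.1 as a placeholder;
# the result is negative (an inward sodium current), as expected at this potential.
I_Na = ghk_current(P=6.0e-6, gate=0.1, Vm=-50e-3, Co=325.0, Ci=10.0)
```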
References

Abraham, R., & Shaw, C. (1984). Dynamics—The geometry of behavior. Santa Cruz, CA: Aerial Press.
Allen, C., & Stevens, C. (1994). An evaluation of causes for unreliability of synaptic transmission. Proc. Natl. Acad. Sci. USA, 91, 10380–10383.
Bair, W. (1999). Spike timing in the mammalian visual system. Curr. Opin. Neurobiol., 9(4), 447–453.
Beierholm, U., Nielsen, C. D., Ryge, J., Alstrøm, P., & Kiehn, O. (2001). Characterization of reliability of spike timing in spinal interneurons during oscillating inputs. J. Neurophysiol., 86, 1858–1868.
Bialek, W., Rieke, F., de Ruyter van Steveninck, R. R., & Warland, D. (1991). Reading a neural code. Science, 252, 1854–1857.
Borst, A., & Theunissen, F. (1999). Information theory and neural coding. Nature Neurosci., 2(11), 947–957.
Bryant, H., Marcos, A. R., & Segundo, J. (1973). Correlations of neuronal spike discharges produced by monosynaptic connections and by common inputs. J. Neurophysiol., 36, 205–225.
Buño, W., Fuentes, J., & Barrio, L. (1987). Modulation of pacemaker activity by IPSP and brief length perturbations in the crayfish stretch receptor. J. Neurophysiol., 57, 819–834.
Cover, T., & Thomas, J. (1991). Elements of information theory. New York: Wiley.
Cox, D., & Isham, V. (1980). Point processes. London: Chapman and Hall.
Dan, Y., Alonso, J.-M., Usrey, W. M., & Reid, R. C. (1998). Coding of visual information by precisely correlated spikes in the lateral geniculate nucleus. Nature Neurosci., 1(6), 501–507.
Edman, A., Gestrelius, S., & Grampp, W. (1987). Analysis of gated membrane currents and mechanisms of firing control in the rapidly adapting lobster stretch receptor neurone. J. Physiol., 384, 649–669.
Ermentrout, G., & Rinzel, J. (1984). Beyond a pacemaker's entrainment limit: Phase walk-through. Am. J. Physiol., 246, R102–R106.
Fuhrmann, G., Segev, I., Markram, H., & Tsodyks, M. (2002). Coding of temporal information by activity-dependent synapses. J. Neurophysiol., 87, 140–148.
Glass, L., & Mackey, M. (1988). From clocks to chaos. Princeton, NJ: Princeton University Press.
Hindmarsh, A. (1983). ODEPACK: A systematized collection of ODE solvers. In R. Stepleman (Ed.), Scientific computing (pp. 55–64). Amsterdam: North-Holland.
Katz, B. (1966). Nerve, muscle, and synapse. New York: McGraw-Hill.
Kohn, A., da Rocha, A., & Segundo, J. (1981). Presynaptic irregularity and pacemaker inhibition. Biol. Cybern., 41, 5–18.
Kohn, A., & Segundo, J. (1983). Neuromime and computer simulations of synaptic interactions between pacemakers: Mathematical expansions of existing models. J. Theoret. Neurobiol., 2, 101–125.
Lestienne, R. (2001). Spike timing, synchronization and information processing on the sensory side of the central nervous system. Prog. Neurobiol., 65, 545–591.
Maršálek, P., Koch, C., & Maunsell, J. (1997). On the relationship between synaptic input and spike output jitter in individual neurons. Proc. Natl. Acad. Sci. USA, 94, 735–740.
Rapp, P., Zimmerman, I., Albano, A., deGuzman, G., & Greenbaun, N. (1985). Dynamics of spontaneous neural activity in the simian motor cortex: The dimension of chaotic neurons. Phys. Lett. A, 110, 335–338.
Ruelle, D. (1989). Elements of differentiable dynamics and bifurcation theory. Orlando, FL: Academic Press.
Segundo, J. (1979). Pacemaker synaptic interactions: Modeled locking and paradoxical features. Biol. Cybern., 35, 55–62.
1600
M. Stiber
Segundo, J. P., Altshuler, E., Stiber, M., & Garfinkel, A. (1991a). Periodic inhibition of living pacemaker neurons: I. Locked, intermittent, messy, and hopping behaviors. Int. J. Bifurcation and Chaos, 1(3) 549–581. Segundo, J. P., Altshuler, E., Stiber, M., & Garfinkel, A. (1991b). Periodic inhibition of living pacemaker neurons: II. Influences of driver rates and transients and of non-driven post-synaptic rates. Int. J. Bifurcation and Chaos, 1(4), 873– 890. Segundo, J., Moore, G., Stensaas, L., & Bullock, T. (1963). Sensitivity of neurones in aplysia to temporal pattern of arriving impulses. J. Exp. Biol., 40, 643–667. Segundo, J., Perkel, D., Wyman, H., Hegstad, H., & Moore, G. (1968). Input-output relations in computer-simulated nerve cells: Influence of the statistical properties, strength, number, and inter-dependence of excitatory pre-synaptic terminals. Kybernetic, 4, 157–171. Segundo, J., Stiber, M., Altshuler, E., & Vibert, J.-F. (1994). Transients in the inhibitory driving of neurons and their post-synaptic consequences. Neurosci., 62(2), 459– 480. Segundo, J., Stiber, M., Vibert, J.-F., & Hanneton, S. (1995). Periodically modulated inhibition and its post-synaptic consequences. II. Influence of pre-synaptic slope, depth, range, noise and of post-synaptic natural discharges. Neurosci., 68(3), 693– 719. Segundo, J., Tolkunov, B., & Wolfe, G. (1976). Relation between trains of action potentials across an inhibitory synapse: Influence of presynaptic irregularity. Biol. Cybern., 24, 169–179. Segundo, J., Vibert, J.-F., Pakdaman, K., Stiber, M., & Diez Mart`inez, O. (1994). Noise and the neurosciences: A long history, a recent revival and some theory. In K. Pribram (Ed.), Origins: Brain and self organization. Mahwah, NJ: Erlbaum. Segundo, J., Vibert, J.-F., & Stiber, M. (1998). Periodically modulated inhibition of living pacemaker neurons. III. The heterogeneity of the postsynaptic spike trains and how control parameters affect it. Neurosci., 87(1), 15–47. 
Segundo, J., Vibert, J.-F., Stiber, M., & Hanneton, S. (1993). Synaptic coding of periodically modulated spike trains. In Proc. ICNN (pp. 58–63). Parsippany Hills, NJ: IEEE. Segundo, J., Vibert, J.-F., Stiber, M., & Hanneton, S. (1995). Periodically modulated inhibition and its post-synaptic consequences. I. General features. Influences of modulation frequency. Neurosci., 68(3), 657–692. Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27, 379–423, 623–656. Shannon, C. E., & Weaver, W. (1949). The mathematical theory of communication. Urbana: University of Illinois Press. Stiber, M. (2003). Non-information-maximizing neural coding. In Proc. IJCNN’03. Parsippany Hills, NJ: IEEE. Stiber, M., & Holderman, T. (2004a). Global behavior of neural error correction. In Proc. IJCNN. Parsippany Hills, NJ: IEEE. Stiber, M., & Holderman, T. (2004b). Stochastic-resonance-like effects in neural error correction. Manuscript in preparation, University of Washington, Bothell. Stiber, M., Ieong, R., & Segundo, J. (1997). Responses to transients in living and simulated neurons. IEEE Trans. Neural Networks, 8(6), 1379–1385.
Spike Timing Precision and Neural Error Correction
1601
Stiber, M., Pakdaman, K., Vibert, J.-F., Boussard, E., Segundo, J., Nomura, T., Sato, S., & Doi, S. (1996). Complex responses of living pacemaker neurons to pacemaker inhibition: A comparison of dynamical models. Biosystems, 40, 177–188. Stiber, M., & Pottorf, M. (2004). Response space construction for neural error correction. In Proc. IJCNN. Parsippany Hills, NJ: IEEE. Stiber, M., & Segundo, J. P. (1993). Dynamics of synaptic transfer in living and simulated neurons. In Proc. ICNN (pp. 75–80). Parsippany Hills, NJ: IEEE. Thom, R. (1975). Structural stability and morphogenesis. Reading, MA: Benjamin. Tiesinga, P. (2002). Precision and reliability of periodically and quasiperiodically driven integrate-and-fire neurons. Phys. Rev. E, 65(4), 041913. Tiesinga, P., & Sejnowski, T. J. (2001). Precision of pulse-coupled networks of integrate-and-fire neurons. Network: Comput. Neural Syst., 12, 215–233. Winfree, A. (1980). The geometry of biological time. New York: Springer-Verlag.
Received October 10, 2003; accepted January 5, 2005.
LETTER
Communicated by Tzyy-Ping Jung
A New Approach to Spatial Covariance Modeling of Functional Brain Imaging Data: Ordinal Trend Analysis Christian Habeck
[email protected] Cognitive Neuroscience Division, Taub Institute, and Department of Neurology, College of Physicians and Surgeons, Columbia University, New York, NY 10032, U.S.A.
John W. Krakauer
[email protected] Department of Neurology, College of Physicians and Surgeons, Columbia University, New York, NY 10032, U.S.A.
Claude Ghez
[email protected] Department of Neurology, College of Physicians and Surgeons, and Center for Neurobiology and Behavior, Columbia University, New York, NY 10032, U.S.A.
Harold A. Sackeim
[email protected] Departments of Neurology, Psychiatry, and Radiology, College of Physicians and Surgeons, Columbia University, New York, NY 10032, and Department of Biological Psychiatry, New York State Psychiatric Institute, New York, NY 10032, U.S.A.
David Eidelberg
[email protected] Center for Neurosciences, Institute for Medical Research, North Shore–Long Island Jewish Health System, Manhasset, NY 11030, and Department of Neurology, School of Medicine, New York University, New York, NY 10016, U.S.A.
Yaakov Stern
[email protected] Cognitive Neuroscience Division, Taub Institute, and Departments of Neurology and Psychiatry, College of Physicians and Surgeons, Columbia University, New York, NY 10032, U.S.A., and Department of Biological Psychiatry, New York State Psychiatric Institute, New York, NY 10032, U.S.A.
Neural Computation 17, 1602–1645 (2005)
© 2005 Massachusetts Institute of Technology
James R. Moeller
[email protected] Cognitive Neuroscience Division, Taub Institute, and Department of Neurology, College of Physicians and Surgeons, Columbia University, New York, NY 10032, U.S.A., and Department of Biological Psychiatry, New York State Psychiatric Institute, New York, NY 10032, U.S.A.
In neuroimaging studies of human cognitive abilities, brain activation patterns that include regions that are strongly interactive in response to experimental task demands are of particular interest. Among the existing network analyses, partial least squares (PLS; McIntosh, 1999; McIntosh, Bookstein, Haxby, & Grady, 1996) has been highly successful, particularly in identifying group differences in regional functional connectivity, including differences as diverse as those associated with states of awareness and normal aging. However, we address the need for a within-group model that identifies patterns of regional functional connectivity that exhibit sustained activity across graduated changes in task parameters. For example, predictions of sustained connectivity are commonplace in studies of cognition that involve a series of tasks over which task difficulty increases (Baddeley, 2003). We designed ordinal trend analysis (OrT) to identify activation patterns that increase monotonically in their expression as the experimental task parameter increases, while the correlative relationships between brain regions remain constant. Of specific interest are patterns that express positive ordinal trends on a subject-by-subject basis. A unique feature of OrT is that it recovers information about functional connectivity based solely on experimental design variables. In particular, OrT requires neither a quantitative model of the uncertain relationship between functional brain circuitry and subject variables (e.g., task performance and IQ) nor partial information about the regions that are functionally connected. In this letter, we first provide a step-by-step recipe of the computations performed in the new OrT analysis, including a description of the inferential statistical methods applied. Second, we describe applications of OrT to an event-related fMRI study of verbal working memory and an H₂¹⁵O-PET study of visuomotor learning.
In sum, OrT has potential applications not only to studies of young adults and their cognitive abilities, but also to studies of normal aging and neurological and psychiatric disease.

1 Introduction

Perhaps it is not an oversimplification to say that in neuroimaging studies of human cognition, it is rare to capture glimpses of the regional functional connectivity of the underlying neural circuitry, particularly in studies involving H₂¹⁵O PET and event-related fMRI (Friston, Frith, Liddle, &
Frackowiak, 1993; McIntosh, Bookstein, Haxby, & Grady, 1996; McIntosh & Gonzalez-Lima, 1994). In cognitive neuroscience, the term functional connectivity refers to the distributed nature of human information processing that occurs over a scale of centimeters (DeFelipe et al., 2002; Felleman & Van Essen, 1991; McIntosh & Gonzalez-Lima, 1994; Mellet et al., 2000), where the term was co-opted from neurophysiologists (Gerstein, Perkel, & Subramanian, 1978) who originally used it to describe the cooperative firing between functionally related neurons that were grouped together on a submillimeter scale. Although functional connectivity was intended and is often evoked as a guiding principle for understanding brain function (Friston et al., 1993; Horwitz, 1991; McIntosh, 1999), it bears a resemblance to the elusive qualities of dark matter in the astrophysicist’s present-day universe (Abbott, 2002; Ostriker & Steinhardt, 2003). That is, interregional functional connectivity and dark matter are known to be ubiquitous in their respective sciences, but sightings are rare. What is the explanation for this odd circumstance as it pertains to cognitive neuroscience? Based on current and past neuroimaging studies of ordinary human abilities, we do not know whether the apparent uncertainty we confront in mapping regional functional connectivity reflects an inherent property of human information processing or whether it is a property of the experimental designs and statistical models we apply. Concretely, there is no assurance that latent patterns of functional connectivity will be uncovered using the conventional voxel-by-voxel modeling of expected experimental effects (Friston, Frith, Liddle, & Frackowiak, 1991; Friston et al., 1996; Worsley, Poline, Friston, & Evans, 1997). Indeed, the principal authors of the voxel-wise univariate and multivariate linear models, Friston, Worsley, and colleagues, have been reticent to suggest otherwise (Worsley et al., 1997).
Further, McIntosh (1999) in his discourse on the need for spatial covariance modeling and the likely empirical evidence for functional connectivity to be derived therefrom, stops just short of asserting that latent patterns of functional connectivity frequently will be missed by voxel-by-voxel modeling. On the other hand, with the network analyses offered by McIntosh— structural equation modeling (McIntosh & Gonzalez-Lima, 1994) and partial least-squares analysis (McIntosh et al., 1996)—there is always a concern that some misattributions of connectivity will inevitably occur. Indeed, it is somewhat surprising that McIntosh and colleagues’ strongest evidence for functional connectivity has come from demonstrations that connectivity is substantially altered by graduated changes in task parameters (McIntosh, 1999), subject mind-set or state of awareness (McIntosh, Rajah, & Lobaugh, 1999), and subtle changes in neurophysiology that occur with normal aging (Cabeza, Anderson, Houle, Mangels, & Nyberg, 2000; Cabeza, McIntosh, Tulving, Nyberg, & Grady, 1997; Grady, McIntosh, & Craik, 2003). The question is whether this level of apparent volatility in functional connectivity is an inherent property of ordinary human abilities (Fernandez-Duque, Baird, & Posner, 2000a, 2000b), or a by-product of the
limitations in our experimental designs and statistical modeling methods. To address this issue, we sought to devise a different type of spatial covariance modeling that would identify sustained functional connectivity across graduated changes in task parameters. Our intention has been to extend the definition of functional connectivity as it was originally applied to individual task conditions so as to define it for experiments that consist of parametric series of task conditions. The notion of sustained functional connectivity we consider here is that in which the influence of parametric changes between tasks is exchangeable with the influence of endogenous variables that induce subject differences within a task. Specifically, if—within a task—the effect of changing the level of endogenous variables is to scale up or down the activity of the functionally connected brain regions, then the effect of parametric changes between tasks is likewise to scale up or down the activity in these functionally connected brain regions, albeit on a subject-by-subject basis. In every task and subject, the scaling of activity in the functionally connected regions is therefore determined jointly by the experimental and endogenous variables. Based on this exchangeability of experimental and subject variables, we designed a spatial covariance model that can identify regional brain activations that in aggregate (i.e., as represented by a pattern of regional weights) express a positive ordinal trend with incremental changes in a task parameter; that is, these brain activations increase monotonically as the experimental task parameter increases, while the correlative relationships between brain regions remain constant. The activation patterns of interest are those that express positive ordinal trends on a subject-by-subject basis.
Indeed, a prediction of ordinal trends is commonplace in studies of cognition that involve a series of experimental conditions over which task difficulty increases. Representative examples in the study of working memory are the N-back (Braver et al., 1997) and Sternberg (1966, 1969) tasks. Another example is the well-known auditory “oddball task” (Naatanen, Tervaniemi, Sussman, Paavilainen, & Winkler, 2001), where the fraction of standard to deviant tones is varied in a parametric manner. Many cognitive theories presume that there is sustained functional connectivity in the sense described above. For example, in his theory of verbal memory, Baddeley (1988, 2003) describes a coalition of component processes, including articulatory rehearsal, phonological store, and central executive, that is common to all individuals, and he predicts that the effect of increasing memory load is to incrementally increase activity in brain regions associated with these processes. The spatial covariance analysis that we devised, which we have named ordinal trends (OrT) analysis, provides an explicit test of the assumption of sustained functional connectivity. Of course, OrT is applicable not only to studies of task difficulty, but also to the broader spectrum of parametrically designed studies of human cognition.
The strategy for using the OrT analysis to test the prediction of sustained functional connectivity is briefly the following. In a data set, OrT assigns reduced salience, on the one hand, to latent activation patterns that express mean directional changes between tasks that are different from the predicted ordinal trend.¹ On the other hand, among the latent patterns that express mean trends in the predicted direction, OrT assigns a different level of salience depending on the type of task × subject interaction that a pattern expresses. Salience is reduced in activation patterns where the direction of the trend expressed is different for different subjects. By contrast, salience is enhanced in activation patterns where the direction of the trend is the same in all subjects. Among the latter type of patterns, the salience assigned to a pattern is directly related to the proportion of the total (voxel × task × subject) variance in the original data set that is accounted for by the pattern and its expression. Among the latent patterns that express ordinal trends, OrT provides an estimate of the pattern with the highest salience, quantifies the expression of this pattern for each subject and task condition, and quantifies the statistical significance of the pattern expression and the reliability of the pattern’s voxel weights. In this article, we first discuss the feasibility of the OrT computational approach and provide a step-by-step recipe of the computations performed in an OrT analysis, including a description of the inferential statistical methods applied. Second, we describe applications of OrT to actual event-related fMRI and PET data sets. We report the results from OrT analyses of two studies of ordinary human abilities: (1) an event-related fMRI study of a verbal working memory task involving a delayed match-to-sample experimental design and (2) an H₂¹⁵O-PET study of visuomotor learning.
We show that the OrT analysis takes its place alongside PLS network analysis as being only the second spatial covariance model that is specifically designed to recover latent aspects of functional connectivity in neuroimaging studies that involve parametric experimental designs. In short, OrT serves as an omnibus test of sustained functional connectivity that is performed across multiple task conditions and all brain regions (voxels).

¹ In our description of the OrT strategy for assigning salience to regional covariance patterns, a latent pattern is any covariance pattern that is contained in the vector space spanned by the functional images contained in a data set. Moreover, the term latent pattern is not used in reference to a particular canonical representation of the vector space.

2 Feasibility of the OrT Computational Approach

From a computational perspective, the ideal approach to identifying latent patterns that express ordinal trends would be simply to multiply the original neuroimaging data matrix by a matrix that would maximally enhance the salience of the target patterns, where the latter matrix is based on the parametric design of the experiment. This approach is similar to
Ordinal Trend Canonical Variates Analysis
1607
the current canonical variates analyses (CVA) that have been designed for analyzing neuroimaging data (McIntosh et al., 1996; Worsley et al., 1997). In current CVAs, the task-subject × voxel matrix Y (the neuroimage data set) is multiplied by a task-subject × design matrix X consisting of predictor variables, after which the XᵀY product is submitted to singular value decomposition (SVD). Algebraically, the effect of matrix multiplication is always to differentially alter the voxel × task × subject variance accounted for by different latent patterns. In particular, matrix multiplication in the OrT analysis would be designed to selectively enhance the voxel × task × subject variance of patterns that expressed ordinal trends and, among these latter patterns, to produce the greatest enhancement in the pattern that expressed the largest voxel × task × subject variance in the original data set. On this basis, the application of principal component analysis (PCA), or SVD, to the transformed data set could be expected to produce major principal components that provided a good approximation to one or more patterns that express ordinal trends. In these respects, our approach builds on the current model-guided PCA methods designed for analyzing neuroimaging data (Petersson, Nichols, Poline, & Holmes, 1999). Spelling out what is required by the OrT analysis made it less certain, however, that there actually is a design matrix that would guarantee the identification of patterns that expressed ordinal trends.
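The generic CVA step just described can be sketched in a few lines (our own NumPy illustration, not code from the authors; matrix names follow the text): the cross-product XᵀY is formed and submitted to SVD, and the leading right singular vectors serve as candidate voxel-weight patterns.

```python
import numpy as np

def cva_patterns(X, Y, k=2):
    # Form the cross-product matrix X^T Y of the design matrix X and the
    # task-subject x voxel data matrix Y, then submit it to SVD; the rows
    # of Vt are candidate spatial patterns (voxel weights).
    U, s, Vt = np.linalg.svd(X.T @ Y, full_matrices=False)
    return Vt[:k], s[:k]

# Toy sizes: 20 task-subject rows, 5 voxels, 3 predictor columns.
rng = np.random.default_rng(0)
Y = rng.normal(size=(20, 5))   # task-subject x voxel data matrix
X = rng.normal(size=(20, 3))   # task-subject x design matrix
patterns, singvals = cva_patterns(X, Y, k=2)
```

The returned patterns are orthonormal and ordered by singular value, that is, by the amount of design-related covariance they account for.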
A new design matrix had to be invented that would differentiate among three categories of latent patterns, that is, would differentially alter the voxel × task × subject variance of three types of latent patterns: first, distinguish patterns that expressed mean trends in the predicted direction from patterns that expressed mean directional changes between tasks that are different from the predicted trend; and second, discriminate among different types of patterns within the first category. In the first category, the design matrix has to distinguish patterns in which the direction of the trend expressed is the same in all subjects from patterns that express task × subject interactions in which the trend expressed is different for different subjects. In addition, the design matrix has to preserve the relative size of the voxel × task × subject variance accounted for by latent patterns that express ordinal trends. On the one hand, our computational approach is a form of CVA. On the other hand, the previous work on CVA—as it concerns the analysis of neuroimaging data (McIntosh et al., 1996; Worsley et al., 1997)—is not extendable in a straightforward manner to provide a matrix solution that satisfies the above OrT requirements. Indeed, in the extant CVA approaches, the presumption is that a modest number of predictor variables (i.e., a low-dimensional design matrix) will provide an adequate account of the neuroimaging data or, conversely, that a large number of predictor variables would likely produce mixtures of different model effects that are not interpretable. We recognized, however, that no low-dimensional design matrix could correctly assign the appropriate salience to the three pattern categories that
must be discriminated in the OrT analysis. In particular, a low-dimensional design matrix would not provide the means to differentiate among patterns that expressed the predicted mean trend but differed in the type of task × subject interactions. A high-dimensional matrix is required to differentiate between patterns in which the direction of the trend expressed is the same in all subjects and patterns that express task × subject interactions in which the trend expressed is different for different subjects. Our answer to the question of feasibility was a design matrix of dimension TN × (T − 1)N, where T is the number of task conditions in the parametric series and N is the number of subjects. For the particular series of task conditions E_i, i = 1, . . . , T, the individual columns of the design matrix assign unit values to one or another pair of consecutive task conditions and zero to all other task conditions. For each pair of consecutive task conditions, these assignments are made on a subject-by-subject basis. When T = 5, for example, the OrT design matrix (Q) is a 5N × 4N matrix of the form
Q = \begin{pmatrix}
I_N & 0   & 0   & 0   \\
I_N & I_N & 0   & 0   \\
0   & I_N & I_N & 0   \\
0   & 0   & I_N & I_N \\
0   & 0   & 0   & I_N
\end{pmatrix},
where I_N denotes the N × N identity matrix. In other words, tasks are ordered sequentially along the row dimension, and subjects are repeated within each task. This article includes the tests we have performed to evaluate the utility of this new design matrix. We have applied the OrT analysis not only to real event-related fMRI and H₂¹⁵O-PET data sets, but also to data sets simulated using Monte Carlo methods. In the Monte Carlo simulations, the performance of the OrT design matrix was evaluated in terms of the degree to which the PCA of the transformed data set outperformed the PCA of the untransformed data set (see the appendix). The primary aim of the Monte Carlo computations was to verify that the target patterns that expressed ordinal trends in the simulated data sets were better estimated by a fixed number of the major principal components of the transformed data than by the same number of principal components of the untransformed data. A detailed discussion of these Monte Carlo tests of feasibility is contained in the appendix. In addition, we compared the performance of the OrT method to the performance of the conventional, low-dimensional CVA models that are ordinarily applied to event-related fMRI and H₂¹⁵O-PET data sets. In brief, it may come as a surprise that the low-dimensional CVA models actually performed worse—in recovering target patterns that expressed ordinal trends—than PCA applied to the untransformed data set (Figure 8 in the
appendix). The differences in performance are substantial. We also found that even high-dimensional design matrices that do not contain the particular features of the OrT matrix, such as the Helmert matrix (Venables & Ripley, 1999), perform nearly as poorly as the conventional low-dimensional design matrices (Figure 8 in the appendix). Finally, we have addressed the issue of the statistical specificity of the OrT analysis by demonstrating that the OrT design matrix achieves low type I error rates (false alarm rates) in data sets in which the task × subject neuroimaging data are generated using the statistics of random gaussian fields. In the next section, before presenting the computational recipe for the OrT analysis and the information about its statistical specificity, we provide one example of a Monte Carlo simulation in order to portray with simple graphics (see Figure 1) what it means for the PCA of the OrT-transformed data set to outperform the PCA of the untransformed data set in identifying latent patterns with ordinal trends.

2.1 First Illustration of the OrT Analysis.

Consider a miniature data set that consists of two task conditions: a control condition B and an experimental challenge condition E1, where, for purposes of visual display, each image was limited to just two voxels. These diminutive images for 100 subjects are displayed in a two-dimensional Cartesian coordinate system in Figure 1. This visual display corresponds to the formal algebraic representation of the data set as a task-and-subject × voxel matrix (Y), consisting of 200 rows (two tasks times 100 subjects) and two columns (two voxels). Moreover, different data points in Figure 1 represent two-voxel images for different subjects and task conditions, where each data point corresponds to a single row in the data matrix Y. The remainder of the Monte Carlo simulations, described in the appendix, involve more realistic data sets and are described using matrix notation only.
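Before working through the illustration, the block structure of the OrT design matrix defined above can be generated for arbitrary T and N; the following is our own NumPy sketch (the function name is ours), with each column block pairing a subject's scans from two consecutive task conditions.

```python
import numpy as np

def ort_design_matrix(T, N):
    """Build the OrT design matrix Q of shape (T*N, (T-1)*N).

    Column block j (j = 0, ..., T-2) carries the N x N identity on the
    two consecutive task blocks j and j+1 and zeros elsewhere, so each
    column assigns unit values to one subject in a pair of consecutive
    task conditions.
    """
    Q = np.zeros((T * N, (T - 1) * N))
    for j in range(T - 1):
        Q[j * N:(j + 1) * N, j * N:(j + 1) * N] = np.eye(N)        # task j
        Q[(j + 1) * N:(j + 2) * N, j * N:(j + 1) * N] = np.eye(N)  # task j+1
    return Q

Q = ort_design_matrix(5, 3)  # T = 5 tasks, N = 3 subjects: a 15 x 12 matrix
```

For T = 5 this reproduces the five-by-four block pattern displayed above.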
In our miniature data set, all 200 images are actually admixtures of just two latent patterns of functional connectivity: a latent pattern that expresses a positive trend for every subject and a second latent pattern that does not. In algebraic notation, the OrT pattern is z1 = [1; 1], and the second, non-OrT pattern is z2 = [1; −1], where the boldface z1 and z2 variables represent 2 × 1 column vectors. In z1, the two positive voxel weights indicate a form of functional connectivity in which z1’s contributions to the overall activity of voxels 1 and 2 are positive. In contrast, in z2, the voxel weights are of the opposite sign, indicating a form of functional connectivity in which z2’s contribution to the overall activity of voxel 1 is positive, but its contribution to voxel 2 is negative. In this sense, z1 and z2 represent orthogonal patterns of functional connectivity, which is represented algebraically by a zero inner product, z1 · z2 = 0. In this simulated data set, the group mean expression of each pattern is configured to reveal a positive change from B to E1. Indeed, the subject
expressions of the two patterns, z1 and z2, are configured to have the same means and the same variances in each of the two task conditions, while at the same time, z1 expresses a positive trend for every subject, whereas z2 expresses a positive trend for 55 of the subjects and a negative trend for the remainder. These several features of latent activation patterns z1 and z2 are displayed in Figure 1A, where the total voxel activity in a data set is plotted for each task and subject (open circles). Also plotted are the subject levels of voxel activity in the individual patterns of functional connectivity. For z1, the circles are linearly aligned with positive slope along the line [1; 1], and for z2, the circles are linearly aligned with negative slope along [1; −1]. Different colors are used to distinguish task B (green) from task E1 (red) in depicting the activity in both whole images and the individual patterns of functional connectivity. The details of subject pattern expression are as follows. For z1, the subject expression values b for task B were sampled from the uniform distribution U(0,1). The vector b is an N × 1 column vector. The expression values for E1 are denoted e1 and were generated as b + ε, where ε is also an N × 1 random variable sampled from U(0,1). This results in b and e1 having means 1/2 and 1 and variances 1/12 and 1/6, respectively. For z2, b and e1 are similarly constructed; however, the subject labels for both b and e1 have been randomly permuted, resulting in different subjects exhibiting opposite trends. In other words, the collection of images in the task-and-subject × region data matrix can be represented as an algebraic sum of the individual contributions of the latent patterns z1 and z2:

Y = \underbrace{\begin{pmatrix} b \\ e_1 \end{pmatrix} \begin{pmatrix} 1 & 1 \end{pmatrix}}_{\text{Target}} + \underbrace{\begin{pmatrix} b \\ e_1 \end{pmatrix} \begin{pmatrix} 1 & -1 \end{pmatrix}}_{\text{Non-target}},

where in the second (non-target) term the entries of b and e1 are understood to be subject-permuted.
Indeed, it is clear from this formula that the subject expression of a latent pattern is simply the projection of the data set onto the pattern. For example,
Figure 1: Miniature data set Y involving two experimental tasks B and E1 and 100 subjects, in which each image, containing just two voxels, is an admixture of a targeted activation pattern and an orthogonal, nontargeted activation pattern. (A, right) Task × subject voxel activity in Y. Activity values due to the individual patterns are displayed as well as the aggregate data. Different tasks are indicated by different-colored open circles: green indicates activity pertaining to task B, red to task E1. The blue line indicates the major source of variance, which is also the mean difference image between conditions found with mean contrast analysis. It is the vector average of the targeted and nontargeted activation patterns. (A, left) Task-activity curves for the subject expression of the targeted and nontargeted patterns, shown for a subset of 20 subjects to avoid clutter. The ordinal trend feature of strictly monotonic curves with a high intertask correlation can be discerned for the
[Figure 1 comprises two scatter plots of the two-voxel images, with axes voxel 1 and voxel 2: panel A shows the untransformed data, with target-expression and non-target-expression task-activity curves for tasks B and E1; panel B shows the OrT-transformed data, with the main source of variance drawn as a line.]
Figure 1: (cont.) targeted (upper-left figure) but not the nontargeted activation pattern (lower-left figure). (B) Subject-voxel activity in the data transformed by the OrT matrix Q(QᵀQ)^{−1/2} (open blue circles). Application of the OrT matrix results in the same amount of variance of the open circles along the direction of the nontargeted component, but a much enlarged amount of variance (= var(b) + var(e1) + 2cov(b, e1)) along the direction of the targeted pattern because of the high intertask correlation of subject expression. The direction of the major source of variance (= first principal component) is drawn as a blue line.
C. Habeck et al.
for z1, the pattern expression is the inner product of the z1 vector with the Y matrix:

\[
\underbrace{\begin{pmatrix} b \\ e_1 \end{pmatrix}}_{\text{Target}} = Y \cdot z_1 .
\]
A comparison between Figures 1A and 1B illustrates the considerable impact that the OrT matrix has on the voxel × task × subject variances of z1 and z2. In particular, the comparison reveals that PCA of the OrT-transformed data set outperforms PCA of the untransformed data set in identifying as the first principal component the latent pattern z1, for which all subjects exhibit a positive trend. Figure 1A shows that the patterns z1 and z2 have equal salience and that the first principal component of the untransformed data set is not latent pattern z1 but, rather, the vector average of patterns z1 and z2 along the direction [1; 0], which is indicated by a blue line. By contrast, Figure 1B shows that after OrT matrix multiplication, the patterns z1 and z2 no longer have equal salience, and the first principal component is now a good approximation of z1. We have used extensive Monte Carlo simulations to generalize these results and establish a benchmark of the accuracy of OrT pattern estimation. The simulated data sets and OrT performance are described in the appendix. Monte Carlo simulations were written in MATLAB 6.0 (Mathworks, Natick, MA) and performed on Linux workstations. Interested readers who want to learn more about the derivation and utility of the OrT design matrix can find all pertinent information in the appendix.

2.2 Algorithm of OrT/CVA. We now present the five computational steps of OrT/CVA, assuming that the neuroimaging data have undergone sufficient preprocessing, resulting in one scan per subject per task. (The preprocessing steps will be explained in detail in the sections showing applications to real-world PET and fMRI data sets.) We assume three task conditions, B, E1, E2, but our recipe can be generalized to any number of task conditions (two or greater).

Step 1: Application of a projection operator, P, by multiplication from the right according to YP, to eliminate strictly task-independent effects.
P is constructed from the set of 2N eigenimages of the Helmert-transformed data matrix H′Y. The eigendecomposition can be written as

\[
Y' H H' Y W = W \Lambda ,
\]

with the Helmert matrix

\[
H = \begin{pmatrix} -I_N & I_N \\ I_N & I_N \\ 0 & -2 I_N \end{pmatrix}.
\]
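Step 1 can be sketched as follows (illustrative Python with toy dimensions, random data, and an assumed eigenvalue threshold; this is not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(1)
N, V = 8, 40   # subjects, voxels (toy sizes; V >= 3N, so Y has rank 3N)
Y = rng.normal(size=(3 * N, V))   # rows: N scans for B, then E1, then E2

I, Z = np.eye(N), np.zeros((N, N))
# Helmert contrast matrix (3N x 2N): per subject, one column contrasts
# B against E1, the other the mean of B and E1 against E2.
H = np.block([[-I, I],
              [ I, I],
              [ Z, -2 * I]])

def helmert_projector(Y):
    # P = WW', with W the eigenimages of H'Y (nonzero eigenvalues only).
    lam, W = np.linalg.eigh(Y.T @ H @ H.T @ Y)
    W = W[:, lam > 1e-9 * max(lam.max(), 1.0)]
    return W @ W.T

P = helmert_projector(Y)
assert np.linalg.matrix_rank(Y @ P) == 2 * N   # N patterns removed

# A strictly task-independent data set (identical scans for B, E1, E2)
# satisfies H'Y = 0 and is annihilated completely.
s = rng.normal(size=(N, V))
Y_const = np.vstack([s, s, s])
assert np.allclose(Y_const @ helmert_projector(Y_const), 0.0)
```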
Ordinal Trend Canonical Variates Analysis
1613
The matrix W contains the 2N eigenimages as column vectors, and Λ is a 2N × 2N diagonal matrix containing the nonzero eigenvalues. P is the projection matrix onto the span of the Helmert eigenimages; that is, P is the matrix WW′. The modified data matrix YP has the same dimensions as the original data matrix Y. However, YP will contain N fewer activation patterns and has rank 2N (i.e., it is of lower rank than Y, which has rank 3N).

Step 2: Application of the OrT design matrix, Q, by multiplication from the left according to [Q(Q′Q)^{−1/2}]′YP, to increase the salience of ordinal trend effects.

With N subjects and 3 task conditions, the OrT design matrix consists of 2N predictor variables, where a pair of predictor variables is constructed for each subject. With the preselected ordering of tasks B, E1, and E2, the predictor variables for the jth subject are (1) a 3N × 1 vector in which the entries are zero except for the jth scans of the B and E1 task conditions, which both contain unit values, and (2) a 3N × 1 vector in which the entries are zero except for the jth scans of the E1 and E2 task conditions, which contain unit values. The OrT design matrix can thus be written as
\[
Q = \begin{pmatrix} I_N & 0 \\ I_N & I_N \\ 0 & I_N \end{pmatrix}.
\]
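The design matrix and its orthonormalization can be sketched as follows (toy Python; the symmetric inverse square root is one standard way to compute Q(Q′Q)^{−1/2}, assumed here rather than taken from the authors' code):

```python
import numpy as np

N = 8                              # subjects (toy size)
I, Z = np.eye(N), np.zeros((N, N))
# OrT design matrix (3N x 2N): per subject, one indicator for the
# (B, E1) scan pair and one for the (E1, E2) scan pair.
Q = np.block([[I, Z],
              [I, I],
              [Z, I]])

# Orthonormalized version Q (Q'Q)^(-1/2), via the symmetric inverse
# square root, so that every subject carries equal weight.
lam, U = np.linalg.eigh(Q.T @ Q)
Qn = Q @ U @ np.diag(lam ** -0.5) @ U.T

assert np.allclose(Qn.T @ Qn, np.eye(2 * N))   # columns are orthonormal
```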
The OrT design matrix applied to the imaging data YP is the orthonormal version Q(Q′Q)^{−1/2} of the above matrix. This normalization guarantees that all predictor variables, and hence all subjects, are equally influential in assigning salience to latent patterns.

Step 3: Singular value decomposition is applied to the mean-centered [Q(Q′Q)^{−1/2}]′YP matrix. This is equivalent to applying PCA, that is,

\[
P' Y' Q (Q'Q)^{-1} Q' Y P \, V = V \Lambda .
\]
V contains 2N orthogonal eigenimages as column vectors, and Λ is a 2N × 2N diagonal matrix of the eigenvalues.

Step 4: The first K eigenimages are tested for the presence of an ordinal trend.

For the first K singular images, a 2N × K predictor array is calculated according to [E1 − B; E1 + B − 2E2]. B is obtained by projecting the raw data pertaining to condition B onto all K images: B = Y(1:N, :) V(:, 1:K). Likewise, for E1 and E2, we have E1 = Y(N+1:2N, :) V(:, 1:K) and E2 = Y(2N+1:3N, :) V(:, 1:K). We then conduct a linear regression to best predict the dependent variable of the regression, which is the 2N column vector [1; −1], with the 2N × K predictor array described above:
\[
\begin{pmatrix} 1 \\ -1 \end{pmatrix} \approx \begin{pmatrix} E_1 - B \\ E_1 + B - 2 E_2 \end{pmatrix} \beta .
\]
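Steps 3 and 4 can be sketched together (toy Python with hypothetical sizes; for brevity, raw random data stand in for the YP and Q(Q′Q)^{−1/2} matrices produced by steps 1 and 2):

```python
import numpy as np

rng = np.random.default_rng(2)
N, V, K = 8, 40, 3   # subjects, voxels, retained components (toy sizes)

# Toy stand-ins: Y plays the role of the (projected) data, and Qn
# is the orthonormalized OrT design matrix from step 2.
Y = rng.normal(size=(3 * N, V))
I, Z = np.eye(N), np.zeros((N, N))
Q = np.block([[I, Z], [I, I], [Z, I]])
lam, U = np.linalg.eigh(Q.T @ Q)
Qn = Q @ U @ np.diag(lam ** -0.5) @ U.T

# Step 3: SVD/PCA of the mean-centered transformed data Qn'Y.
T = Qn.T @ Y
T -= T.mean(axis=0)
_, _, Vt = np.linalg.svd(T, full_matrices=False)
V_eig = Vt.T                     # columns are the eigenimages

# Step 4: project the raw per-condition data onto the first K
# eigenimages, ...
B = Y[:N] @ V_eig[:, :K]
E1 = Y[N:2 * N] @ V_eig[:, :K]
E2 = Y[2 * N:] @ V_eig[:, :K]

# ... then regress the target contrast [1; -1] on the stacked
# 2N x K predictor array [E1 - B; E1 + B - 2 E2].
X = np.vstack([E1 - B, E1 + B - 2 * E2])
y = np.concatenate([np.ones(N), -np.ones(N)])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# The ordinal-trend pattern is the corresponding combination of
# eigenimages; its contrast scores feed the exceptions test below.
ort_pattern = V_eig[:, :K] @ beta
scores = X @ beta
assert ort_pattern.shape == (V,) and scores.shape == (2 * N,)
```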
Table 1: Tabulation of Type I Error Rates for the Number-of-Exceptions Criterion, Obtained from 10,000 Monte Carlo Simulations for 13 Subjects and 500 Regional Resolution Elements.

Exceptions     PC1      PC1-2    PC1-3    PC1-4    PC1-5    PC1-6
0              0.000    0.000    0.001    0.004    0.012    0.030
1              0.001    0.005    0.015    0.041    0.088    0.167
2              0.011    0.038    0.090    0.176    0.291    0.432
3              0.059    0.156    0.290    0.440    0.589    0.724
4              0.209    0.409    0.591    0.739    0.844    0.916
5              0.533    0.730    0.856    0.925    0.966    0.985
6              1.000    1.000    1.000    1.000    1.000    1.000
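A Monte Carlo harness of the kind used to build Table 1 might look like the following sketch, where `compute_exceptions` is a hypothetical callable standing in for steps 1 to 4 applied to one simulated noise data set (the toy stand-in below is illustrative only and does not reproduce the table's values):

```python
import numpy as np

def exception_type1_rates(compute_exceptions, n_sim, n_subjects,
                          n_resels, max_exc, rng):
    # P(observing k or fewer exceptions | pure gaussian noise),
    # for k = 0..max_exc, estimated over n_sim simulated data sets.
    exc = np.array([
        compute_exceptions(rng.normal(size=(3 * n_subjects, n_resels)))
        for _ in range(n_sim)
    ])
    return np.array([(exc <= k).mean() for k in range(max_exc + 1)])

# Toy stand-in for steps 1-4: pretend each of 13 subjects is
# independently non-monotonic with probability 0.5 under noise.
rng = np.random.default_rng(0)
toy_statistic = lambda data: rng.binomial(13, 0.5)
rates = exception_type1_rates(toy_statistic, 500, 13, 50, 6, rng)
assert np.all(np.diff(rates) >= 0)   # cumulative rates never decrease
```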
In other words, the regression is a type of discriminant analysis that produces the linear combination of the K eigenimages, according to V(:, 1:K)β, whose mean expression changes maximally across task conditions. For the test of significance of the ordinal trend, we compute the task-subject scores for this new linear combination image according to the right-hand side of the above regression equation. The test of significance is based on the minimum number of exceptions to a perfect segregation of these contrast scores, which is an inverse correlate of the maximum number of subjects who exhibit monotonic task-activity curves. Monte Carlo methods are used to calculate the type I error rate of ordinal trends based on the minimum number of exceptions to a perfect segregation of scores. Table 1 illustrates the type I error based on (1) fixing in advance the number of the principal components of [Q(Q′Q)^{−1/2}]′YP employed in the discriminant analysis described above and (2) task × subject images that have been smoothed to 500 regional resolution elements ("resels"), where images contain resels of independently and normally distributed noise. Naturally, different error rates are obtained depending on the particular decision rule that is used to select principal components 1 to K from [Q(Q′Q)^{−1/2}]′YP and the number of resels per image.

Step 5: Bootstrap estimation of the robustness of voxel weights in the ordinal trend topographic estimate.

A bootstrap resampling procedure can be used to estimate the variability of the regional weights in the patterns about their point estimate values. The complete analysis (steps 1–4) that was performed on the original subject sample to arrive at the ordinal trend topographic estimate is usually repeated 100 to 1000 times on samples of subjects that have been chosen randomly with replacement from the original subject pool.²

²The computational requirements of this resampling process are as follows: for our fMRI example using 16 subjects and 500 iterations on a Linux workstation with 4 GB of RAM, the bootstrap estimation process took about 2 hours.

The inverse coefficient of variation (ICV) serves as the measure of the reliability of the regional weight at each voxel in the topographic pattern. ICV is computed from the point estimate of the regional weight, w_voxel, and the variability of the resampling process around this point estimate, captured as the standard deviation σ_voxel, as

\[
\mathrm{ICV}_{\text{voxel}} = \frac{w_{\text{voxel}}}{\sigma_{\text{voxel}}} \sim N(0, 1),
\]
and is approximately standard normally distributed (Efron & Tibshirani, 1994). The larger the absolute magnitude of ICV_voxel, the smaller the relative variability of the regional weight about its point estimate value. We adopted a threshold of |ICV_voxel| > 2 for the two examples discussed in the next section. Under the assumption of a standard normal distribution, this corresponds to a one-tailed p-level of p < 0.0228.

3 Application to Event-Related fMRI Study of Working Memory

Methodological details are spelled out for a typical application of OrT to event-related fMRI data. The application is a study of verbal working memory, some of whose results have been published in Habeck et al. (2004). Here we present results that were not included in the previous publication.

3.1 Study Design. Eighteen young subjects (age = 26.3 ± 4.9 years) participated in an event-related functional magnetic resonance imaging (efMRI) paradigm of a delayed-match-to-sample (DMS) task. The initial scan occurred at 9 A.M. ("PRE"), and the follow-up scan occurred at the same time of day 48 hours later ("POST"), after 48 hours of prolonged wakefulness; keeping the scan time constant eliminated confounding circadian effects. All subjects had been carefully screened for normal sleep patterns for 2 weeks prior to the experiments and for the absence of neurological or psychiatric contraindications. (Details are available elsewhere (Habeck et al., 2004) and are omitted here for brevity.) Finally, 22 additional subjects (age = 23.93 ± 1.14 years) participated in the PRE scan but did not undergo the sleep deprivation protocol. The DMS task was a variant of the Sternberg task (Sternberg, 1966, 1969). A trial lasted 16 seconds. Subjects were instructed to respond as accurately as possible. No feedback about their performance was given.
The sequence of trial events was as follows: first, a fixed 3-second period of blank presentation marked the beginning of the trial; then, during the stimulus period of the task, an array of one, three, or six uppercase letters was presented for 3 seconds (the stimulus phase). With the offset of the visual stimulus, subjects were instructed to focus on the blank screen and hold the stimulus items in mind for a 7-second maintenance interval (the retention phase). Finally, a probe appeared for 3 seconds (the probe phase), which was a lowercase letter centered in the field of view. In response to the probe,
subjects indicated by a button press whether the probe matched a letter in the study array (the left index finger indicated yes, and the right index finger indicated no). Each of three experimental blocks contained 10 trials for each of three set sizes, with 5 true-negative and 5 true-positive probes per set size. There were 10 × 3 × 3 = 90 experimental trials per scanning session. In addition to the fixed 3-second period of blank screen presentation, which we counted as part of the experimental trial, there were intertrial intervals (ITIs) that consisted of presentation of a blank screen and were used as baseline epochs in the time-series analysis of the subject's data. Their length was variable and determined in the following way: 70 2-second increments were available throughout the whole block for the 30 intertrial intervals, and it was decided stochastically whether a 2-second increment of ITI would be inserted prior to the start of a trial or whether the trial would begin immediately. The details of this procedure have been reported elsewhere (Habeck et al., 2004). With 30 trials of 16 seconds each, each block lasted 140 + (30 × 16) = 620 seconds. There were two breaks of approximately 1 minute each, between blocks 1 and 2 and between blocks 2 and 3. This brings the overall time each subject spent in the scanner to (3 × 620) + 120 = 1980 seconds, or 33 minutes. On the evening before the first day of fMRI scanning, every subject received seven blocks of initial training on the experimental setup: the first six training blocks were run with feedback and the seventh without. On the first day of fMRI scanning, all subjects were well rested. Reported here are new OrT/CVA analyses of the fMRI data sets of PRE and POST sleep deprivation; the focus is on the functional activity of the retention phase.
The first analytic goal was to recover a pattern of sustained functional connectivity in which subjects expressed ordinal trends with an increasing memory load of 1-, 3-, and 6-letter arrays. The OrT/CVA analysis was initially performed on fMRI data from a subgroup of the 40 subjects who participated in our working memory study: the subgroup consisted of 16 randomly selected subjects. Subsequent to this analysis, the plan was to apply the presumptive OrT pattern obtained from the analysis of 16 subjects to the fMRI data of the 24 additional individuals, where there were again three load conditions per subject, and again subjects were well rested. The aim of this ”forward application” was to illustrate that the original ordinal trend effect could be replicated in an independent subject sample of comparable size. Successful forward application provided a demonstration that an accurate estimate of the OrT pattern had been obtained from the initial OrT/CVA analysis. The forward application of an OrT pattern simply entailed the calculation of pattern expression in the individual fMRI images obtained for the retention interval for each of the 24 subjects, for each of the three load conditions. For each subject and load condition, pattern expression is a scalar
value that is the inner product between each raw fMRI image and the OrT pattern image: the inner product is simply the voxel-by-voxel multiplication of the weights in the fMRI and OrT pattern images, summed over the whole brain. (See section 2.1 for a description of this operation in terms of vector notation.) Provided that the ordinal trend effect was replicated in the independent group of 24 well-rested subjects, our plan was to assess whether this pattern of sustained functional connectivity revealed in the well-rested state was preserved after sleep deprivation. Concretely, our plan was to forward-apply the OrT pattern estimate of the well-rested state to the fMRI data of the 18 sleep-deprived subjects, where again there were three load conditions (1-, 3-, and 6-letter arrays) per subject. This forward application of the OrT pattern provided a test for ordinal trends with maximal statistical degrees of freedom, where the statistical power of the forward application was dependent on the accuracy of the original OrT pattern estimate.

3.2 FMRI Preprocessing Steps. Functional images were acquired using a 1.5 Tesla magnetic resonance scanner (Philips). A gradient echo EPI sequence (TE = 50 ms; TR = 3 sec; flip angle = 90°) and a standard quadrature head coil were used to acquire T2*-weighted images with an in-plane resolution of 3.124 mm × 3.124 mm (64 × 64 matrix; 20 cm² field of view). Based on T1 "scout" images, 8 mm transaxial slices (15–17) were acquired. Following the fMRI runs, a high (in-plane) resolution T2 image at the same slice locations used in the fMRI run was acquired using a fast spin echo sequence (TE = 100 ms; TR = 3 sec; 256 × 256 matrix; 20 cm² field of view). Task administration and data collection were controlled by a computer running appropriate software (Psyscope 1.1) in electronic synchrony with the MR scanner. Task stimuli were back-projected onto a screen located at the foot of the MRI bed using an LCD projector.
Subjects viewed the screen via a mirror system located in the head coil. Task responses were made on a LUMItouch response system, and behavioral response data were recorded on the task computer. All image processing and analysis were done using the SPM99 program (Wellcome Department of Cognitive Neurology) and supporting code written in MATLAB 6.0 (Mathworks, Natick, MA). FMRI time series were corrected for order of slice acquisition. All functional volumes in a given subject were realigned to the first volume from the first run of each study. The T2 anatomical image was then coregistered to the first functional volume, using the mutual information coregistration algorithm implemented in SPM99. This coregistered structural image was then used in determining the parameters of a nonlinear spatial normalization (7 × 8 × 7 nonlinear basis functions) into the Talairach standard space defined by the Montreal Neurological Institute template brain, as implemented in SPM99. These normalization parameters were then applied to the functional data (using sinc interpolation to reslice the images to 2 mm × 2 mm × 2 mm).
In a level 1 time-series analysis of the individual subject data, the fMRI responses to the three separate temporal components of the task, in each experimental condition and in each block, were fit to separate sets of predictor variables (Zarahn, 2000). The predictor variables of the time-series modeling were the following: a constant intercept (0th-order discrete cosine set) was chosen for the stimulus and probe phases, whereas a 0th- to 2nd-order discrete cosine set was chosen for the retention phase. For one block, this results in five predictor variables (one for stimulus, three for retention, one for probe) per set size (one, three, and six) per probe type (positive or negative). An additional intercept term is provided for the effect of block, bringing the total number of predictor variables per block to (5 × 3 × 2) + 1 = 31. Predictor variables had a nonzero value at every point in the time series where a particular condition was met and a zero value at every other point. For example, one predictor had a value of one during all stimulus phases of set size one, with a positive probe, during the first block. The set was convolved with a canonical hemodynamic response waveform (a sum of two gamma functions, as specified in the SPM99 program; Friston et al., 1998) whose beginning was marked by the appropriate onset vector for each epoch, set size, and probe type. The resulting time-series vectors were used in the design matrix for the within-subject model estimation. The number of rows was the total number of volumes in the complete fMRI time series across the scanning session. The number of columns was 3 × 31 = 93, with 31 design vectors for each experimental block. The bandpass-filtered (low-pass by a gaussian with a FWHM of 4 sec, and a high-pass cutoff of 14.5 mHz) fMRI time series at each voxel were regressed onto these predictor variables.
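The construction of such predictors can be sketched as follows (a common double-gamma approximation to the canonical hemodynamic response; the parameters, onsets, and session length below are illustrative, not SPM99's exact values):

```python
import numpy as np
from math import gamma as gamma_fn

TR = 3.0                       # repetition time in seconds, as above
t = np.arange(0.0, 33.0, TR)   # HRF support sampled at scan times

def gamma_pdf(t, shape):
    # Density of a Gamma(shape, 1) variable, evaluated elementwise.
    return t ** (shape - 1) * np.exp(-t) / gamma_fn(shape)

# Double-gamma approximation of the canonical hemodynamic response
# (peak near 6 s, undershoot near 16 s).
hrf = gamma_pdf(t, 6) - gamma_pdf(t, 16) / 6.0

# Boxcar predictor: one where the condition holds, zero elsewhere.
n_scans = 120
onsets = [10, 40, 70]          # hypothetical scan indices
box = np.zeros(n_scans)
box[onsets] = 1.0

# Convolution with the HRF, truncated to the session length, yields
# the regressor entered into the within-subject design matrix.
predictor = np.convolve(box, hrf)[:n_scans]
assert predictor.shape == (n_scans,)
```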
A first-order autoregressive autocorrelation model was fit to the residuals to make statistical inference more robust to the intrinsic temporal autocorrelation structure (Friston et al., 2000). At every voxel in the image, components of the event-related responses that matched the canonical hemodynamic response waveform were estimated for the whole scanning session. Linear contrasts assessed the amplitudes (normalized regression coefficients) of these components. A typical contrast used in our analysis, for instance, would be activity during the retention phase for six items, collapsed across probe types and experimental blocks, versus activity in the ITI blank period. This method of time-series modeling and contrast estimation at each voxel reduces the number of images to one per subject per condition. To account for gain differences between fMRI sessions, activation values were normalized by their voxel averages. The resulting parametric map images were smoothed using an isotropic gaussian kernel (FWHM = 8 mm) and used as the data in the subsequent analysis. Afterward, a probabilistic gray matter mask was applied with a threshold of 0.5: every voxel submitted to the analysis had a probability of at least 0.5 of being gray matter. The resulting masked brain images contained 115 resolution elements, as indicated by SPM99. These parametric maps serve
as the dependent variables for the subsequent population-level OrT/CVA analysis.

3.3 Results. In the initial OrT/CVA analysis of the retention period of the working memory task, which involved 16 subjects and three levels of memory load (one-, three-, and six-letter arrays), the first two principal components of the OrT/CVA combined linearly to produce an activation pattern that expressed a statistically significant ordinal trend effect. Here, statistical significance indicated that a regional activation pattern was present in the retention period whose functional connectivity was sustained across increasing levels of memory load. This OrT pattern accounted for 5.8% of the variance in the raw fMRI data set. Brain regions that concomitantly increased in activation (as ascertained by the bootstrap test with a threshold of |ICV_voxel| > 2) for the majority of subjects as a function of memory load were found mainly in parietal areas (BA 7 and BA 40), frontal/prefrontal areas (BA 6, 8, 9), right fusiform gyrus (BA 19), and left superior temporal gyrus (BA 22). Brain regions that concomitantly decreased in activation for the majority of subjects were found mainly in the anterior and posterior cingulate gyri (BA 31, 24), insula (BA 13), cuneus (BA 19), right parahippocampal gyrus (BA 19), and medial frontal gyrus (BA 10). For a complete listing of areas of both increased and decreased activation, see Tables 2 and 3 and Figure 2. Based on the number-of-exceptions statistic (described in step 4 of the OrT/CVA algorithm), there was a significant ordinal trend (p < 0.01, 2 exceptions; see Figure 3). This OrT pattern of load-related regional activations was forward-applied to the fMRI data set of the additional 24 subjects, who also were scanned while well rested. The matrix of pattern expression values, for three load conditions for each of 24 subjects, yielded a value of 5 for the number-of-exceptions statistic and p < 0.001.
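Forward application reduces to inner products and an exceptions count; a sketch follows (toy data; reading the exceptions statistic as the number of non-monotonic subjects is one plausible interpretation, not necessarily the authors' exact rule):

```python
import numpy as np

rng = np.random.default_rng(3)
N, V = 24, 40                       # subjects, voxels (toy sizes)
ort_pattern = rng.normal(size=V)    # stand-in for the OrT estimate

# One retention-phase image per subject per memory load (toy data).
load1, load3, load6 = (rng.normal(size=(N, V)) for _ in range(3))

# Forward application: expression = inner product of each image with
# the pattern (voxel-wise products summed over the brain).
s1, s3, s6 = (imgs @ ort_pattern for imgs in (load1, load3, load6))

# Exceptions statistic: subjects whose expression is not strictly
# increasing with memory load.
exceptions = int(np.sum(~((s1 < s3) & (s3 < s6))))
assert 0 <= exceptions <= N
```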
The p-value was computed using a Monte Carlo method similar to that described in step 4 of the OrT/CVA algorithm, where the p-value is the probability of obtaining a statistic of five or less from data sets generated from gaussian random noise. The p-value reported here is based on 10,000 Monte Carlo simulations of data sets in which individual images contained 115 resolution elements each. Although the OrT pattern accounted for 5.8% of the variance in the fMRI data set from which it was originally derived, the same OrT pattern accounted for less variance (i.e., 2.0%) in the fMRI data set into which it had been forward applied (the data set of the 24 additional subjects). This reduction in the variance accounted for most likely reflects a limitation in the accuracy with which a true OrT pattern can be estimated from an original sample of 16 subjects. Notwithstanding, this reduction does not detract from the fact that ordinal trends were expressed to a significant degree by the estimated OrT pattern in the fMRI data set of the 24 additional subjects. Indeed, this outcome is consistent
Table 2: FMRI Example: Talairach Locations of Nearest Gray Matter Locations with Significant Increased Activation Across Memory Load as Ascertained by a Bootstrap Resampling Test (ICV > 2.0).

  X     Y     Z    Anatomical Description      Brodmann Area
  32   −52    50   Superior parietal lobule    Brodmann area 7
  42   −42    56   Inferior parietal lobule    Brodmann area 40
  −8     3    62   Medial frontal gyrus        Brodmann area 6
  28   −65   −15   Declive                     Cerebellum
  38   −73   −15   Declive                     Cerebellum
   6    −3    11   Thalamus                    Thalamus
  −4   −11    13   Thalamus                    Medial dorsal nucleus
 −40     0    48   Middle frontal gyrus        Brodmann area 6
 −24     1    55   Subgyral                    Brodmann area 6
   8    14     3   Caudate                     Caudate head
 −22   −62    45   Superior parietal lobule    Brodmann area 7
 −28   −72    46   Superior parietal lobule    Brodmann area 7
 −59   −39    30   Inferior parietal lobule    Brodmann area 40
  42    40    18   Middle frontal gyrus        Brodmann area 10
  34    42    26   Middle frontal gyrus        Brodmann area 9
 −30   −73   −13   Fusiform gyrus              Brodmann area 19
 −42   −76   −10   Middle occipital gyrus      Brodmann area 18
  −6     6     0   Caudate                     Caudate head
  −6    14     1   Caudate                     Caudate head
 −42    41     7   Inferior frontal gyrus      Brodmann area 46
 −40   −61   −14   Fusiform gyrus              Brodmann area 37
  51   −68     9   Middle occipital gyrus      Brodmann area 19
  65   −39     0   Middle temporal gyrus       Brodmann area 21
  14   −80   −16   Declive                     Cerebellum
   6   −81   −18   Declive                     Cerebellum
  12   −65    53   Superior parietal lobule    Brodmann area 7
  53   −53   −14   Inferior temporal gyrus     Brodmann area 20
 −51     0    33   Precentral gyrus            Brodmann area 6

Source: Results come from Talairach Daemon Client 1.1, Research Imaging Center, University of Texas Health Science Center at San Antonio.
with the notion that, compared to a separate OrT/CVA analysis of a new data set, a substantial gain in statistical power can be achieved by the forward application of a previously obtained OrT pattern estimate. In contrast to the above results, the forward application of the OrT pattern did not reveal a significant ordinal trend effect in the comparable fMRI data of the 18 sleep-deprived subjects. In these latter subjects, who performed the working memory task after 48 hours of sleep deprivation, the forward application of the OrT pattern of the well-rested state produced six exceptions and p = 0.11. A separate OrT/CVA analysis of the 18 sleep-deprived subjects, again performed on the fMRI data set of the retention period, also failed to produce an activation pattern that expressed significant ordinal trends.
Table 3: FMRI Example: Talairach Locations of Nearest Gray Matter Locations with Significant Decreased Activation Across Memory Load as Ascertained by a Bootstrap Resampling Test (ICV < −2.0).

  X     Y     Z    Anatomical Description       Brodmann Area
 −44   −66    36   Angular gyrus                Brodmann area 39
 −57   −57    34   Supramarginal gyrus          Brodmann area 40
 −59   −49    36   Supramarginal gyrus          Brodmann area 40
 −12   −66   −39   Inferior semilunar lobule    Cerebellum
  24    27    41   Middle frontal gyrus         Brodmann area 8
  38   −15    14   Insula                       Brodmann area 13
  28   −45    −3   Parahippocampal gyrus        Brodmann area 19
  46   −52   −24   Tuber                        Cerebellum
  38   −75   −30   Tuber                        Cerebellum
  40   −56   −33   Cerebellar tonsil            Cerebellum
  59   −53    36   Supramarginal gyrus          Brodmann area 40
  44   −53    23   Superior temporal gyrus      Brodmann area 39
  14   −88    25   Cuneus                       Brodmann area 19
 −12   −88    28   Cuneus                       Brodmann area 19
  67   −32    16   Superior temporal gyrus      Brodmann area 22
   4    50    20   Medial frontal gyrus         Brodmann area 9
 −28    13    −4   Claustrum                    Cerebellum
 −61   −53    23   Supramarginal gyrus          Brodmann area 40
 −20    39    35   Superior frontal gyrus       Brodmann area 9
  65   −16    −1   Superior temporal gyrus      Brodmann area 21
  46   −21    −1   Superior temporal gyrus      Brodmann area 22
  26    −6    −3   Lentiform nucleus            Putamen
 −57    −9   −16   Inferior temporal gyrus      Brodmann area 21
 −14    20    58   Superior frontal gyrus       Brodmann area 6
 −42   −66     9   Middle temporal gyrus        Brodmann area 37
   8   −14    −3   Brainstem                    Subthalamic nucleus
  22   −39   −38   Cerebellar tonsil            Cerebellum
  55     5   −12   Middle temporal gyrus        Brodmann area 21
 −63   −14    −9   Middle temporal gyrus        Brodmann area 21
 −53   −65   −12   Fusiform gyrus               Brodmann area 19
  26   −50   −24   Culmen                       Cerebellum
 −20   −29     5   Thalamus                     Pulvinar
 −24    −8    −5   Lentiform nucleus            Lateral globus pallidus
  12   −25    12   Thalamus                     Pulvinar
  38   −79     6   Middle occipital gyrus       Brodmann area 19
 −53    −5   −17   Middle temporal gyrus        Brodmann area 21
  14   −44   −35   Cerebellar tonsil            Cerebellum
   2     1    29   Cingulate gyrus              Brodmann area 24
 −61    −7    21   Postcentral gyrus            Brodmann area 43
  −4   −50   −33   Cerebellar tonsil            Cerebellum
  −8   −75    24   Cuneus                       Brodmann area 18
 −32   −47   −13   Fusiform gyrus               Brodmann area 37
  67   −35   −10   Middle temporal gyrus        Brodmann area 21

Source: Results come from Talairach Daemon Client 1.1, Research Imaging Center, University of Texas Health Science Center at San Antonio.
Figure 2: Activation pattern whose subject expression shows an ordinal trend across memory load during the retention period. The pattern estimate is based on the first two OrT principal components. Pattern voxels whose inverse coefficient of variation (ICV) exceeded an absolute threshold value of 2 are shown in sagittal, coronal, and transverse projection views, produced with the SPM99 software package. (ICV values were estimated using a bootstrap method.) (A) Positively weighted areas—areas that increase in activation across memory load for a majority of subjects. (B) Negatively weighted areas—areas that decrease in activation across memory load for a majority of subjects.
In summary, in analyzing the effects of sleep deprivation on working memory, we first applied the OrT/CVA analysis to the fMRI data of well-rested subjects to obtain an activation pattern indicative of load-related processing operative during the delay period. Although the functional connectivity captured in this activation pattern appeared to be sustained with increasing memory load in well-rested subjects, our results suggest that it had been disrupted in subjects who were sleep deprived for 48 hours. Indeed, neither a forward application of the load-related pattern nor a separate OrT/CVA of the fMRI data set of the 18 sleep-deprived subjects produced a significant ordinal trend effect.
4 Application to Imaging of Visuomotor Learning Using PET

We offer here a second application of the OrT methodology, in this case to neuroimaging data obtained in a study that used H2¹⁵O PET to investigate a subtle form of visuomotor adaptation, the learning of a novel visuomotor gain. The neural response to the cognitive challenge was not detectable using the conventional brain-wide analysis of voxel activity (statistical parametric mapping, SPM; Friston et al., 1996). Notwithstanding, some of the sites of activation were predictable a priori and included brain regions that
would be expected to be strongly interactive. Indeed, an SPM analysis that was restricted to predictable activation sites did reveal significant responses (p < 0.05, corrected) during adaptation (Krakauer et al., 2003). The aims of the OrT analysis were more ambitious: to demonstrate that a brain-wide analysis, different from a voxel-wise SPM99 analysis, could detect activations in the predicted regions, and that the spatial covariance pattern was significantly associated with visuomotor adaptation—that is, that the expression of the OrT activation pattern provides a reliable account of the subject differences in visuomotor adaptation. The results of the OrT analysis met both aims.
4.1 Study Design. The neuroimaging study examined a form of visuomotor adaptation in which subjects performed reaching movements. Individuals additionally had the task of learning to rescale the spatial mapping between actual hand movements and the visual appearance of their trajectories displayed on a monitor (Krakauer et al., 2003). The imaging technology used was H2¹⁵O PET; 10 subjects participated in the study. Basic task requirements were described in detail in previous publications (Ghilardi et al., 2000; Nakamura et al., 2001). In brief, all tasks required subjects to move a handheld cursor with their right hand on a digitizing tablet (Numonics Corporation, Model 2200) while their hand and target locations were displayed on a 15-inch computer screen. A computer controlled the experiment, generated the screen displays, and acquired kinematic data from the digitizing tablet at 200 Hz. On the day prior to PET scanning, all subjects received a session of training on the experimental setup, during which time they achieved a level of errorless performance on a baseline condition. The baseline condition (CONTROL) required subjects to move a cursor out and back in one uninterrupted movement from a central starting position to one of eight radially arrayed circular targets. In this condition, the relation between tablet and screen was one to one: the extent and the direction of the hand trajectory on the tablet were replicated on the screen. Each out-and-back movement took 1 second, and the succession of eight out-and-back movements was counterclockwise. This eight-target cycle was repeated eight times over a 96-second period. During PET scanning, a novel learning condition (GAINalt) was covertly introduced in which the tablet-to-screen gain was altered every two cycles between 1:1.5 and 1:0.5, thereby maintaining a relatively constant level of challenge across the 96-second period.
Two nonconsecutive scans were acquired for each subject performing the GAINalt task, denoted GAINalt1 and GAINalt2. Subject performance during each scan was characterized as a series of learning curves, one for each pair of cycles of constant gain, which were well fitted with single exponential functions. The coefficients of these exponential functions were used as estimates of the average rates of adaptation in the session. For this, the coefficients for the 1:1.5 and 1:0.5 gain change epochs were combined
C. Habeck et al.
to obtain a mean adaptation rate for each scan and subject. This estimated rate of adaptation was the performance variable used in the correlational analysis with OrT pattern expression. 4.2 PET Preprocessing Steps. The PET data analyzed were 3D H₂¹⁵O PET scans of 96-second duration. The same raw count images were used in both the SPM and OrT/CVA analyses, where the images were smoothed, aligned, and mapped into MNI coordinates using the SPM99 package (SPM99, Wellcome Department of Cognitive Neurology). Raw images were masked with a probabilistic gray matter mask at a threshold of 0.2. The entire masked raw image of each subject and condition was used in the OrT analysis. However, in the spatially restricted SPM analysis, the brain areas included were limited to the left sensorimotor cortex (BA 1, 2, 3, and 4), premotor cortex and SMA (BA 6), posterior cingulate (BA 23), parietal (BA 5, 7, 40), and visual areas (BA 17, 18), as well as the subcortical areas of the left putamen, globus pallidus, and thalamus. Both the brain-wide SPM analysis and the spatially restricted SPM analysis sought to identify differences between CONTROL activation and the average activation across the two GAINalt scans. By contrast, the OrT analysis was designed to allow for the possibility of task repetition effects by modeling the following ordered triad of conditions: (1) the initial 96-second period of alternating gain (GAINalt1), (2) the second such period (GAINalt2), and (3) baseline (CONTROL). In terms of the OrT nomenclature, the CONTROL task served as the baseline condition B; GAINalt2, as the condition of intermediate challenge E1; and GAINalt1, as the condition of highest challenge E2. This OrT analysis therefore permitted physiological repetition-suppression effects that took the form of a negative ordinal trend across the prespecified task ordering (Ungerleider, Doyon, & Karni, 2002).
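The per-scan adaptation-rate estimate described in section 4.1 (single-exponential fits to each two-cycle learning curve, with the fitted coefficients averaged) can be sketched as follows. This is only an illustration under assumptions: the error series are synthetic, and the exponential fit is done by log-linear least squares rather than by the study's actual fitting procedure.

```python
import numpy as np

# Hedged sketch: estimate per-epoch adaptation rates by fitting a single
# exponential error(t) = a * exp(-r * t) to each two-cycle learning curve,
# then average the rate coefficients (as the text describes). The data
# below are synthetic; the real study fit subjects' actual error series.

def fit_exponential_rate(errors):
    """Log-linear least-squares fit of a*exp(-r*t); returns the rate r."""
    t = np.arange(len(errors))
    slope, _ = np.polyfit(t, np.log(errors), 1)
    return -slope  # decay rate

rng = np.random.default_rng(0)
true_rate = 0.5
# Four gain-change epochs per scan (illustrative), six trials each,
# with mild multiplicative noise on an exponentially decaying error.
epochs = [8.0 * np.exp(-true_rate * np.arange(6)) * rng.uniform(0.9, 1.1, 6)
          for _ in range(4)]

mean_rate = np.mean([fit_exponential_rate(e) for e in epochs])
print(round(mean_rate, 2))  # close to the true decay rate of 0.5
```

The averaged coefficient plays the role of the "mean adaptation rate" that the article correlates with OrT pattern expression.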
In addition, OrT allowed individual differences in the expression of the activation pattern that could be accounted for by subject differences in the repetition effect or the rate of visuomotor adaptation. 4.3 Results. The OrT method identified a pattern of regional activity, a linear combination of the first two principal components, whose change in expression between GAINalt1 and CONTROL was significantly correlated with the subject rate of adaptation (see Figure 4). To obtain this pattern, the difference in adaptation rate between GAINalt1
Figure 3: Subject expression of the memory load–related OrT pattern constructed from the first two principal components of the data from the retention phase of 16 subjects for 1, 3, and 6 letters. Every subject's expression in the 1-letter condition has been subtracted from the expression of all three conditions to heighten the visual impression of the variability in curve shapes. As a consequence, every
Ordinal Trend Canonical Variates Analysis
Figure 3: (cont.) subject's pattern expression in the 1-letter conditions is now zero. (A) Task × subject expression curves for the 16 subjects from whom the pattern was originally derived. The number-of-exceptions statistic has the value 2, resulting in p < 0.01. (B) Forward application of the memory load–related pattern to 24 additional subjects, showing a preserved relationship between the task × subject expression of the pattern and memory load. The number-of-exceptions statistic has the value 5, confirming a significant ordinal trend, p < 0.001. (C) Forward application of the memory load–related pattern to the 18 sleep-deprived subjects, immediately following 48 hours of sleep deprivation. A relationship between the task × subject expression of the pattern and memory load is no longer evident. The number-of-exceptions statistic has the value 6, p = 0.11, which is insufficient to reject the null hypothesis of the absence of an ordinal trend.
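The number-of-exceptions statistic cited in the caption can be sketched under an assumption suggested by the text: an exception is a subject whose expression curve fails to increase monotonically across the ordered conditions. The data and the helper name `number_of_exceptions` below are hypothetical; in the article, the tally's significance is calibrated by Monte Carlo.

```python
import numpy as np

# Hedged sketch of a number-of-exceptions tally, assuming (as the captions
# suggest) that an "exception" is a subject whose pattern-expression curve
# is not strictly increasing across the ordered conditions.

def number_of_exceptions(expression):
    """expression: (subjects x conditions) array, conditions in predicted order."""
    increasing = np.all(np.diff(expression, axis=1) > 0, axis=1)
    return int(np.sum(~increasing))

# Illustrative data: 16 subjects x 3 load conditions, monotonic by
# construction, then two curves are deliberately made to dip.
rng = np.random.default_rng(1)
curves = np.cumsum(rng.uniform(0.1, 1.0, size=(16, 3)), axis=1)
curves[3, 2] = curves[3, 1] - 0.5   # make subject 3 an exception
curves[9, 1] = curves[9, 0] - 0.5   # make subject 9 an exception

print(number_of_exceptions(curves))  # -> 2
```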
[Figure 4: panel (A) plots % error reduction (initial 6 cycles, roughly 80 to 96) against subject network activity in the GAINalt1 − CONTROL subtraction; panel (B) plots subject network activity by task condition (CONTROL, GAINalt2, GAINalt1).]
Figure 4: (A) Relationship between the task × subject expression of the first two OrT principal components and subject rates of adaptation in the GAINalt1 and GAINalt2 conditions. Subject rates of adaptation in GAINalt1 were significantly predicted (R² = 0.88; p < 0.01) by a linear combination of the component expressions in the individual GAINalt1 − CONTROL subtraction images. Further, the expression of the same component combination in the GAINalt2 − CONTROL subtraction images predicted rates of adaptation in the repeat condition (R² = 0.55; p < 0.05; figure not shown). (B) Task-activity curves for each of 10 subjects for the activation pattern whose expression predicted subjects' rates of gain adaptation. Each subject's CONTROL value is subtracted from their GAINalt1 and GAINalt2 values, highlighting subject differences.
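The R² values reported in Figure 4A come from regressing subjects' adaptation rates on a linear combination of two component expressions. Below is a minimal sketch of those regression mechanics on synthetic data; all numbers and variable names are illustrative, not the study's.

```python
import numpy as np

# Hedged sketch: predict subjects' adaptation rates from a linear
# combination of two principal-component expressions and report R^2,
# as in Figure 4A. The data are synthetic; only the mechanics are shown.

rng = np.random.default_rng(7)
n = 10                                  # 10 subjects, as in the PET study
pc = rng.standard_normal((n, 2))        # component expressions (subtraction images)
rates = 1.5 * pc[:, 0] - 0.8 * pc[:, 1] + 0.1 * rng.standard_normal(n)

X = np.column_stack([np.ones(n), pc])   # design matrix with intercept
beta, *_ = np.linalg.lstsq(X, rates, rcond=None)
fitted = X @ beta
r2 = 1 - np.sum((rates - fitted) ** 2) / np.sum((rates - np.mean(rates)) ** 2)
print(round(r2, 2))  # high R^2 by construction of the synthetic data
```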
and CONTROL was used as the dependent variable in a multiple linear regression to produce a linear combination of the first two principal components. The p-value from the multiple regression analysis was p < 0.01. In addition, the subject × task expression of this pattern revealed that 9 of 10 subjects exhibited increasing ordinal trends from the CONTROL to GAINalt2 to GAINalt1 conditions (see Figure 4), which is a significant degree of concordance between pattern expression and the ordinal trend criterion (p < 0.05) according to the type I error rate computed using the Monte Carlo method described earlier. In other words, the spatially unrestricted OrT/CVA revealed a pattern of activation that was statistically significant based on the criterion of ordinal trends and, separately, on the successful prediction of subject performance scores. Nevertheless, subjects differed in the amount of decline, with some subjects showing right prefrontal-basal
Figure 5: Activation pattern whose expression in subjects' GAINalt1 − CONTROL and GAINalt2 − CONTROL subtraction images predicted the respective GAINalt1 and GAINalt2 rates of adaptation. The pattern estimate is based on the first two OrT principal components. Pattern voxel values that exceed a threshold value of 2 in their ICV are shown in sagittal, coronal, and transverse projection views (SPM99). The pattern consists of the left medial cerebellum; left/right basal ganglia and thalamus; left/right primary and secondary visual cortices (BA 17, 18, 37); and brainstem/pons. It also shows some right frontal and prefrontal activation (BA 6, 46, 47).
ganglionic-cerebellar activity that was reduced almost to their respective null-task levels, while others showed little or no reduction in activation (see Figure 4). The bootstrap pattern associated with gain adaptation identified areas similar to those identified by Krakauer et al. (2003)—the left and right putamen and the left cerebellum (see Figure 5 and Table 3)—but also showed additional activations as indicated in Table 1. It is important to note that the bootstrap method gives no explicit consideration to the spatial extent of activation. Nevertheless, the number of contiguous voxels that reach significance may be a further indication of the importance of each region of the OrT pattern in mediating gain adaptation. Between the 96-second period in which subjects were initially challenged with tablet-to-screen gain changes and the second 96-second period of gain changes, individuals showed a substantial mean decline in their brain responses to the challenge in right prefrontal, basal ganglionic, and cerebellar activity. Were this level of decline to be replicated in a forward application of the OrT pattern to a new subject sample, the mean difference in OrT pattern expression would be significant at p < 0.05. 5 Discussion The role that OrT is expected to play in functional neuroimaging and cognitive neuroscience can be summarized as follows. OrT takes its place
alongside the voxel-seeded PLS analysis as only the second spatial covariance model that is specifically designed to recover latent aspects of functional connectivity in neuroimaging studies that involve parametric experimental designs. Both OrT and voxel-seeded PLS recover information about connectivity based solely on experimental design variables. In particular, neither the OrT nor the voxel-seeded PLS analysis requires a quantitative model of the uncertain relationship between functional brain circuitry and subject variables, such as assessments of performance in individual task conditions or general skill level (e.g., IQ or level of education). OrT and PLS analyses target distinct types of task × subject interactions, where each analysis reveals a different aspect of functional connectivity. In this regard, OrT is different in two respects. First, it models a type of task × subject interaction associated with sustained functional connectivity across graduated changes in task parameters. In fact, the interaction modeled by OrT contains information about functional connectivity that previously has not been used in either spatial covariance modeling or voxel-wise univariate and multivariate linear modeling. Second, although voxel-seeded PLS requires partial information about the brain regions involved in patterns of functional connectivity, OrT does not. OrT is guided simply by the theoretical prediction of directional changes in regional activity with changes in task parameter values. In short, OrT, like voxel-seeded PLS, represents a unique omnibus test of functional connectivity that is performed across multiple task conditions and the entire brain. From an applied perspective, we have presented the results of the OrT analysis of both event-related fMRI and H₂¹⁵O PET studies of memory and learning.
In part, the goal has been to demonstrate the statistical methods that are used to evaluate the specificity and sensitivity of the OrT method: detection of latent patterns that express ordinal trends on a subject-by-subject basis, estimation of the salience of individual regions (voxels) in the latent OrT patterns, and assessment of the reliability of the regional weights. In addition, the empirical findings appear to have scientific merit in their own right. 5.1 Event-Related fMRI Study of Working Memory. We applied the OrT/CVA method to the retention data of a delayed-match-to-sample task in order to identify an activation pattern whose subject × task expression would reveal a positive monotonic trend with memory load. Such a pattern was successfully derived from a sample of 16 subjects (p < 0.01) and its validity confirmed through forward application to a replication sample of 24 subjects (p < 0.001). Figure 5 and Table 4 depict areas whose activation increased in parallel with memory load: the inferior and superior parietal lobe (BA 40, 7), the middle frontal gyrus (BA 9), and the left superior temporal gyrus (BA 22); the last region invites speculation that auditory rehearsal is taking place during the retention period (Baddeley, 2003). There also were areas that decreased their activation with increasing memory demand
Table 4: PET Example: Talairach Locations of Nearest Gray Matter Whose Significant Contribution to the Pattern Associated with Adaptation Rate Was Ascertained by a Bootstrap Resampling Test (ICV > 2).

  X    Y    Z   Anatomical Description     Brodmann Area
 −36  −47  −13  Fusiform Gyrus             Brodmann area 37
  50   41    9  Inferior Frontal Gyrus     Brodmann area 46
  55    4   31  Precentral Gyrus           Brodmann area 6
  28  −12    2  Lentiform Nucleus          Putamen ∗
  36   −6   −1  Claustrum
 −26    4    2  Lentiform Nucleus          Putamen
  40    9   −6  Insula                     Brodmann area 13
 −51  −44  −16  Fusiform Gyrus             Brodmann area 37
  50  −53  −12  Inferior Temporal Gyrus    Brodmann area 20
  53  −65   −9  Middle Occipital Gyrus     Brodmann area 37
 −63  −32   13  Superior Temporal Gyrus    Brodmann area 22
 −14  −23    5  Thalamus                   Pulvinar
 −10  −44    8  Posterior Cingulate        Brodmann area 29
  22   15   −4  Lentiform Nucleus          Putamen
 −20  −26   −9  Parahippocampal Gyrus      Brodmann area 28
  42  −83    6  Middle Occipital Gyrus     Brodmann area 19
  46  −46   47  Inferior Parietal Lobule   Brodmann area 40
 −18  −23   12  Thalamus                   Pulvinar ∗
  32    8    7  Claustrum
  24  −84   −8  Middle Occipital Gyrus     Brodmann area 18
  14  −88   23  Cuneus                     Brodmann area 19
 −20  −88   −7  Middle Occipital Gyrus     Brodmann area 18
  67  −42   13  Superior Temporal Gyrus    Brodmann area 22
  −8  −60   47  Precuneus                  Brodmann area 7
  67  −44   10  Superior Temporal Gyrus    Brodmann area 22
Source: Results come from Talairach Daemon Client 1.1, Research Imaging Center, University of Texas Health Science Center at San Antonio.
during the retention period, featuring the anterior and posterior cingulate gyri (BA 31, 24) and the medial frontal gyrus (BA 10). Deactivations with experimental task parameters have received more attention in recent years and offer some points of contact with our results. The specific neuroanatomy of connections between medial and lateral prefrontal cortices as well as other cortical areas is an area of ongoing research (Barbas, 2000; Barbas, Ghashghaei, Dombrowski, & Rempel-Clower, 1999), which posits that the connectivity among these particular regions of PFC and posterior regions (involved in oculomotor guidance and spatial attention) contributes to the synthesis of memory, cognition, and emotion in general. A recent study using a working-memory N-back design (Pochon et al., 2002) also found anterior medial prefrontal cortex deactivating with increasing memory load. The authors of this study offer a rationale for the deactivation of the medial prefrontal cortex that is consistent with the general resource account framework (Engle, Conway, Tuholsky, & Shisler, 1995).
In accordance with this framework, a shift of resources away from ongoing but inessential processes toward an increasingly demanding cognitive task might underlie the medial prefrontal deactivation. The amount of this shift might still be subject dependent, with a fixed ratio between activity increases and decreases, and result in a large covariance between areas that is detectable by a multivariate analysis technique. Because of the role of prefrontal limbic cortices (i.e., orbitofrontal and medial prefrontal cortices) in emotional processing (Barbas, 2000), the results suggest that a shifting balance during higher cognitive processing causes increasing activity in cortical cognitive areas and decreasing activity in limbic and paralimbic structures. Such reciprocal changes in brain activation associated with emotional and cognitive processing are also found in mood disorders such as depression (Mayberg et al., 1999), although with the relative sign reversed compared to our findings. In depressed patients, hyperactivity in limbic and paralimbic areas is accompanied by decreased activity in cortical areas, resulting in worse cognitive performance. Negative mood and high memory demand might thus be interpreted as opposite ends of a common continuum, which is reflected in sustained functional connectivity (i.e., a fixed correlative relationship between regional activations), resulting in the changing level of expression of one covariance pattern only. Although the functional connectivity captured in the above activation pattern appears to be sustained with increasing memory load in well-rested subjects, it appears to be disrupted in subjects who were sleep deprived for 48 hours. In other words, one effect of sleep deprivation on working memory—in the retention period—is to disrupt the particular memory processing that normally mediates letter retention at low to moderate memory loads. This conclusion is based on two OrT analytic results.
First, the fMRI data of 18 sleep-deprived subjects failed to produce significant positive ordinal trends when the OrT pattern that revealed positive ordinal trends in 40 well-rested subjects was forward-applied. Second, an independent OrT/CVA analysis of the 18 sleep-deprived subjects failed to produce an activation pattern that expressed significant ordinal trends. Apparently the effect of sleep deprivation is not simply to increase the load on the memory processes that normally operate in well-rested subjects at low to moderate load levels. The effect of sleep deprivation on working memory may be to induce nonadditive or nonmultiplicative load effects on memory processing, where the effects may differ across individuals. 5.2 H₂¹⁵O PET Study of Visuomotor Learning. We applied the OrT method to an H₂¹⁵O data set obtained in a study of visuomotor learning. These PET data had previously been analyzed using SPM99 with conventional voxel-by-voxel modeling (Krakauer et al., 2003). A statistically significant activation pattern was obtained using OrT, where many of the regions with reliable levels of activation were predicted a priori. These regions include several of those normally activated during the execution of
overlearned hand movements (i.e., the CONTROL task), which are not activated during sensory control tasks. In this regard, the OrT activation was similar to the pattern obtained in a spatially restricted SPM comparison of the gain adaptation task (i.e., the GAINalt task) and the CONTROL task. In the latter SPM analysis, a spatial mask was used to achieve statistically significant results. This mask delimited areas that routinely had been demonstrated to be activated during the execution of overlearned hand movements: the left primary sensorimotor cortex (BA 1, 2, 3, and 4), premotor cortex and SMA (BA 6), posterior cingulate (BA 23), parietal (BA 5, 7, 40), and visual areas (BA 17, 18), as well as the subcortical areas of the left putamen, globus pallidus, thalamus, and cerebellum. The latter SPM analysis was based on the expectation that the brain areas that mediate gain adaptation reside for the most part within the motor network that is responsible for the execution of overlearned hand movements. However, right prefrontal areas of activation, which are not normally activated during overlearned hand movements, were also part of the OrT activation pattern. Potentially this combination of prefrontal, basal ganglionic, and cerebellar activation sites represents a pattern of strongly functionally connected brain areas. Moreover, combined with the significant association between OrT pattern expression and subject rates of gain adaptation, the OrT finding raises the possibility that prefrontal regions may be involved in some aspect of gain adaptation. The necessary caveat is that the potential contribution of this discovery relies on a future series of more elaborate experimental investigations into the functional connectivity between prefrontal cortex, basal ganglia, and cerebellum during visuomotor adaptation. 5.3 The Invention of the OrT Design Matrix and Why the Matrix Works.
As one might imagine, the OrT design matrix was not invented through a random process of trial and error. Although there was no guarantee that a matrix multiplication approach (i.e., a CVA approach) would actually work, we sought to design a matrix that selectively enhanced the voxel × task × subject variance of patterns that expressed ordinal trends. It was necessary to specify explicitly the different types of latent patterns whose voxel × task × subject variance must be reduced in the transformed data set. Without that stipulation, the application of PCA or SVD to the transformed data set would not necessarily produce major principal components that provided a good approximation to one or more patterns that express ordinal trends. By design, the OrT analysis relies on the major principal components to provide good approximations to patterns that express ordinal trends. As suggested in section 1, the invention of the OrT design matrix required a thorough understanding of the possible similarities and differences between the voxel × task × subject variances of patterns that express ordinal trends and patterns that do not. On the one hand, there are the similarities and differences between patterns that express the predicted mean trend and
patterns that express mean directional changes that are different from the predicted trend. On the other hand, there are similarities and differences between patterns that express mean trends in the predicted direction but different types of task × subject interactions. Finally, there are similarities and differences between patterns that express ordinal trends in the predicted direction but exhibit different amounts of voxel × task × subject variance in the original data set. The OrT design matrix we invented reduced these similarities and differences to just three factors: task mean differences, within-task variances, and intertask correlations. To appreciate the importance of the novel third factor, consider a series of three experimental conditions, labeled B-E1-E2, in which the first two factors are identical in all the latent patterns of the data set, whereas the intertask correlations are different. Indeed, suppose it is ρ_it = [CORR(b, e1) + CORR(e1, e2)]/2 that distinguishes latent patterns that express ordinal trends from patterns that do not. In patterns that express ordinal trends, the (mean) intertask correlation will be moderately to highly positive—indeed, significantly more positive than the intertask correlations of patterns that do not express ordinal trends. A Monte Carlo sampling of 100,000 families of subject ordinal trends (across B-E1-E2 for 13 subjects) illustrates the statistical robustness of this identifying feature of OrT patterns (see Figure 6). In the OrT transformed data set, intertask correlations appear as an explicit term in the algebraic expression of the voxel × task × subject variance of individual latent patterns. For a simple example, consider the miniature data set described in section 2, which consisted of just two experimental conditions.
In this example, the contribution of each latent pattern to the 2 × 2 regional covariance matrix Y′[Q(Q′Q)^−1]Q′Y of the transformed data set is the variance VAR(b + e1) = VAR(b) + VAR(e1) + 2COV(b, e1) associated with the latent pattern. In OrT analyses involving a series of three or more tasks, the contribution of each latent pattern to the mean-centered covariance matrix also includes the additive factor of the squared mean differences between tasks. The difference in the COV(b, e1) values for the OrT and non-OrT patterns, which corresponds to the difference in their intertask correlations, was the unique factor that distinguished the two covariance patterns. In fact, for the OrT pattern in the miniature data set, COV(b, e1) was a large positive value contributing additively to the overall variance of the OrT pattern in [Q(Q′Q)^−1/2]′Y, whereas COV(b, e1) was zero for the non-OrT pattern. The intertask correlations depicted in the Monte Carlo simulations of Figure 6 and in the miniature data set are a feature of all covariance matrices of OrT transformed data sets, regardless of the number of experimental conditions in the parametric series. The recognition that intertask correlations were a key feature of OrT patterns meant that our search for an optimal design matrix for OrT/CVA was limited to high-dimensional design matrices: namely, T·N × (T − 1)·N
Figure 6: Correlational statistics for a Monte Carlo simulation of random samples of monotonic curves, computed for task triplets (ordered B-E1-E2) and 13 subjects. (x-axis: intertask correlation r; y-axis: cumulative frequency.) Specifically, the cumulative distribution functions are depicted for (A) the maximum of CORR(b, e1) and CORR(e1, e2); (B) the difference between this maximum value and CORR(b, e2); (C) CORR(b, e2); and (D) the difference between the minimum of the pair CORR(b, e1) and CORR(e1, e2), and CORR(b, e2). Vertical lines indicate median and 5 percent values for individual cumulative distribution functions.
matrices. Our strategy for testing candidate matrices was to investigate simulated data sets in which low-dimensional design matrices failed in a significant way to recover latent patterns with ordinal trends. The latter design matrices are those frequently used in voxel-wise univariate or multivariate analyses or in PLS analyses. Our test of feasibility was that the OrT design matrix Q did not fail in these worst-case scenarios. (See the appendix for details.) 5.4 Conclusion. The OrT analyses of the event-related fMRI and H₂¹⁵O PET studies revealed patterns that suggested the presence of sustained functional activity in the face of considerable subject variation in the trajectories of their ordinal trends. Moreover, in this article, we have argued that although it may be difficult to anticipate the impact that normal subject
variation has on regional functional connectivity, individual variation can be treated as an additional experimental dimension in neuroimaging studies rather than as unexplained phenotypic variation. The OrT approach is to model functional connectivity as interactions between experimental parameters and endogenous variables. Indeed, these interactions, like any other type of experimental effect, must be relatively large to be reliably measured. In other words, OrT performs best when applied to studies that achieve an optimal trade-off between study designs that include normal phenotypic variation and experimental designs that exert sufficient control over individual differences so as to achieve sustained functional connectivity. At the same time, OrT analysis takes full advantage of parametric experimental designs to maximize statistical specificity. In particular, the specificity of the analysis increases dramatically with the number of conditions in the series. The probability of a predicted ordinal trend occurring by chance in one subject alone equals (T!)^−1; the conjunction across subjects thus yields a probability of (T!)^−N, explaining how the specificity of the method increases with the number of tasks and subjects. Moreover, statistical specificity increases considerably more rapidly in the OrT analyses than it does when analyses are limited to T × (T − 1)/2 pair-wise comparison tests that are corrected for multiple comparisons. The OrT experimental strategy would appear to be appropriate for studying not only young adults and their cognitive abilities, but also normal aging (Cabeza et al., 1997; Grady et al., 2003; Stern et al., in press) and neurological and psychiatric diseases (Alexander et al., 1999; Eidelberg et al., 1998; Fukuda et al., 2001). On the other hand, as we pointed out earlier, voxel-seeded PLS analyses have been particularly successful in detecting altered functional connectivity, which is also of considerable scientific value.
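The chance-probability argument above can be made concrete: a single subject's T task means fall in the one predicted order with probability 1/T!, so N subjects doing so jointly has probability (T!)^−N under independence. A small sketch (the function name is ours):

```python
from math import factorial

# Sketch of the specificity argument: under the null hypothesis, one
# subject's T task means land in the single predicted order with
# probability 1/T!, so the conjunction across N independent subjects
# has chance probability (T!)**(-N).

def chance_ordinal_trend(T, N):
    return factorial(T) ** (-N)

# With T = 3 conditions and N = 10 subjects (as in the PET study):
p = chance_ordinal_trend(3, 10)
print(p)  # 6**-10, about 1.65e-08
```

Raising T from 3 to 4 shrinks the per-subject probability from 1/6 to 1/24, which is why specificity grows so quickly with the length of the parametric series.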
Indeed, some neuroscientists (e.g., Kosslyn et al., 2002) would argue that individual differences in cognitive style would persist even in an ideal world without the practical limitations of experimental design, imaging technologies, and statistical sampling. Overall, OrT and voxel-seeded PLS together are useful for verifying the robustness of our cognitive theories in the face of factors whose influence on mental activity we do not yet know much about. For all the above reasons, both OrT and PLS analyses may take a prominent place in the toolkit of the multivariate modeler of brain imaging data.
Appendix: Derivation and Validation of the OrT Design Matrix

It is difficult to establish whether a design matrix achieves a salience enhancement of the effects of interest. Doing so supposes that the very knowledge sought after—knowledge of the underlying covariance structure of the
data constituted by targeted and nontargeted activation patterns—is known beforehand, which is impossible. Monte Carlo simulations, on the other hand, afford a model instantiation of a data set with precise knowledge of the targeted and nontargeted activation patterns prior to the analysis, allowing a thorough assessment of how well different design matrix and analytic method choices identify the targeted activation patterns. This is not to claim that Monte Carlo simulations are an adequate substitute for the complexity of real neuroimaging data, but if these simple scenarios show deviations from the performance characteristics anticipated for routinely chosen design matrices, there are grounds for suspicion that in the case of real-world neuroimaging data the problems would be compounded. Monte Carlo simulations therefore present a test bed for verifying (and, where necessary, correcting) our intuitions about the performance of design matrix choices.
A.1 Unique Target Features. The objective of the OrT design matrix is to assign maximum salience to monotonic task-activity curves. One unique feature is that a family of N monotonic curves, randomly sampled from the set of all possible tripoint monotonic curves, nearly always exhibits positive intertask correlations between the subject levels of pattern expression. Second, of the three correlations among tasks, the largest is between consecutively ordered tasks, thereby identifying the third task as an end point of the ordering. Indeed, in a majority of cases, the two correlations between consecutively ordered tasks are greater than the correlation between nonconsecutively ordered tasks, thereby identifying both end points of the ordering. An elementary model of random between-task changes in pattern expression illustrates the stochastic features of a random sample of monotonic curves. For a task ordering B-E1-E2, the targeted pattern's activity levels in tasks E1 and E2, denoted e1 and e2, respectively, are constructed from sums of independently and identically distributed random variables b, E1 − B, and E2 − E1, where each is a positive-valued random variable sampled from the uniform distribution U(0, 1): e1 is distributed as b + (E1 − B), and e2 as b + (E1 − B) + (E2 − E1). These assumptions guarantee monotonic curves with as few constraints and as much generality as possible. With increasing parametric load as specified by the task conditions, subjects increase their expression of the targeted topography, although the amount of the increase is independent of the current level of expression. As a consequence, the expected value of all three intertask correlations is positive, and in the majority of cases it holds that
min{CORR(b, e1), CORR(e1, e2)} > CORR(b, e2).
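This inequality can be checked numerically for the elementary model just described (b and the two increments i.i.d. U(0, 1)). The sketch below, with illustrative sample counts of our choosing, estimates how often the inequality holds across random families of 13 subjects' curves:

```python
import numpy as np

# Hedged sketch of the elementary model above: b and the increments d1, d2
# are i.i.d. U(0,1), so e1 = b + d1 and e2 = b + d1 + d2. We estimate how
# often min{CORR(b, e1), CORR(e1, e2)} exceeds CORR(b, e2) across random
# families of 13 subjects' monotonic curves (the sample size in the text).

rng = np.random.default_rng(0)
n_subjects, n_samples = 13, 5000
wins = 0
for _ in range(n_samples):
    b = rng.uniform(0, 1, n_subjects)
    d1 = rng.uniform(0, 1, n_subjects)
    d2 = rng.uniform(0, 1, n_subjects)
    e1, e2 = b + d1, b + d1 + d2
    c = np.corrcoef(np.vstack([b, e1, e2]))  # 3 x 3 correlation matrix
    if min(c[0, 1], c[1, 2]) > c[0, 2]:
        wins += 1

print(wins / n_samples)  # fraction of samples satisfying the inequality
```

The population correlations in this model are CORR(b, e1) = 1/√2 and CORR(e1, e2) = √(2/3), both larger than CORR(b, e2) = 1/√3, so a majority of finite samples is expected to satisfy the inequality.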
Figure 6 charts the results of a Monte Carlo simulation of 100,000 random samples of tripoint monotonic curves in which the correlations between consecutively ordered tasks are compared, sample by sample, to the correlation between nonconsecutively ordered tasks. The nonnegativity of the correlational statistics for the cumulative distribution functions depicted in Figures 6A, 6B, and 6C is characteristic of almost all randomly sampled families of monotonic curves, even when b, E1 − B, and E2 − E1 are not identically distributed. When comparing the factors that determine pattern salience in the original data Y with the factors that determine pattern salience for the OrT-transformed data [Q(Q′Q)^−1/2]′YP in the OrT analysis, the weight given to the nonnegativity of intertask correlations is immediately evident. In the untransformed data sets, the factors that determine salience are the spatial extent of the pattern activation, the size of the mean trend effect, and the subject variance in within-task expression. Indeed, from the perspective of neuromodeling, we consider the worst-case scenario to be the one in which every nontarget pattern equals the target in each of these factors. By comparison, in the OrT-transformed data matrix [Q(Q′Q)^−1/2]′YP, there are four features of pattern expression that determine pattern salience: the spatial extent of the pattern activation, the size of the mean trend (i.e., the squared difference between the means of the summed pattern expressions b + e1 and e1 + e2), the subject variance in within-task expression, and the intertask correlations between subjects' pattern expression for tasks B and E1 and tasks E1 and E2. In a comparison of the two lists of determinants of pattern salience, the principal difference is the increased salience conferred by positive intertask correlations.
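The role of the intertask-correlation term can be sketched numerically for the two-condition miniature example: two patterns with equal within-task variances, one with positively correlated b and e1 (the OrT pattern) and one with independent b and e1, differ in salience VAR(b + e1) by exactly the 2 COV(b, e1) term. The construction below is ours, not the article's:

```python
import numpy as np

# Hedged numerical sketch of the two-condition miniature example: two
# latent patterns with identical within-task variances, differing only in
# the correlation between their baseline (b) and activation (e1)
# expressions. The OrT-like pattern's salience VAR(b + e1) gains the
# extra 2*COV(b, e1) term.

rng = np.random.default_rng(42)
n = 100000  # large sample so empirical moments are stable

b = rng.standard_normal(n)
noise = rng.standard_normal(n)
e1_ort = 0.8 * b + 0.6 * noise        # correlated with b, unit variance
e1_non = rng.standard_normal(n)       # independent of b, unit variance

var_ort = np.var(b + e1_ort)          # ~ 1 + 1 + 2*0.8 = 3.6
var_non = np.var(b + e1_non)          # ~ 1 + 1 + 0     = 2.0
print(round(var_ort, 1), round(var_non, 1))
```

With identical marginal variances, only the covariance term separates the two patterns' contributions to the transformed covariance matrix, which is the mechanism the appendix exploits.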
A.2 Monte Carlo Simulations.

A.2.1 Data Set Design. The worst-case scenarios that are reported here are those in which the salience of every nontarget is equal to that of the target in the untransformed data Y. More specifically, each nontarget activation pattern is comparable to the target pattern in terms of the spatial extent of the pattern activation, the size of the mean trend effects in pattern expression, and the subject variance in within-task expression. The similarities between the targeted and nontargeted component processes have been further augmented. First, there is complete overlap in the voxels activated by the targeted and nontargeted component processes. Second, one or more nontargeted activation patterns exhibit a mean task trend identical to that expressed by the target pattern, although they differ from the target in that they do not express positive intertask correlations. Third, other nontargeted activation patterns express positive intertask correlations, although they differ from the target in the direction of their mean task trends. In the Monte Carlo simulations reported here, individual data sets consist of three task conditions in which the activation patterns of seven component
Ordinal Trend Canonical Variates Analysis
1637
processes are superimposed. The raw data matrix Y can thus be denoted as

Y = \sum_{k=1}^{7} \begin{pmatrix} b \\ e_1 \\ e_2 \end{pmatrix}_k z_k'.
The index k here denotes the different topographies, and z_k therefore is a column vector with a number of rows equal to the number of voxels or regional resolution elements. Even without application of the projection operator P that removes the task-independent effects, one can appreciate the salience enhancement achieved through the application of the OrT design matrix Q by computing the voxel × voxel covariance matrix of Q′Y,

Y' Q (Q' Q)^{-1} Q' Y \approx \sum_{k=1}^{7} \left\langle b^2 + 2 e_1^2 + e_2^2 + 2\, b \cdot e_1 + 2\, e_1 \cdot e_2 \right\rangle_k z_k z_k'.
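A minimal simulation of a data matrix with this rank-seven structure can be sketched as follows (the way the monotone and independent expressions are drawn here is a simplification of the full design described in the text; all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n_subjects, n_voxels = 13, 500

def monotone_expression():
    # 39-vector of subject expressions, stacked task-wise (B, E1, E2).
    b = rng.uniform(0, 1, n_subjects)
    e1 = b + rng.uniform(0, 1, n_subjects)
    e2 = e1 + rng.uniform(0, 1, n_subjects)
    return np.concatenate([b, e1, e2])

def independent_expression():
    # Shadow-like pattern: task expressions drawn independently.
    return rng.uniform(0, 1, 3 * n_subjects)

# Y as a superposition of 7 rank-one components u_k z_k':
# here 3 monotone patterns and 4 with independent expressions.
Y = np.zeros((3 * n_subjects, n_voxels))
for k in range(7):
    u = monotone_expression() if k < 3 else independent_expression()
    z = rng.uniform(0, 1, n_voxels)      # nonfocal voxel weights, U(0,1)
    Y += np.outer(u, z)

print(Y.shape)                           # (39, 500)
print(np.linalg.matrix_rank(Y))          # at most 7
```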
The shadow processes essentially have zero intertask covariance in their subject expression. The ordinal trend effects for the alternative task orderings B-E2-E1 and E1-B-E2 still have some residual intertask covariances, but as shown in Figure 6, in most cases these are lower than for the ordinal trend effects of the ordering B-E1-E2. All seven patterns are nonfocal activations, in which the relative distribution function of voxel weights is matched across the different component processes. In each data set, voxel weights are sampled from the uniform distribution U(0,1) and independently distributed in different component processes. The number of voxels used in these simulated data sets is 500, approximating the number of resels normally contained in smoothed PET and fMRI images. In this Monte Carlo simulation, data sets consisted of 39 task × subject scans, or three scans for each of 13 subjects. No voxel noise was overlaid on these composite patterns of component activations. Taken as a whole, each composite activation pattern of a data set may be thought of as representing the footprint of the larger functional brain architecture that is common to all subjects and tasks. Figure 7 shows some representative task-activity curves for the subject expression of all the activation patterns used in the simulations. Among the seven activation patterns in a data set, there was a target for each of the three task orderings: an activation pattern that expressed monotonic task-activity curves for all subjects for all possible task orderings B-E1-E2, B-E2-E1, and E1-B-E2. (The end points in a task ordering are interchangeable since monotonically increasing and decreasing subject expressions are equivalent; therefore, there are three rather than six different combinations.) The effect size of the mean trend was the same in all three of these activation patterns. The remaining four activation patterns were nontargets with
[Figure 7 panels: column (A), targets for the orderings B/E1/E2, B/E2/E1, and E1/B/E2; column (B), the corresponding shadow processes; horizontal axes: tasks B, E1, E2.]
Figure 7: Representative illustration of the task-activity curves in the subject expression of six different types of activation patterns used for the Monte Carlo simulations. Column (A) shows processes that display ordinal trends for three different task orderings: B-E1-E2, B-E2-E1, and E1-B-E2. Column (B) shows the corresponding shadow processes: activation patterns whose subject expressions show the same monotonic increase in activity on the mean and have the same within-task variances, while allowing a substantial number of subjects to violate the requirement of monotonicity.
intertask correlations of zero, although the direction of a pattern’s mean trend was the same as that in one of the former three activation patterns. For this reason, we refer to each of the latter four activation patterns as a nontargeted shadow process. The elementary model of random between-task changes in pattern expression described in section A.1 was used to generate the family of monotonic task-activity curves for the former three activation patterns. The three sets of three model random variables were identically and independently
distributed. Task expressions in the remaining four nontarget activation patterns were independently distributed as well. One hundred thousand simulations were performed to sample the different possible families of monotonic task-activity curves adequately. In the simulations reported here, we used the uniform distribution as the sampling distribution for within-task pattern expressions. However, simulations were also run using a normal sampling distribution, and practically identical results were obtained.

A.2.2 OrT Performance. Performance assessment of the OrT analysis is based on the degree to which this CVA has enhanced the salience of the targeted activation pattern. Enhanced salience is a relative judgment based on the comparison of the similarity between the activation pattern of the simulated target and the major singular images of OrT/CVA versus the similarity between the target and the major eigen images of the untransformed data YP. The indices of similarity are, respectively, the accuracy with which the first four singular images of OrT/CVA predict the targeted activation pattern in a multiple linear regression analysis and the accuracy with which the first four eigen images of YP predict the target pattern. The respective multiple linear regression coefficients (R2) are used as a goodness-of-fit measure for individual data sets. The same task ordering was used to specify the targeted pattern in all data sets. To avoid confusion with the earlier examples of OrT/CVA, we stress again that no inferential assessment was conducted in these simulations; neither was the ordinal trend evaluated in terms of statistical significance, nor was a bootstrap estimation of the robustness of voxel weights performed. The R2 statistic merely captures how much information about the target is contained in the first few principal components.
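The R2 goodness-of-fit computation can be sketched as follows (random stand-ins replace the simulated data and target pattern; `n_images` and the other names are our own):

```python
import numpy as np

rng = np.random.default_rng(2)
n_scans, n_voxels = 39, 500

# Illustrative stand-ins for the (transformed) data matrix and the target.
data = rng.standard_normal((n_scans, n_voxels))
target = rng.uniform(0, 1, n_voxels)

def r_squared(data, target, n_images=4):
    # Rows of Vt are the singular images (voxel-space singular vectors).
    _, _, Vt = np.linalg.svd(data, full_matrices=False)
    X = np.column_stack([np.ones(len(target)), Vt[:n_images].T])
    beta, *_ = np.linalg.lstsq(X, target, rcond=None)
    resid = target - X @ beta
    return 1.0 - resid @ resid / np.sum((target - target.mean()) ** 2)

r2 = r_squared(data, target)
assert 0.0 <= r2 <= 1.0

# Sanity check: a target lying in the span of the first singular images
# is predicted perfectly.
_, _, Vt = np.linalg.svd(data, full_matrices=False)
assert abs(r_squared(data, Vt[0] + 0.5 * Vt[2]) - 1.0) < 1e-8
```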
This involves a multiple linear regression that presumes perfect knowledge of the targeted activation pattern, something that is impossible in a real-world neuroimaging context. R2 thus encodes the upper limit of what is maximally knowable about the target for each design matrix tested. The cumulative distribution functions (CDFs) of the R2 values were tabulated for the entire Monte Carlo simulation for both OrT/CVA and YP. The degree to which OrT/CVA provides a better R2 CDF than YP is the benchmark used to quantify the OrT/CVA enhancement of target salience. We also include a comparison of the OrT CDF with the CDFs associated with two other CVA design matrices that do not assign positive weight to intertask correlations. We used the Helmert design matrix (i.e., performed the SVD on [H(H′H)−1/2]′Y), as well as the design matrix of mean trends (McIntosh et al., 1996), M. M is a 3N × 2 matrix containing as predictors the mean Helmert contrasts according to
M = \begin{pmatrix} -1 & 1 \\ 1 & 1 \\ 0 & -2 \end{pmatrix}.
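Assuming the contrast block above is stacked task-wise for each of the N subjects (the stacking order is our assumption), the mean-trend design-matrix step can be sketched as:

```python
import numpy as np

rng = np.random.default_rng(3)
N = 13  # subjects

block = np.array([[-1.0,  1.0],
                  [ 1.0,  1.0],
                  [ 0.0, -2.0]])
M = np.repeat(block, N, axis=0)          # 3N x 2 design matrix

# Column-whiten: W = M (M'M)^{-1/2}, via a symmetric inverse square root.
w, V = np.linalg.eigh(M.T @ M)
W = M @ (V @ np.diag(1.0 / np.sqrt(w)) @ V.T)
assert np.allclose(W.T @ W, np.eye(2))   # columns are now orthonormal

Y = rng.standard_normal((3 * N, 500))    # stand-in for the data matrix
_, _, Vt = np.linalg.svd(W.T @ Y, full_matrices=False)
print(Vt.shape)                          # the two singular images: (2, 500)
```

Because the two contrast columns are mutually orthogonal, the whitening here only rescales them; the SVD of the 2 × voxels matrix W′Y then yields exactly two singular images.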
The matrix [M(M′M)−1/2]′Y is then submitted to SVD, yielding two singular images. These comparisons afford an independent verification that the intertask correlations are essential to maximizing the salience of activation patterns that express monotonic task-activity curves. As discussed before, the Helmert design matrix assigns pattern salience to activation patterns in a manner that weights intertask correlations negatively. Mean-trend CVA does not capture subject differences at all and can therefore be expected to perform less well than OrT/CVA. For the Monte Carlo simulations of the worst-case scenario, we therefore anticipated that the CVA of mean trend effects as well as the CVA with the Helmert design matrix would yield a target recovery that is not only inferior to that of OrT/CVA but also inferior to that of an analysis of the minimally transformed data YP (demonstrating the unfortunate impact of an ill-chosen design matrix). Figure 8 presents the results of two Monte Carlo simulations of (1) the worst-case scenario, in which the mean trends of the four nontargeted shadow processes are equal in size to the mean trends of the other three activation patterns, and (2) a less severe scenario, in which the mean trends of the shadow processes are zero. For each of the two simulations, the three CVA design matrices were applied to YP for each of 100,000 data sets Y. For the OrT and Helmert CVAs, the first four singular images were used to predict the simulated target pattern of regional activation in a multiple linear regression analysis. For the CVA mean trend analysis, its two singular images were used in the regression analysis. The cumulative distribution function of the regression R2 was tabulated over the 100,000 data sets for each CVA method. These R2 CDFs were compared with the R2 CDF computed using the first four eigen images of YP. The findings for the CVA mean trend analysis and the Helmert CVA are presented in Figures 8A and 8C.
Figure 8A depicts the findings for the worst-case scenario. The R2 CDF of the Helmert CVA is shifted to the left of the CDF of YP. The relative positions of these two CDFs indicate that the targeted activation pattern actually has diminished salience in the Helmert CVA approach. The R2 CDF of the mean trend CVA is also shifted to the left of the YP CDF, albeit slightly more so than the CDF of the Helmert CVA. A similar ordering of the R2 CDFs was obtained for data sets in which the shadow processes exhibited no mean task differences. In Figure 8C, the Helmert R2 CDF is shifted slightly to the left of the CDF of the mean trend CVA, which is in turn shifted to the left of the CDF for YP. OrT/CVA produced better results, which are depicted in Figures 8B and 8D. In Figure 8B, the R2 CDFs are computed for data sets in which the mean trends of the nontargeted shadow processes are equal in size to the mean trends of the remaining three activation patterns. The median R2 value of the OrT/CVA was 0.72, indicating that for half of the simulated data sets, 72% or more of the variance of the target's regional pattern weights was
[Figure 8 plot panels: cumulative frequency versus R2 CDFs for Q (OrT), I (YP), H (Helmert), and M (mean trend); panel titles: Matching mean trend, No mean trend.]
Figure 8: OrT sensitivity to ordinal trends relative to that of a PCA on the reduced data matrix YP and two other CVAs (Helmert CVA and mean trend CVA). Target sensitivity is charted as CDFs of the R2 regression statistic (Q, OrT; I, YP; H, Helmert; M, mean trend) for two Monte Carlo simulations involving different shadow processes. (A) Four shadow processes with mean trends equal in size to that of the targeted component process. (B) Four shadow processes with no mean trend. Vertical lines indicate the median and 5 percent R2 values for individual CDFs.
accounted for by the first four singular images. Fewer than 5 percent of the data sets produced R2 values below 0.38. The OrT CDF is shifted to the right of the R2 CDF of YP, indicating that the targeted activation pattern has markedly enhanced salience in the OrT approach. In this comparison, there is a 30% improvement in the median R2 value and a 65% improvement in the R2 value below which fewer than 5 percent of the data sets produced poorer predictions. Compared to the Helmert CVA, the OrT/CVA produced
a 255% improvement in the median R2 value and a 270% improvement at the 5 percent R2 value. An enhancement of target salience by OrT/CVA was also achieved in the data sets in which the shadow processes exhibited no mean task differences, as depicted in Figure 8D. The median R2 value of OrT/CVA is 0.87, indicating that for half of the simulated data sets, 87% or more of the variance of the target's regional pattern weights was accounted for by the first four singular images. As before, the OrT CDF is shifted to the right of the R2 CDF of YP. In this comparison, there is a 10% improvement in the median R2 value and a 20% improvement in the R2 value below which fewer than 5 percent of the data sets produced poorer predictions. Finally, compared to the Helmert CVA, OrT/CVA produced a 45% improvement in the median R2 value and a 50% improvement at the 5 percent R2 value.

Acknowledgments

We thank Eric Zarahn and two anonymous reviewers for a critical reading of the manuscript for this article and many helpful questions and suggestions. This work was supported by federal grants NINDS R01 NS35069, NIA R01 AG16714, and NS02138.

References

Abbott, A. (2002). Alpine detector fails to confirm Italian sighting of dark matter. Nature, 417(6889), 575–576.

Alexander, G. E., Mentis, M. J., Van Horn, J. D., Grady, C. L., Berman, K. F., Furey, M. L., Pietrini, P., Rapoport, S. I., Schapiro, M. B., & Moeller, J. R. (1999). Individual differences in PET activation of object perception and attention systems predict face matching accuracy. Neuroreport, 10(9), 1965–1971.

Baddeley, A. (1988). Cognitive psychology and human memory. Trends Neurosci., 11(4), 176–181.

Baddeley, A. (2003). Working memory: Looking back and looking forward. Nat. Rev. Neurosci., 4(10), 829–839.

Barbas, H. (2000). Connections underlying the synthesis of cognition, memory, and emotion in primate prefrontal cortices. Brain Res. Bull., 52(5), 319–330.

Barbas, H., Ghashghaei, H., Dombrowski, S. M., & Rempel-Clower, N. L.
(1999). Medial prefrontal cortices are unified by common connections with superior temporal cortices and distinguished by input from memory-related areas in the rhesus monkey. J. Comp. Neurol., 410(3), 343–367.

Braver, T. S., Cohen, J. D., Nystrom, L. E., Jonides, J., Smith, E. E., & Noll, D. C. (1997). A parametric study of prefrontal cortex involvement in human working memory. Neuroimage, 5(1), 49–62.

Cabeza, R., Anderson, N. D., Houle, S., Mangels, J. A., & Nyberg, L. (2000). Age-related differences in neural activity during item and temporal-order memory
retrieval: A positron emission tomography study. J. Cogn. Neurosci., 12(1), 197–206.

Cabeza, R., McIntosh, A. R., Tulving, E., Nyberg, L., & Grady, C. L. (1997). Age-related differences in effective neural connectivity during encoding and recall. Neuroreport, 8(16), 3479–3483.

DeFelipe, J., Elston, G. N., Fujita, I., Fuster, J., Harrison, K. H., Hof, P. R., Kawaguchi, Y., Martin, K. A. C., Rockland, K. S., Thomson, A. M., Wang, S. H., White, E. L., & Yuste, R. (2002). Neocortical circuits: Evolutionary aspects and specificity versus nonspecificity of synaptic connections: Remarks, main conclusions and general comments and discussion. J. Neurocytol., 31(3–5), 387–416.

Efron, B., & Tibshirani, R. J. (1994). An introduction to the bootstrap. New York: CRC Press.

Eidelberg, D., Moeller, J. R., Antonini, A., Kazumata, K., Nakamura, T., Dhawan, V., Budman, C., & Feigin, A. (1998). Functional brain networks in DYT1 dystonia. Ann. Neurol., 44(3), 303–312.

Engle, R. W., Conway, A. R. A., Tuholski, S. W., & Shisler, R. J. (1995). A resource account of inhibition. Psychological Science, 6, 122–125.

Felleman, D. J., & Van Essen, D. C. (1991). Distributed hierarchical processing in the primate cerebral cortex. Cereb. Cortex, 1(1), 1–47.

Fernandez-Duque, D., Baird, J. A., & Posner, M. I. (2000a). Awareness and metacognition. Conscious Cogn., 9(2 Pt. 1), 324–326.

Fernandez-Duque, D., Baird, J. A., & Posner, M. I. (2000b). Executive attention and metacognitive regulation. Conscious Cogn., 9(2 Pt. 1), 288–307.

Friston, K. J., Fletcher, P., Josephs, O., Holmes, A., Rugg, M. D., & Turner, R. (1998). Event-related fMRI: Characterizing differential responses. Neuroimage, 7(1), 30–40.

Friston, K. J., Frith, C. D., Liddle, P. F., & Frackowiak, R. S. (1991). Comparing functional (PET) images: The assessment of significant change. J. Cereb. Blood Flow Metab., 11(4), 690–699.

Friston, K. J., Frith, C. D., Liddle, P. F., & Frackowiak, R. S. (1993).
Functional connectivity: The principal-component analysis of large (PET) data sets. J. Cereb. Blood Flow Metab., 13(1), 5–14.

Friston, K. J., Holmes, A. P., Worsley, K. J., Poline, J. P., Frith, C. D., & Frackowiak, R. S. J. (1996). Statistical parametric maps in functional imaging: A general linear approach. Human Brain Mapping, 4(1), 58–73.

Friston, K. J., Josephs, O., Zarahn, E., Holmes, A. P., Rouquette, S., & Poline, J. (2000). To smooth or not to smooth? Bias and efficiency in fMRI time-series analysis. Neuroimage, 12(2), 196–208.

Fukuda, M., Mentis, M. J., Ma, Y., Dhawan, V., Antonini, A., Lang, A. E., Lozano, A. M., Hammerstad, J., Lyons, K., Koller, W. C., Moeller, J. R., & Eidelberg, D. (2001). Networks mediating the clinical effects of pallidal brain stimulation for Parkinson's disease: A PET study of resting-state glucose metabolism. Brain, 124(Pt. 8), 1601–1609.

Gerstein, G. L., Perkel, D. H., & Subramanian, K. N. (1978). Identification of functionally related neural assemblies. Brain Res., 140(1), 43–62.
Ghilardi, M., Ghez, C., Dhawan, V., Moeller, J., Mentis, M., Nakamura, T., Antonini, A., & Eidelberg, D. (2000). Patterns of regional brain activation associated with different forms of motor learning. Brain Res., 871(1), 127–145.

Grady, C. L., McIntosh, A. R., & Craik, F. I. (2003). Age-related differences in the functional connectivity of the hippocampus during memory encoding. Hippocampus, 13(5), 572–586.

Habeck, C., Rakitin, B. C., Moeller, J., Scarmeas, N., Zarahn, E., Brown, T., & Stern, Y. (2004). An event-related fMRI study of the neurobehavioral impact of sleep deprivation on performance of a delayed-match-to-sample task. Brain Res. Cogn. Brain Res., 18(3), 306–321.

Horwitz, B. (1991). Functional interactions in the brain: Use of correlations between regional metabolic rates. J. Cereb. Blood Flow Metab., 11(2), A114–120.

Kosslyn, S. M., Cacioppo, J. T., Davidson, R. J., Hugdahl, K., Lovallo, W. R., Spiegel, D., & Rose, R. (2002). Bridging psychology and biology: The analysis of individuals in groups. Am. Psychol., 57(5), 341–351.

Krakauer, J. W., Ghilardi, M. F., Mentis, M., Barnes, A., Veytsman, M., Eidelberg, D., & Ghez, C. (2003). Differential cortical and subcortical activations in the learning of rotations and gains for reaching: A PET study. J. Neurophysiol., 91(2), 924–933.

Mayberg, H. S., Liotti, M., Brannan, S. K., McGinnis, S., Mahurin, R. K., Jerabek, P. A., Silva, J. A., Tekell, J. L., Lancaster, J. L., & Fox, P. T. (1999). Reciprocal limbic-cortical function and negative mood: Converging PET findings in depression and normal sadness. Am. J. Psychiatry, 156(5), 675–682.

McIntosh, A. R. (1999). Mapping cognition to the brain through neural interactions. Memory, 7(5–6), 523–548.

McIntosh, A. R., Bookstein, F. L., Haxby, J. V., & Grady, C. L. (1996). Spatial pattern analysis of functional brain images using partial least squares. Neuroimage, 3(3 Pt. 1), 143–157.

McIntosh, A. R., & Gonzalez-Lima, F. (1994).
Network interactions among limbic cortices, basal forebrain, and cerebellum differentiate a tone conditioned as a Pavlovian excitor or inhibitor: Fluorodeoxyglucose mapping and covariance structural modeling. J. Neurophysiol., 72(4), 1717–1733.

McIntosh, A. R., Rajah, M. N., & Lobaugh, N. J. (1999). Interactions of prefrontal cortex in relation to awareness in sensory learning. Science, 284(5419), 1531–1533.

Mellet, E., Tzourio-Mazoyer, N., Bricogne, S., Mazoyer, B., Kosslyn, S. M., & Denis, M. (2000). Functional anatomy of high-resolution visual mental imagery. J. Cogn. Neurosci., 12(1), 98–109.

Naatanen, R., Tervaniemi, M., Sussman, E., Paavilainen, P., & Winkler, I. (2001). "Primitive intelligence" in the auditory cortex. Trends Neurosci., 24(5), 283–288.

Nakamura, T., Ghilardi, M. F., Mentis, M., Dhawan, V., Fukuda, M., Hacking, A., Moeller, J. R., Ghez, C., & Eidelberg, D. (2001). Functional networks in motor sequence learning: Abnormal topographies in Parkinson's disease. Hum. Brain Mapp., 12(1), 42–60.

Ostriker, J. P., & Steinhardt, P. (2003). New light on dark matter. Science, 300(5627), 1909–1913.

Petersson, K. M., Nichols, T. E., Poline, J. B., & Holmes, A. P. (1999). Statistical limitations in functional neuroimaging. I. Non-inferential methods and statistical models. Philos. Trans. R. Soc. Lond. B Biol. Sci., 354(1387), 1239–1260.
Pochon, J. B., Levy, R., Fossati, P., Lehericy, S., Poline, J. B., Pillon, B., Le Bihan, D., & Dubois, B. (2002). The neural system that bridges reward and cognition in humans: An fMRI study. Proc. Natl. Acad. Sci. USA, 99(8), 5669–5674.

Stern, Y., Habeck, C., Moeller, J. R., Scarmeas, N., Anderson, K. E., Hilton, H. J., Flynn, J., Sackeim, H., & van Heertum, R. (In press). Brain networks associated with cognitive reserve in healthy young and old adults. Cereb. Cortex.

Sternberg, S. (1966). High-speed scanning in human memory. Science, 153(3736), 652–654.

Sternberg, S. (1969). Memory-scanning: Mental processes revealed by reaction-time experiments. Am. Sci., 57(4), 421–457.

Ungerleider, L. G., Doyon, J., & Karni, A. (2002). Imaging brain plasticity during motor skill learning. Neurobiol. Learn. Mem., 78(3), 553–564.

Venables, W. N., & Ripley, B. D. (1999). Modern applied statistics with S-PLUS (3rd ed.). New York: Springer-Verlag.

Worsley, K. J., Poline, J. B., Friston, K. J., & Evans, A. C. (1997). Characterizing the response of PET and fMRI data using multivariate linear models. Neuroimage, 6(4), 305–319.

Zarahn, E. (2000). Testing for neural responses during temporal components of trials with BOLD fMRI. Neuroimage, 11(6 Pt. 1), 783–796.
Received October 13, 2003; accepted December 6, 2004.
LETTER
Communicated by Michael Carter
Investigating the Fault Tolerance of Neural Networks Elko B. Tchernev
[email protected] Rory G. Mulvaney
[email protected] Dhananjay S. Phatak
[email protected] Computer Science and Electrical Engineering Department, University of Maryland Baltimore County, Baltimore, MD 21250, U.S.A.
Particular levels of partial fault tolerance (PFT) in feedforward artificial neural networks of a given size can be obtained by redundancy (replicating a smaller normally trained network), by design (training specifically to increase PFT), and by a combination of the two (replicating a smaller PFT-trained network). This letter investigates the method of achieving the highest PFT per network size (total number of units and connections) for classification problems. It concludes that for nontoy problems, there exists a normally trained network of optimal size that produces the smallest fully fault-tolerant network when replicated. In addition, it shows that for particular network sizes, the best level of PFT is achieved by training a network of that size for fault tolerance. The results and discussion demonstrate how the outcome depends on the levels of saturation of the network nodes when classifying data points. With simple training tasks, where the complexity of the problem and the size of the network are well within the ability of the training method, the hidden-layer nodes operate close to their saturation points, and classification is clean. Under such circumstances, replicating the smallest normally trained correct network yields the highest PFT for any given network size. For hard training tasks (difficult classification problems or network sizes close to the minimum), normal training obtains networks that do not operate close to their saturation points, and outputs are not as close to their targets. In this case, training a larger network for fault tolerance yields better PFT than replicating a smaller, normally trained network. However, since fault-tolerant training on its own produces networks that operate closer to their linear areas than normal training, replicating normally trained networks ultimately leads to better PFT than replicating fault-tolerant networks of the same initial size.
Neural Computation 17, 1646–1664 (2005)
© 2005 Massachusetts Institute of Technology
Investigating the Fault Tolerance of Neural Networks
1647
1 Introduction

Fault tolerance of feedforward artificial neural networks (ANNs) has been investigated by many researchers (a sampling can be found in Sequin & Clay, 1990; Petsche & Dickinson, 1990; Segee & Carter, 1991, 1994; Neti, Schneider, & Young, 1992; Phatak & Koren, 1992, 1995; Protzel, Palumbo, & Arras, 1993; Murray & Edwards, 1994; Bishop, 1995; Phatak, 1999). Our prior work (Phatak, 1994; Phatak & Koren, 1992, 1995) related fault tolerance to the amount of redundancy required to achieve it. It was demonstrated that for standard or typical artificial neural network (ANN) architectures, wherein the units (or neurons) perform a weighted sum of inputs from other units followed by a monotonic nonlinear squashing to generate their outputs, less than TMR (triple modular redundancy) is not sufficient to achieve complete fault tolerance. A simple method of replicating a seed network was shown to be one possible way of enhancing the fault tolerance of ANNs. This method requires a large amount of redundancy even if the size of the seed network (which gets replicated) is kept minimal. A better alternative might be to use enhanced training algorithms that target fault tolerance. This was analyzed in Phatak (1994), which revealed that using permuted samples or cross validation during training does not lead to a discernible partial fault tolerance (PFT) enhancement. On the other hand, providing initial redundancy and modifying the training algorithm to utilize the extra parameters does improve the PFT to some extent. However, the brute-force method of replication proposed in Phatak and Koren (1992, 1995) seems to achieve a higher PFT for the same level of redundancy (as compared with fault-tolerant gradient descent training), which was a counterintuitive result. A later work (Tchernev, Mulvaney, & Phatak, 2004) analyzed the exact replication factors needed to obtain perfect fault tolerance for the n−k−n encoder/decoder problem.
It was shown that the traditional n − log2 n − n architecture, with larger initial size but operating closer to saturation, achieves fault tolerance at a significantly smaller final size than the minimal n − 2 − n architecture. In this letter, we experimentally investigate the relationships between network size, initial PFT, and replicated PFT, and look for optimal-sized seed networks that yield the highest PFT.
2 Experiments

The problems being investigated are the two-class six-area problem, the two spirals problem, and the vowel problem. The two-class six-area problem (see Figure 1) is a specially constructed classification problem with a known minimal size for its solving network: a single hidden layer of three nodes. The two spirals problem (training and testing data sets together) is shown in Figure 4, and the vowel problem is taken from the UCI Machine Learning
1648
E. Tchernev, R. Mulvaney, and D. Phatak
repository (Blake & Merz, 1998). These are problems ranging from very easy (the two-class six-area problem) through moderate (the two spirals problem) to hard (the vowel problem).

3 Method

For all problems, a strictly layered structure is assumed (all units in a layer connect only to the units in the next adjacent layer, and no layer-skipping connections to other units are present). Networks with various numbers of hidden units are trained with normal gradient descent as well as with a modified version that specifically targets enhanced fault tolerance. Each network is then replicated from 1 to 20 times, and the resulting PFT is evaluated for each of the replication sizes under the single-fault assumption (i.e., only a single link or unit can fail). Training for fault tolerance is tantamount to solving a constrained optimization problem, as discussed in Phatak (1994) and Phatak and Tchernev (2002). Instead of trying to solve it outright, as in Neti et al. (1992), the penalty function method is used. It is described in Luenberger (2003), and it incorporates the constraints into the objective function. The new objective function becomes

E = \sum_{p}\sum_{o} (t_{op} - y_{op})^2 + \alpha \sum_{h}\sum_{f}\sum_{p}\sum_{o} (t_{op} - y_{opfh})^2,

where t_{op} and y_{op} are the target and actual outputs of output unit o on pattern p, y_{opfh} is the actual output of output unit o on pattern p upon fault f of hidden unit h, and α is the weight of the extra terms relative to the normal terms. Note that there are now two sets of terms in this objective function E. The first summation includes the normal error terms. The second summation accrues the errors that arise due to faults and is the penalty function. When E is minimized, the error between target outputs and the network outputs under faults is also minimized. Thus, the addition of these terms forces the search algorithm to look for a solution with better fault tolerance. These terms in the objective function that encourage fault tolerance are henceforth referred to as extra terms.
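A sketch of this penalty-function objective for a tiny, untrained sigmoid network (the stuck-output fault values 0 and 1 follow the saturation limits of the logistic unit; all names and weights are illustrative, not the authors' trained networks):

```python
import numpy as np

rng = np.random.default_rng(4)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Tiny untrained 2-3-1 network; weights are illustrative only.
W_h = rng.standard_normal((3, 2)); b_h = rng.standard_normal(3)
W_o = rng.standard_normal((1, 3)); b_o = rng.standard_normal(1)

X = rng.standard_normal((8, 2))               # 8 input patterns
T = rng.integers(0, 2, (8, 1)).astype(float)  # targets t_op

def forward(X, stuck=None):
    # stuck = (hidden_index, value) injects a fault on one hidden unit.
    H = sigmoid(X @ W_h.T + b_h)
    if stuck is not None:
        h, v = stuck
        H = H.copy(); H[:, h] = v
    return sigmoid(H @ W_o.T + b_o)           # y_op (or y_opfh under a fault)

def objective(alpha):
    # Normal error terms over patterns p and outputs o ...
    E = np.sum((T - forward(X)) ** 2)
    # ... plus the penalty: errors under each fault f of each hidden unit h
    # (outputs stuck at the logistic saturation values 0 and 1).
    for h in range(3):
        for v in (0.0, 1.0):
            E += alpha * np.sum((T - forward(X, stuck=(h, v))) ** 2)
    return E

assert objective(0.0) == np.sum((T - forward(X)) ** 2)   # alpha = 0: normal error only
assert objective(0.3) >= objective(0.0)                  # penalty terms are nonnegative
```

Minimizing `objective` with gradient descent over the weights would then trade normal accuracy against robustness to the enumerated hidden-unit faults.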
The (nonnegative) parameter α is the weight of the penalty function (or the penalty) relative to the normal error terms, and typically α < 1. As elaborated in Phatak and Koren (1992, 1995), replication involves adding as many copies of the hidden nodes and all their links as the replication factor specifies and adjusting the output nodes' biases, while preserving the input and output node counts. For the links, three types of faults are considered: stuck at +MAX, at −MAX, and at 0, where MAX is the largest weight magnitude in the trained network. For each hidden unit, two types of faults are considered: its output stuck at +OutputMax and at −OutputMax. For a unit, considering a stuck-at-0 fault on the output is not necessary: if the other two faults can be tolerated, then a stuck-at-0 fault on the output of that unit can also be tolerated.
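The replication and fault-enumeration procedure can be sketched for a tiny threshold classifier as follows (a sketch under stated assumptions: bias faults are not injected, the fault-free outputs stand in for the training targets, and scaling the output bias by the replication factor is one plausible reading of "adjusting the output nodes' biases"):

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy seed network: 2 inputs, 3 tanh hidden units, 1 threshold output
# (illustrative weights, not trained).
W_h = rng.standard_normal((3, 2))
w_o = rng.standard_normal(3)
b_o = rng.standard_normal()

def replicate(W_h, w_o, b_o, r):
    # r copies of every hidden node and its links; the output bias is scaled
    # by r so the sign of the r-times-larger net input is preserved.
    return np.tile(W_h, (r, 1)), np.tile(w_o, r), r * b_o

def outputs(X, W_h, w_o, b_o, stuck_unit=None):
    H = np.tanh(X @ W_h.T)
    if stuck_unit is not None:            # hidden output stuck at +/-1
        h, v = stuck_unit
        H = H.copy(); H[:, h] = v
    return (H @ w_o + b_o > 0).astype(int)

def pft(X, W_h, w_o, b_o):
    correct = outputs(X, W_h, w_o, b_o)   # fault-free outputs as ground truth
    MAX = max(np.abs(W_h).max(), np.abs(w_o).max())
    n_ok = n_total = 0
    # Link faults: every weight stuck at +MAX, -MAX, and 0.
    params = [(W_h, i) for i in range(W_h.size)] + [(w_o, i) for i in range(w_o.size)]
    for arr, i in params:
        for v in (+MAX, -MAX, 0.0):
            old = arr.flat[i]; arr.flat[i] = v
            agree = outputs(X, W_h, w_o, b_o) == correct
            arr.flat[i] = old
            n_ok += agree.sum(); n_total += agree.size
    # Unit faults: every hidden output stuck at +1 and -1 (tanh saturation).
    for h in range(W_h.shape[0]):
        for v in (+1.0, -1.0):
            agree = outputs(X, W_h, w_o, b_o, stuck_unit=(h, v)) == correct
            n_ok += agree.sum(); n_total += agree.size
    # Total cases = #outputs x #patterns x (3 x #weights + 2 x #hidden units).
    assert n_total == 1 * len(X) * (3 * (W_h.size + w_o.size) + 2 * W_h.shape[0])
    return n_ok / n_total

X = rng.standard_normal((16, 2))
print("PFT, seed network:", pft(X, W_h, w_o, b_o))
print("PFT, replicated x3:", pft(X, *replicate(W_h, w_o, b_o, 3)))
```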
By definition, PFT is the fraction of outputs that remain correct when all single faults are considered. To generate this value, every weight and unit in the tested network is set stuck at each fault type. For each fault, all input-output patterns are applied, and the number of outputs that remain correct is accumulated. This total is divided by the total number of cases, which equals #outputs × #io patterns × (3 × #weights + 2 × #hidden units), where #X denotes the number of X. The fraction (PFT) is calculated for different numbers of replications and investigated as a function of the total number of hidden units in the resulting network. Note that the magnitude of an observed PFT change (and of the difference between pairs of PFTs) can be quite small. For example, for the vowel problem, with 10 inputs, 11 outputs, and 528 training patterns, one nonreplicated network with a single 32-hidden-unit layer (672 weights) can have a PFT step of 1 in 11,894,784 cases, which is about 8.4 × 10−8. This step can decrease with bigger seed networks, with replication, and when averaging over more than one network. For each training method and size of the seed network, sample networks are generated and trained, and for each, the PFT is evaluated for an increasing number of replications. Only the results from the networks that have 100% classification accuracy on the training data set are taken, averaged, and discussed in this article. (The reason is that by definition, PFT is a correctness measure, and only initially correct networks reach 100% PFT on replication.) The focus of the investigation is on the relationship between PFT and the final size of the network in nodes, since network size is both a measure and a constraint in terms of practical use. Replication, on the other hand, is computationally free compared to the cost of training a network.

4 Results

4.1 The Two-Class Six-Area Problem. The chosen network architecture has two inputs, one output, and one hidden layer with a variable number of nodes.
The labels on the graphs in Figure 1 correspond to the number of nodes in the hidden layer. Since this is an easy problem for backpropagation training, the smallest network size to be successfully trained with both normal and fault-tolerant training is the theoretically minimal one of three hidden nodes. The decision lines of these theoretical solution nodes can be seen in Figure 1 together with the data points.

4.1.1 Initial PFT. Table 1 shows the initial (prereplication) averaged PFTs and 95% confidence intervals of the networks that were successfully trained (out of 50 runs per size). For hidden node counts of 3, 4, 6, 8, 16, and 24, the corresponding number of successfully trained networks was 35, 42, 47, 49, 49, and 46 for normal backpropagation, and 28, 34, 41, 48, 46, and 46 for fault-tolerant training. The table shows the fraction of correct outputs across all faults, as they change with network size. The two data sets correspond
E. Tchernev, R. Mulvaney, and D. Phatak
Figure 1: Training data set for the two-class six-area problem.

Table 1: Average PFT and Confidence Intervals of the Correct Runs of the Two-Class Six-Area Problem, Normal and Fault-Tolerant Training, by Hidden Node Count.

Hidden     3       4       6       8       16      24
bp PFT     0.698   0.751   0.816   0.853   0.918   0.940
   95%     0.000   0.001   0.003   0.003   0.002   0.001
ft PFT     0.701   0.756   0.820   0.856   0.918   0.942
   95%     0.000   0.001   0.002   0.002   0.002   0.001
to the normal backpropagation training (bp) and its fault-tolerant variant (ft). The smallest network to solve the problem, the minimal three-hidden-node network, has a total of six nodes and corresponds to the first (left-most) data column of the table. As expected, PFT increases with network size, the networks become increasingly redundant, and fault-tolerant training produces better PFT than normal backpropagation. The curve of diminishing returns is clearly seen, as bigger networks demonstrate less PFT gain. The only network size where the normal and the fault-tolerant training produce statistically identical PFT is at 16 hidden nodes; the two-tailed t-test distance for that pair is 69%.
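The confidence intervals and two-tailed t-tests used throughout these comparisons are standard; a minimal sketch of how such numbers are obtained follows. The per-run PFT samples below are invented for illustration only and are not the article's data.

```python
import numpy as np
from scipy import stats

# Hypothetical per-run PFT samples for two training methods (illustrative
# values only, not taken from the article).
bp = np.array([0.917, 0.918, 0.919, 0.918, 0.917])
ft = np.array([0.918, 0.917, 0.919, 0.918, 0.918])

def ci95(sample):
    """Half-width of the 95% confidence interval of the mean (t-based)."""
    return stats.sem(sample) * stats.t.ppf(0.975, df=len(sample) - 1)

print(f"bp: {bp.mean():.3f} +/- {ci95(bp):.4f}")
print(f"ft: {ft.mean():.3f} +/- {ci95(ft):.4f}")

# Two-tailed t-test for equality of the two means.
t_stat, p_value = stats.ttest_ind(bp, ft)
print(f"two-tailed p = {p_value:.3f}")
```

A large p-value here would indicate, as at 16 hidden nodes in Table 1, that the two training methods produce statistically indistinguishable PFT.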
Investigating the Fault Tolerance of Neural Networks
Figure 2: PFT of normally trained and fault-tolerant trained networks with different numbers of hidden-layer units, when using replication, plotted by resulting network size, two-class six-area problem.
4.1.2 Replication PFT. Plots of the entire PFT range, up to 100% PFT, of the successfully trained networks, when using replication, are shown in Figure 2 for plain backpropagation and for fault-tolerant backpropagation. The legend keys correspond to the number of hidden units in the seed networks that are replicated. The 95% confidence intervals would be too small to see on the plots at the data scale so are not included. The initial (nonreplicated) PFT for fault-tolerant backpropagation is included on both graphs, labeled ft, and demonstrates that replicating a smaller network dominates fault-tolerant training for the same network size, regardless of the replicated net’s training method. The graphs where PFT approaches 100% on replication (perfect fault tolerance) are shown in Figure 3. The resolution on the vertical axis is increased significantly to allow a clear view. The averaged point of reaching perfect
Figure 3: Normally trained and fault-tolerant-trained networks with different numbers of hidden-layer units achieve perfect PFT when using replication, plotted by resulting network size, two-class six-area problem.
PFT is the result of just one network achieving 100% PFT after all others; even so, the variances are small, and the 95% confidence error bars, while visible at this scale, would only hinder the identification of the data points. Overall, for a given final size, the smallest working network achieves the best PFT on replication. These graphs show that for certain network sizes, the best PFT is not exhibited by the smallest replicated network; nevertheless, it is the smallest networks that eventually reach 100% PFT. The difference between the PFT of fault-tolerant training and that of normal backpropagation is small and decreases with the number of replications. Eventually, the smallest correctly trained networks for both types of training achieve 100% PFT at the same final size.

4.2 The Two Spirals Problem. The training and testing data sets are shown in Figure 4. The chosen network architecture has two inputs, one output, and one hidden layer with a variable number of nodes. The labels on the graphs correspond to the number of nodes in the hidden layer. The minimal size of a network of such architecture to solve the problem is not known; the smallest size to actually solve the problem had 32 hidden nodes for normal and 24 hidden nodes for fault-tolerant training.

Figure 4: Data sets of the two spirals problem.

Table 2: Average PFT and Confidence Intervals of the Correct Runs of the Two Spirals Problem, Normal and Fault-Tolerant Training, by Hidden Node Count.

Hidden     24      32      48      64      96
bp PFT     —       0.775   0.807   0.841   0.877
   95%     —       0.005   0.004   0.005   0.004
ft PFT     0.789   0.793   0.832   0.855   0.892
   95%     0.000   0.016   0.005   0.006   0.007

4.2.1 Initial PFT. Table 2 shows the initial (prereplication) averaged PFTs and 95% confidence intervals of the networks that were successfully trained (out of 50 runs per size). For hidden node counts of 24, 32, 48, 64, and 96, the corresponding number of successfully trained networks was 0, 12, 12, 13, and 8 for normal backpropagation and 1, 3, 12, 5, and 3 for fault-tolerant training. The table shows the fraction of correct outputs across all faults as they change with network size. The two data sets correspond to the normal backpropagation training (bp) and its fault-tolerant variant (ft). The first (leftmost) data column in the table corresponds to the smallest network to solve the problem, which has 24 hidden and 27 total nodes, for fault-tolerant training. PFT increases with network size, and fault-tolerant training achieves a
Figure 5: PFT of normally trained and fault-tolerant-trained networks with different numbers of hidden-layer units, when using replication, plotted by resulting network size, two spirals problem.
higher PFT than normal backpropagation; the difference between the training methods is statistically significant. 4.2.2 Replication PFT. Plots of the entire PFT range, up to 100% PFT of the successfully trained networks, when using replication, are shown in Figure 5 for plain backpropagation and for fault-tolerant training. The legend keys correspond to the number of hidden units in the seed networks that are replicated. The 95% confidence error bars, while visible at this scale, would only hinder the identification of the data points and cause visual clutter so are not included. The PFT of the nonreplicated fault-tolerant trained networks is included on both graphs, labeled ft, and demonstrates that replicating a smaller, normally trained network is dominated by fault-tolerant training for the same
Figure 6: Normally trained and fault-tolerant-trained networks with different numbers of hidden-layer units achieve perfect PFT when using replication, plotted by resulting network size, two spirals problem.
network size. Replicating a smaller fault-tolerant-trained network is similarly dominated by the bigger fault-tolerant-trained ones, except for the case of 24 hidden nodes. Regardless, after several replications, the trend reverses, and for a given final size, the smaller seed network achieves the better PFT. This is illustrated by plotting the areas close to perfect PFT in Figure 6, where it can be seen how the replicated smaller networks achieve 100% PFT at smaller final sizes. The difference in PFT between networks obtained by replicating corresponding sizes of fault-tolerant-trained and normal seed networks, which is positive initially, reduces to insignificance with replication but passes through a statistically significant region where replicated large fault-tolerant-trained networks have worse PFT than replicated normally trained networks of the same size.
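The replication operation behind these curves can be sketched as follows: each hidden unit is duplicated n times and its outgoing weights are divided by n, which leaves the input-output mapping unchanged while diluting the influence of any single unit or connection. This is an illustrative sketch; the article's exact construction may differ in detail.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def replicate(W1, W2, n):
    """Replicate each hidden unit n times.

    Incoming weights are copied; outgoing weights are divided by n, so the
    summed contribution of each replica group equals the original unit's
    contribution and the input-output mapping is unchanged.
    """
    W1r = np.repeat(W1, n, axis=0)          # n copies of each hidden row
    W2r = np.repeat(W2, n, axis=1) / n      # scale the outgoing weights
    return W1r, W2r

rng = np.random.default_rng(1)
W1 = rng.normal(size=(3, 2))
W2 = rng.normal(size=(1, 3))
x = rng.normal(size=2)

y = sigmoid(W2 @ sigmoid(W1 @ x))
W1r, W2r = replicate(W1, W2, n=5)
yr = sigmoid(W2r @ sigmoid(W1r @ x))
assert np.allclose(y, yr)                   # function preserved exactly

# A single stuck-at-zero weight fault now displaces the output less,
# because the faulted connection carries only 1/5 of the original weight.
W2f = W2.copy();  W2f[0, 0] = 0.0
W2rf = W2r.copy(); W2rf[0, 0] = 0.0
d_orig = abs(sigmoid(W2f @ sigmoid(W1 @ x)) - y)
d_repl = abs(sigmoid(W2rf @ sigmoid(W1r @ x)) - yr)
print(d_orig, d_repl)
```

Because the sigmoid is monotone, the smaller pre-activation perturbation in the replicated network always yields a smaller output displacement, which is the mechanism driving the PFT gains seen in the figures.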
Table 3: Average PFT and Confidence Intervals of the Correct Runs of the Vowel Problem, Coded Outputs, Normal and Fault-Tolerant Training, by Hidden Node Count.

Hidden     32      48      64      96
bp PFT     0.966   0.983   0.992   0.997
   95%     0.001   0.002   0.001   0.001
ft PFT     0.967   0.990   0.994   —
   95%     0.017   0.003   0.003   —
4.3 The Vowel Problem. The vowel problem has 10 inputs and must classify its input patterns into 11 classes. The outputs of a neural network that solves it can encode the solutions either in exclusive form (with 11 output units, only 1 active at a time) or in coded form (with 4 output units, the class coded in binary). Runs were performed with one and with two hidden layers, but the complexity of the network with one hidden layer was such that no correct solutions were found for the fault-tolerant training case. The results presented here are for the two-layer exclusive case (10 input units, 2 hidden layers of variable equal numbers of units, 11 output units) and for the two-layer coded case (10 input units, 2 hidden layers of variable equal numbers of units, 4 output units). The labels on the graphs correspond to the total number of nodes in the hidden layers; each layer contains half as many nodes. The minimal size of networks of such architecture to solve the problem is not known; the smallest sizes to actually solve the problem had 32 hidden nodes (two layers of 16 nodes) for each of the four training method–output combinations. The vowel problem also tested the more computationally intensive fault-tolerant training with extra terms. Not only was no solution found for any size of network with one hidden layer, but the maximum size that this method was able to successfully train was 64 hidden nodes (two layers of 32 nodes each). Normal training (which operates in a smaller state space) fared a little better and was able to obtain solutions in the hardest case (the one-hidden-layer network) and to successfully train a 96-hidden-unit network (two layers of 48 nodes each) in the easiest case (two hidden layers with coded outputs).

4.3.1 Initial PFT. The initial (prereplication) PFTs of the networks that were successfully trained (out of 20 runs per size) are shown in Tables 3 and 4 for the coded and the separate outputs architectures, respectively.
For the coded case, for hidden node counts of 32, 48, 64, and 96, the corresponding number of successfully trained networks was 2, 5, 10, and 4 for normal backpropagation, and 2, 7, 5, and 0 for fault-tolerant training.
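The two output encodings can be made concrete with a small sketch (illustrative code, not the authors'): exclusive form uses 11 one-hot output units, while coded form packs the class index into 4 binary outputs, since 2^4 = 16 ≥ 11.

```python
import numpy as np

N_CLASSES = 11

def exclusive_encode(label):
    """Exclusive form: 11 output units, exactly one active at a time."""
    out = np.zeros(N_CLASSES)
    out[label] = 1.0
    return out

def coded_encode(label, bits=4):
    """Coded form: the class index in binary; 4 units suffice for 11 classes.

    Bit order (here, least significant first) is an arbitrary choice.
    """
    return np.array([(label >> b) & 1 for b in range(bits)], dtype=float)

for label in (0, 5, 10):
    print(label, exclusive_encode(label), coded_encode(label))
```

The coded form needs fewer output units but makes each output bit meaningful for several classes at once, which changes how a single output fault maps to classification errors.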
Table 4: Average PFT and Confidence Intervals of the Correct Runs of the Vowel Problem, Exclusive Outputs, Normal and Fault-Tolerant Training, by Hidden Node Count.

Hidden     32      48      64
bp PFT     0.981   0.991   0.996
   95%     0.002   0.001   0.000
ft PFT     0.985   0.995   0.997
   95%     0.000   0.001   0.000
Figure 7: Normally trained and fault-tolerant-trained networks with different numbers of hidden-layer units achieve perfect PFT when using replication; plotted by resulting network size, vowel problem, coded outputs.
For the exclusive case, for hidden node counts of 32, 48, and 64, the corresponding number of successfully trained networks was 4, 3, and 1 for normal backpropagation and 1, 2, and 1 for fault-tolerant training.
Figure 8: Difference between the PFT of fault-tolerant and normally trained networks with different numbers of hidden-layer units when using replication, plotted by resulting network size, vowel problem, coded outputs.
The tables show the averaged fraction of correct outputs across all faults, as a function of the hidden node count, with 95% confidence intervals. The smallest networks to solve the problem have 32 hidden nodes in two layers of 16 nodes each and correspond to the first (left-most) data columns of the tables. PFT increases with network size, and fault-tolerant training achieves better PFT than normal backpropagation.

4.3.2 Replication PFT. Plots of the entire PFT range of the successfully trained coded output networks, when using replication, are shown in Figure 7 for both normal and fault-tolerant training. The legend keys correspond to the number of hidden units in the seed networks that are replicated. The 95% confidence error bars are not included in these plots, as they would be invisible at the data scale.
Figure 9: Normally trained and fault-tolerant-trained networks with different numbers of hidden-layer units achieve perfect PFT when using replication; plotted by resulting network size, vowel problem, exclusive outputs.
The initial PFT for fault-tolerant training is included on two of the graphs, labeled ft, and demonstrates that replicating a smaller, normally trained network is dominated by fault-tolerant training for the same network size. The data where PFT approaches 100% are shown in the magnified graphs. It can be seen that after enough replications, the initial levels of PFT have shifted. The smallest network to reach 100% PFT for normal training is the replicated 32-hidden-unit network, the smallest one to solve the problem. For fault-tolerant training, however, the smallest network to reach perfect fault tolerance is not the replicated result of the smallest initial solution. Actually, the initial relationship between the fault tolerances is preserved in the replications almost to the 100% mark. Until the last few replications, it is the biggest initial network that has the best fault tolerance on replication, and it is overtaken at the finish line by the two smaller networks.
Figure 10: Difference between the PFT of fault-tolerant and normally trained networks with different numbers of hidden-layer units when using replication. Plotted by resulting network size, vowel problem, exclusive outputs.
The difference between the PFT of the fault-tolerant and of the normally trained networks is shown in Figure 8 as an entire data range graph and as a strongly magnified region close to zero. It can be seen that, the initially better PFT notwithstanding, the replicated normally trained networks exhibit better fault tolerance than the extra-terms-trained ones after several replications, and that relationship is retained. In two of the three cases, the normally trained network achieves 100% PFT earlier than the fault-tolerant one. The PFTs of the replicated successfully trained exclusive output networks are shown in Figure 9 for both normal and fault-tolerant training. The initial PFT for fault-tolerant training is included on both graphs, labeled ft, and demonstrates that replicating a smaller, normally trained network is dominated by fault-tolerant training for the same network size.
The data where PFT approaches 100% are shown in the magnified graphs. It can be seen that after enough replications, the initial levels of PFT have shifted. The smallest network to reach 100% PFT for normal training is the replicated 48-hidden-unit one, neither the smallest nor the largest to solve the problem. For fault-tolerant training, it is again the middle-sized network that is the first to reach perfect fault tolerance. The initial relationship between the fault tolerances is preserved in the replications almost to the 100% mark only for the normally trained case: until the last few replications, it is the biggest initial network that has the best fault tolerance on replication, and it is overtaken at the finish line by the middle-sized network. This is not the case for fault-tolerant training, where the superiority of the middle-sized network is established after the first few replications. In none of the cases does the smallest correct network challenge the PFT of the larger ones; it has the worst PFT across all final sizes. The difference between the PFT of the extra-terms and the normally trained networks is shown in Figure 10 as an entire data range graph and as a strongly magnified region close to zero. It can be seen that, the initially better PFT notwithstanding, the replicated normally trained networks exhibit better fault tolerance than the extra-terms-trained networks after several replications, and that relationship is retained. However, in only one of the three cases does the normally trained network achieve 100% PFT earlier than the fault-tolerant one. In the other two cases, the difference diminishes so that they achieve 100% PFT at the same final size.
5 Summary

The data indicate the following trends. As the size of the seed network (which is later replicated) is increased, the initial PFT (without any replications) improves. The improvement exhibits diminishing returns and tapers off as the size of the initial network is further increased, regardless of whether the problem being solved is easy or hard. Replicating correct seed networks (with 100% classification rates on the training data set) leads to the formation of networks with 100% PFT with respect to the training data set. Whether the smallest resultant network of this kind is obtained from the smallest seed network or not depends on the difficulty of the problem and the training method. The easier the problem and the training method, the more likely it is that the smallest correct network will achieve 100% final PFT at the smallest final size. Fault-tolerant training alone improves the PFT of any given seed network. For easy problems, replication PFT dominates fault-tolerant training for the same final size, while the converse is true for hard problems.
On replication, fault-tolerant seed networks produce networks with better PFT than networks obtained by replicating normal seed networks, but only for the first several replications. Eventually, the normal networks take over and in general reach 100% PFT at a smaller final size.
6 Discussion and Conclusion

It is important to realize that the fault tolerance of a neural network depends on the area of operation of its units when classifying data points. Consider, for simplicity, that the last hidden layer before some output performs a mapping of the input space into a hypercube of dimensionality k, where k is the number of hidden-layer units that connect to the output. For classification problems, the output unit has the function of a hyperplane of dimensionality k − 1, which splits the volume of the hypercube into two classification subspaces. In order for the network to be fault tolerant, each fault-induced change in the position of a data point in that hypercube must leave it on the same side of the hyperplane as it originally was. Replicating the hidden nodes of a network has the effect of reducing the influence of any one node or connection, so that a fault causes a smaller displacement of each data point in the hypercube. Clearly, the closer the points lie to the vertices of the hypercube (the more saturated the outputs of the hidden units that produce them), the fewer replications are needed to completely eliminate the effect of a fault and achieve 100% fault tolerance. In the extreme case of threshold units and unit weights, triple redundancy would be sufficient to guarantee that a point does not leave its hypercube vertex on both weight and node failures. This effect is the reason for the drastically fewer replications (and total nodes) needed to achieve fault tolerance for the n − k − n problem, as derived in Tchernev et al. (2004). It is reasonable to posit that backpropagation training on easy problems (like the two-class six-area problem) will tend to find solutions that classify cleanly and drive the hidden and output units close to their saturation levels.
Achieving perfect fault tolerance would then require approximately the same number of replications from any starting size, which would favor the smallest seed network to reach the smallest solution overall. On the other hand, backpropagation on harder problems might find minimal solutions that barely classify and drive most of their units into their linear region, requiring more replications to become fault tolerant. The interaction of these two factors determines a point of diminishing returns, where the smaller initial size cannot compensate for the increased number of replications that are required. The conclusion with respect to normal backpropagation, then, would be that there exists an optimal seed network size that, when replicated, achieves the smallest perfectly tolerant net. For easy problems, this coincides with the minimal solution network. Training is able to find solutions that operate in a sufficiently saturated state; the large number of required replications
is offset by the small size, so that replicating this network is the almost optimal way to go. Larger networks, even if operating in a more saturated state initially and requiring fewer replications, lose out because of their final size. For harder problems, the smallest networks to solve them have nonsaturated units close to the classification thresholds; they require a larger number of replications, so that the smallest fault-tolerant final size is produced by some nonminimal seed network. Fault-tolerant training with extra terms, as implemented, does not lead to the same solutions as normal backpropagation. The extra terms constrain the saturation of the nodes in the network by leading to solutions where each connection has a smaller influence overall. This explains why, even though the initial fault tolerance of these networks is better than that of the normally trained ones, they eventually lose out on replication and do not achieve better fault tolerance for the same final size. In simple problems, where the solving networks are small or minimal, there is not enough redundancy to allow a search for fault tolerance, and that search, performed anyway, prevents the network from reaching a sufficiently saturated operating state. Therefore, not only is a fault-tolerant-trained network worse in PFT than a replicated normal one obtained from a smaller seed, but replication does not increase its fault tolerance as fast as replicating a more saturated, normally trained one would. For hard problems, where the solving networks are larger than the (unknown) minimal ones, sufficient extra capacity exists to allow an efficient search for fault tolerance, so that the resulting network is more fault tolerant than a normal one replicated to the same size.
However, because of its inherently nonsaturated operation, replicating the fault-tolerant-trained network cannot produce the same PFT gain, and the eventually smallest fault-tolerant network is the result of replicating a normally trained one. In conclusion, it can be said that for sufficiently hard problems, if 100% PFT is not required, the smallest network to achieve any particular level of PFT would be a fault-tolerance-trained one. This has serious limitations, however, since the method is too computationally intensive to be of realistic use for hard real-world classification problems. It is clear that replicating a seed network does not produce the smallest overall network for a given PFT level (unless starting from extremely saturated or threshold units), but there is no realistic alternative. Whether any other method for network construction and/or search (for example, genetic algorithms or genetic programming) can achieve 100% PFT for a realistic problem without using replication is an open question that is subject to further investigation.
Acknowledgments

This work was supported by NSF grants ECS-9875705 and ECS-0196362.
References

Bishop, C. M. (1995). Training with noise is equivalent to Tikhonov regularization. Neural Computation, 7, 108–116.
Blake, C. L., & Merz, C. J. (1998). UCI repository of machine learning databases. Irvine: University of California.
Luenberger, D. G. (2003). Linear and nonlinear programming. Norwell, MA: Kluwer Academic Publishers.
Murray, A. F., & Edwards, P. J. (1994). Enhanced MLP performance and fault tolerance resulting from synaptic weight noise during training. IEEE Transactions on Neural Networks, 5, 792–802.
Neti, C., Schneider, M. H., & Young, E. D. (1992). Maximally fault tolerant neural networks. IEEE Transactions on Neural Networks, 3, 14–23.
Petsche, T., & Dickinson, B. W. (1990). Trellis codes, receptive fields, and fault tolerant, self-repairing neural networks. IEEE Transactions on Neural Networks, 1, 154–166.
Phatak, D. S. (1994). Fault tolerance of feed-forward artificial neural nets and synthesis of robust nets. Doctoral dissertation, University of Massachusetts.
Phatak, D. S. (1999). Relationship between fault tolerance, generalization and the Vapnik-Chervonenkis (VC) dimension of feedforward ANNs. In Proceedings of the International Joint Conference on Neural Networks (pp. 705–709). Washington, DC.
Phatak, D. S., & Koren, I. (1992). Fault tolerance of feedforward neural nets for classification tasks. In Proceedings of the International Joint Conference on Neural Networks (Vol. 2, pp. 386–391). Baltimore, MD.
Phatak, D. S., & Koren, I. (1995). Complete and partial fault tolerance of feedforward neural nets. IEEE Transactions on Neural Networks, 6, 446–456.
Phatak, D. S., & Tchernev, E. (2002). Synthesis of fault tolerant neural networks. In Proceedings of the International Joint Conference on Neural Networks (pp. 1475–1480). Honolulu, HI.
Protzel, P. W., Palumbo, D. L., & Arras, M. K. (1993). Performance and fault-tolerance of neural networks for optimization. IEEE Transactions on Neural Networks, 4, 600–614.
Segee, B. E., & Carter, M. J. (1991). Fault tolerance of pruned multilayer networks. In Proceedings of the International Joint Conference on Neural Networks (pp. 447–452). Seattle, WA.
Segee, B. E., & Carter, M. J. (1994). Comparative fault tolerance of parallel distributed processing networks. IEEE Transactions on Computers, 43, 1323–1329.
Sequin, C. H., & Clay, R. D. (1990). Fault tolerance in artificial neural networks. In Proceedings of the International Joint Conference on Neural Networks (pp. 703–708). San Diego, CA.
Tchernev, E., Mulvaney, R., & Phatak, D. S. (2004). Perfect fault tolerance of the n-k-n network (Tech. Rep. No. TR-CS-04-14). Baltimore, MD: University of Maryland Baltimore County.
Received May 7, 2004; accepted January 5, 2005.
REVIEW
Communicated by Terrence J. Sejnowski
How Close Are We to Understanding V1?

Bruno A. Olshausen
[email protected] Redwood Neuroscience Institute, Menlo Park, CA 94025, and Center for Neuroscience, University of California at Davis, Davis, CA 95616, U.S.A.
David J. Field
[email protected] Department of Psychology, Cornell University, Ithaca, NY 14853, U.S.A.
A wide variety of papers have reviewed what is known about the function of primary visual cortex. In this review, rather than stating what is known, we attempt to estimate how much is still unknown about V1 function. In particular, we identify five problems with the current view of V1 that stem largely from experimental and theoretical biases, in addition to the contributions of nonlinearities in the cortex that are not well understood. Our purpose is to open the door to new theories, a number of which we describe, along with some proposals for testing them.

1 Introduction

The primary visual cortex (area V1) of mammals has been the subject of intense study for at least four decades. Hubel and Wiesel's original studies in the early 1960s created a paradigm shift by demonstrating that the responses of single neurons in the cortex could be tied to distinct image properties such as the local orientation of contrast (Hubel & Wiesel, 1959, 1968). Since that time, the study of V1 has become something of a miniature industry, to the point where the annual Society for Neuroscience meeting now routinely devotes multiple sessions entirely to V1 anatomy and physiology. Without doubt, much has been learned from these efforts. However, as we shall argue here, there remains a great deal that is still unknown about how V1 works and its role in visual system function. We believe it is quite probable that the correct theory of V1 is still far afield from the currently proposed theories. It may seem surprising to some that we should take such a stance. V1 does, after all, have a seemingly ordered appearance: a clear topographic map and an orderly arrangement of ocular dominance and orientation columns. Many neurons are demonstrably tuned for stimulus features such as orientation, spatial frequency, color, direction of motion, and disparity.
And there has even emerged a fairly well-agreed-on "standard model" for V1 in which simple cells compute a linearly weighted sum of the input over
Neural Computation 17, 1665–1699 (2005)
© 2005 Massachusetts Institute of Technology
[Figure 1 schematic: image I(x, y, t) → receptive field K(x, y, t) → linear response → response normalization (− or ÷, driven by neighboring neurons) → pointwise nonlinearity → response r(t).]
Figure 1: Standard model of V1 simple cell responses. The neuron computes a weighted sum of the image over space and time, and this result is normalized by the responses of neighboring units and passed through a pointwise nonlinearity (see e.g., Carandini, Heeger, & Movshon, 1997).
space and time (usually a Gabor-like function), which is then normalized by the responses of neighboring neurons and passed through a pointwise nonlinearity (see Figure 1). Complex cells are similarly explained in terms of a summation over the outputs of a local pool of simple cells with similar tuning properties but different positions or phases. A variety of models have been proposed for the response normalization (Heeger, 1991; Geisler & Albrecht, 1997; Schwartz & Simoncelli, 2001; Cavanaugh, Bair, & Movshon, 2002a), but the net result is often to think of V1 as a kind of “Gabor filter bank.” There are numerous papers showing that this basic model fits much of the existing data well, and many scientists have come to accept this as a working model of V1 function (see, e.g., Lennie, 2003a for a discussion). Indeed, such models are widely used to predict psychophysical performance (Graham & Nachmias, 1971; Watson, Barlow, & Robson, 1983; Anderson, Burr, & Morrone, 1991), and they have been shown to provide efficient representations of natural scenes (Olshausen & Field, 1996; Bell & Sejnowski, 1997). But behind this picture of apparent orderliness lies an abundance of unexplained phenomena, a growing list of untidy findings, and an increasingly uncomfortable feeling among many about how the experiments that have led to our current view of V1 were conducted in the first place. The main problem stems from the fact that cortical neurons are highly nonlinear—that is, they emit all-or-nothing action potentials, not analog values. They also adapt, so their response properties depend on the history of activity. 
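Before turning to these nonlinearities, the standard model of Figure 1 is simple enough to state in a few lines of code. The following is a minimal, spatial-only sketch: the Gabor parameters, the normalization constant k, the half-squaring exponent n, and the neighbor pool are illustrative assumptions, not values fitted to any neuron, and the temporal part of the kernel K(x,y,t) is omitted.

```python
import numpy as np

def gabor_rf(size=16, freq=0.15, theta=0.0, sigma=3.0):
    """A Gabor-like spatial receptive field (illustrative parameters)."""
    ax = np.arange(size) - size / 2.0
    x, y = np.meshgrid(ax, ax)
    x_rot = x * np.cos(theta) + y * np.sin(theta)
    envelope = np.exp(-(x**2 + y**2) / (2.0 * sigma**2))
    return envelope * np.cos(2.0 * np.pi * freq * x_rot)

def standard_model_response(image, rf, neighbor_pool, k=1.0, n=2.0):
    """Linear weighted sum -> divisive normalization -> pointwise nonlinearity."""
    linear = np.sum(rf * image)                           # weighted sum over space
    normalized = linear / (k + np.sum(neighbor_pool**2))  # divide by pooled activity
    return max(normalized, 0.0) ** n                      # half-squaring output

rf = gabor_rf()
stimulus = rf.copy()   # a matched (preferred) stimulus
r_quiet = standard_model_response(stimulus, rf, np.zeros(8))
r_busy = standard_model_response(stimulus, rf, 2.0 * np.ones(8))
# The same stimulus evokes a weaker response when the neighbors are active.
```

Running the sketch with an active neighbor pool weakens the response to an identical stimulus, which is the divisive-normalization behavior the model is meant to capture.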
Most important, cortical pyramidal cells have highly elaborate dendritic trees, and realistic biophysical models that include voltage-gated channels suggest that each thin branch could act as a nonlinear subunit, so that any one neuron could be computing many different nonlinear combinations of its inputs (Hausser & Mel, 2003; Polsky, Mel, & Schiller, 2004), in addition to being sensitive to coincidences (Softky & Koch, 1993; Azouz & Gray, 2000, 2003). Everyone knows that neurons are nonlinear, but few have acknowledged the implications for studying cortical function. Unlike linear systems, where
How Close Are We to Understanding V1?
there exist mathematically tractable textbook methods for system identification, nonlinear systems cannot be teased apart using some straightforward, structuralist approach. That is, there is no unique "basis set" with which one can probe the system to characterize its behavior in general.1 Nevertheless, the structuralist approach has formed the bedrock of V1 physiology for the past four decades. Researchers have probed neurons with spots, edges, gratings, and a variety of mathematically elegant functions in the hope that the true behavior of neurons can be explained in terms of some simple function of these components. However, the evidence that this approach has been successful is lacking. We simply have no reason to believe that a population of interacting neurons can be reduced in this way. For any complex system, it seems reasonable to begin where the system acts rationally: to study the behavior under conditions where one's models are relatively effective. But for a neural system, that leaves the question as to whether such behavior represents the relevant aspect of the neuron's activity: Does this help us understand how neurons operate under natural conditions? Much of our understanding of V1 is derived from recording from one neuron at a time using simple stimuli (edges, gratings, spots). From this body of experiments has emerged the standard model that forms the basis for our conceptual understanding of V1. In recent years, a number of innovative studies have moved away from this basic approach, recording from multiple neurons with complex, ecologically relevant stimuli. Are these studies simply adding minor correction factors to our understanding, or will they require us to completely revamp the current theories? Are the current models close to accounting for the majority of responses in the majority of neurons in V1? How close are we to understanding V1? In this review, we present our reasons for believing that we may have far to go in understanding V1. 
We identify five fundamental problems with the current view of V1 function that stem largely from experimental and theoretical biases, in addition to the contributions of nonlinearities in the cortex that are not well understood. Furthermore, we attempt to quantify the level of our current understanding by considering two important factors: an estimate of the fraction of V1 neuron types that are typically characterized in experimental studies and the fraction of variance explained in the responses of these neurons under natural viewing conditions. Together, these two factors lead us to conclude that at present, we can rightfully claim to understand only 10% to 20% of how V1 actually operates under normal conditions. Our aim in pointing these things out is not simply to tear down the current framework. We ourselves have attempted to account for some aspects of the

1 The Volterra series expansion is often touted as a general approach for characterizing nonlinear systems, but it has been of little practical value in analyzing neural systems because it requires estimating many higher-order moments. In addition, it is an overly general "black box" approach that does not easily allow one to incorporate prior knowledge about the types of nonlinearities known to exist in the nervous system.
standard model in terms of efficient coding principles (sparse coding), so obviously we believe that we have made a good start. Rather, our goal is to show how much room there is for new theories and where the weaknesses in the current theories might lie. In the second half of the review, we describe a few of our favorite alternatives to the standard theories. A central conclusion that emerges from this exercise is that we need to begin seriously studying how V1 behaves with natural scenes, using multiunit recording techniques, in addition to explicitly describing any potential biases in the gathering of data. We believe this approach can help to reveal not just how much we know about neural coding in the visual pathway but also how much we do not know.

2 Five Problems with the Current View

2.1 Biased Sampling of Neurons. The vast majority of our knowledge about V1 function has been obtained from single unit recordings in which a single microelectrode is brought into close proximity with a neuron in cortex. Ideally, when doing this, one would like to obtain an unbiased sample from any given layer of cortex. But some biases are difficult to avoid. For instance, neurons with large cell bodies will give rise to extracellular action potentials that have larger amplitudes and propagate over larger distances than neurons with small cell bodies. Without careful spike sorting, the smaller extracellular action potentials may easily become lost in the background when in the vicinity of neurons with large extracellular action potentials. This creates a bias in sampling that is not easy to dismiss. Even when a neuron has been successfully isolated, detailed investigation of the neuron may be bypassed if it does not respond "rationally" to standard test stimuli or fit the stereotype of what the investigator believes the neuron should do. This is especially true for higher visual areas such as V4, but it is also true for V1. 
Such neurons are commonly regarded as “visually unresponsive.” It is difficult to know how frequently such neurons are encountered because often they simply go unreported, or else it is simply stated that only visually responsive units were used for analysis. While it is admittedly difficult to characterize the information processing capabilities of a neuron that seems unresponsive, it is still important to know in what way these neurons are unresponsive. What are the statistics of activity? Do they tend to appear bursty or tonic? Do they tend to be encountered in particular layers of cortex? And most important, are they merely unresponsive to bars and gratings, or are they also equally uninterpretable in their responses to a wider variety of stimuli, such as natural images? A seasoned experimentalist who has recorded from hundreds of neurons would probably have some sense of these things. But for the many readers not directly involved in collecting the data, there is no way of knowing these unreported aspects of V1 physiology. It is possible that someone may eventually come up with a theory that could account for some of these unresponsive neurons, but this cannot happen if no one knows they are there.
[Figure 2: panel (A) plots probability vs. firing rate (spikes/sec); panel (B) plots measured mean rate (spikes/sec) and fraction of population vs. threshold (spikes/sec)]
Figure 2: (A) Exponential firing rate distribution with a mean of 1 Hz (dashed line denotes mean). (B) Resulting overall mean rate of the measured population (top) and fraction of the population captured (bottom) as a result of recording from neurons only above a given mean firing rate (threshold). A log-normal distribution of mean firing rates was assumed, with rate r = 10^u, u ∼ N(−0.55, 0.5).
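The selection effect summarized in Figure 2 is easy to reproduce numerically. The sketch below follows the caption's assumptions (an exponential rate distribution with a 1 Hz mean for panel A, and log-normal mean rates r = 10^u with u ~ N(−0.55, 0.5) for panel B); the sample size and random seed are arbitrary choices of ours.

```python
import numpy as np

rng = np.random.default_rng(0)

# (A) Exponential firing-rate distribution with a 1 Hz mean: rates of
# 5 Hz and above occupy only exp(-5) ~ 0.007 of the probability mass.
p_above_5hz = np.exp(-5.0)

# (B) Heterogeneous population: log-normal mean rates, r = 10**u,
# u ~ N(-0.55, 0.5), as in the Figure 2 caption.
u = rng.normal(-0.55, 0.5, size=1_000_000)
rates = 10.0 ** u

def measured_population(threshold):
    """Mean rate and fraction captured when only neurons above threshold are kept."""
    kept = rates[rates >= threshold]
    return kept.mean(), kept.size / rates.size

mean_1hz, frac_1hz = measured_population(1.0)   # selection threshold of 1 Hz
mean_2hz, frac_2hz = measured_population(2.0)   # selection threshold of 2 Hz
# Raising the threshold inflates the measured mean rate while shrinking the
# fraction of the population that is actually characterized.
```

With a 1 Hz selection threshold, well under 20% of the simulated population survives, and the survivors' mean rate sits well above the true population mean, which is the bias the figure illustrates.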
A related bias that arises in sampling neurons is that the process of hunting for neurons with a single microelectrode will typically steer one toward neurons with higher firing rates. One line of evidence suggesting that this is a significant bias comes from work estimating mean firing rates in the cortex based on energy consumption. Attwell and Laughlin (2001) and Lennie (2003b) calculate that the average activity must be relatively low—less than 1 Hz in primate cortex. However, in the single-unit literature, one finds many studies in which even the spontaneous or background rates are well above 1 Hz. This suggests that the more active neurons are substantially overrepresented (Lennie, 2003b). What makes matters worse is that if we assume V1 neurons exhibit a roughly exponential firing rate distribution, as has been demonstrated for natural scenes and other stimuli (Baddeley et al., 1997), then a mean firing rate of 1 Hz would yield the distribution shown in Figure 2A.2 With such a distribution, only a small fraction of neurons would exhibit the sorts of firing rates normally associated with a robust response. For example, the total probability for firing rates of even 5 Hz and above is 0.007, meaning that

2 Our own analysis of the firing rate distribution, as measured from the PSTH in response to repeated presentation of natural movies (measured from V1 of anaesthetized cats—J. Baker, S. C. Yen, C. M. Gray, personal communication to the authors, 2004), suggests that the distribution is actually power-law, which would mean that it is even more heavily skewed toward zero.
one would have to wait 1 to 2 minutes on average in order to observe a 1-second interval containing five or more spikes. It seems possible that such neurons could either be missed altogether or else purposely bypassed because they do not yield enough spikes for data analysis. For example, the overall mean firing rate of V1 neurons in the Baddeley et al. study was 4.0 Hz (SD 3.6 Hz), suggesting that these neurons constitute a subpopulation that were perhaps easier to find but not necessarily representative of the population as a whole. Interestingly, the authors point out that even this rate is considered low (which they attribute to anaesthesia), as previous studies (Legendy & Salcman, 1985) report the mean firing rate to be 8.9 Hz (SD 7.0 Hz). Given the variety of neurons in V1, it seems reasonable to presume there exists a heterogeneous population of neurons with different mean firing rates. If we assume some distribution over these rates, then it is possible to obtain an estimate of the fraction of the population characterized given a particular criterion response. And from that, we can calculate what the observed mean rate would be for that fraction. The result of such an analysis, assuming a log-normal distribution of mean rates with an overall mean of 1 Hz, is shown in Figure 2B. As one can see, an overall mean of 4 Hz implies that the selection criterion was somewhere between 1 and 2 Hz, which would capture less than 20% of the population. Neurophysiological studies of the hippocampus provide an interesting lesson about the sorts of biases introduced by low firing rates. Prior to the use of chronic implants, in which the activity of neurons could be monitored for extended periods while a rat explored its environment, the granule cells of the dentate gyrus were thought to be mostly high-rate “theta” cells (e.g., Rose, Diamond, & Lynch, 1983). 
But it eventually became clear that the majority are actually very low-rate cells (Jung & McNaughton, 1993) and that for technical reasons only high-rate interneurons were being detected in the earlier studies (W. E. Skaggs, personal communication to the authors, January 2004). In fact, Thompson and Best (1989) found that nearly two-thirds of all hippocampal neurons that showed activity under anaesthesia became silent in the awake, behaving rat. This overall pattern appears to be upheld in macaque hippocampus, where the use of chronic implants now routinely yields neurons with overall firing rates below 0.1 Hz (Barnes et al., 2003), which differs by nearly two orders of magnitude from the "low baseline rates" of 8.1 Hz reported by Wirth et al. (2003) using acutely implanted electrodes. The dramatic turn of events afforded by the application of chronic implants combined with natural stimuli and behavior in the hippocampus can only make one wonder what mysteries could be unraveled when similar techniques are applied to visual cortex. What is the natural state of activity during free viewing of natural scenes, where the animal is actively exploring its environment? What are the actual average firing rates and other statistics of activity among layer 2/3 pyramidal cells? What are the huge numbers
of granule cells in macaque layer 4, which outnumber the geniculate fiber inputs by 30 to 1, doing? Do they provide a sparser code than their geniculate counterparts? And what about the distribution of actual receptive field sizes? Current estimates show that most parafoveal neurons in V1 have receptive field sizes on the order of 0.1 degree. But based on retinal anatomy and psychophysical performance, one would expect to find a substantial number of neurons with receptive fields an order of magnitude smaller, ca. 0.01 degree (Olshausen & Anderson, 1995). Such receptive field sizes are extremely rare, if not nonexistent, in the existing data on macaque V1 neurons collected using acute recording techniques (De Valois, Albrecht, & Thorell, 1982; Parker & Hawken, 1988). Overall, then, one can identify at least three different biases in the sampling of neurons:

1. Preference for neurons with large cell bodies and large extracellular action potentials
2. Preference for "visually responsive" neurons
3. Preference for neurons with high firing rates

So where does this leave us? Let us be conservative. If we assume that 5% to 10% of neurons are missed because they have weak extracellular action potentials, another 5% to 10% are discarded because they are not visually responsive, and 50% to 60% are missed because of low firing rates (assuming a conservative threshold of 0.5 Hz in Figure 2), then even allowing for some overlap among these populations would yield the generous estimate that 40% of the population has actually been characterized (although we would not be surprised if that number is as low as 20%).

2.2 Biased Stimuli. Much of our current knowledge of V1 neural response properties is derived from experiments using reduced stimuli. Often these stimuli are ideal for characterizing linear systems—spots, white noise, or sine wave gratings—or else they are designed around preexisting notions of how neurons should respond. 
The hope is that the insights gained from studying neurons using these reduced stimuli will generalize to more complex situations, such as natural scenes. But of course there is no guarantee that this is the case. And given the nonlinearities inherent in neural responses, we have every reason to be skeptical. Sine wave gratings are ubiquitous tools in visual system neurophysiology and psychophysics. In fact, the demand for using these stimuli is so high that some companies produce lab equipment with specialized routines designed for this purpose (e.g., Cambridge Research Systems). But sine waves are special only because they are eigenfunctions of linear, time- or space-invariant systems. For nonlinear systems, they bear no particular meaning and occupy no special status. In the auditory domain, sine waves could be justified from the standpoint that many natural sounds are produced by
oscillating membranes. However, in the visual world, there are few things that naturally oscillate either spatially or temporally. The Fourier basis set is just one of many possible basis sets, and if the system is nonlinear, no one basis set will necessarily provide a proper account of the system. Bars of light, Gabor functions, Walsh patterns, or any other basis set will suffer from similar problems, requiring assumptions about the types of nonlinearities that are present. The Gabor function has been argued to provide a good model of cortical receptive fields (Field & Tolhurst, 1986; Jones & Palmer, 1987). However, the methods used to measure the receptive field in the first place generally search for the best-fitting linear model. They are not tests of how well the receptive field model actually describes the response of the neuron. Not until recent work by Gallant and colleagues (David, Vinje, & Gallant, 2004) have these models been tested in ecological conditions. And as we discuss below, the results demonstrate that these models often fail to adequately capture the actual behavior of neurons. The use of white noise and m-sequences can provide some advantage over the traditional linear systems approach, as they can provide a wider range of stimuli than a simple basis set and are thus capable of mapping out the nonlinearity of a system if the nonlinearities take on particular forms (e.g., Nykamp & Ringach, 2002). In addition, by analyzing the eigenvectors of the spike-triggered covariance matrix, one can recover fairly complex nonlinear models, such as the hypothetical subunits composing a complex cell, or suppressive dimensions in the stimulus space (Touryan, Lau, & Dan, 2002; Rust, Schwartz, Movshon, & Simoncelli, 2004). However, there is only one way to map a nonlinear system with complete confidence: present the neuron with all possible stimuli. The scope of this task is truly breathtaking. 
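The size of this search space can be counted exactly with arbitrary-precision integers; the patch size, bit depth, and recording-session numbers below match the example discussed in the text.

```python
# Counting the stimulus space needed for exhaustive characterization.
levels = 2 ** 6                          # 6 bits of gray level per pixel
pixels = 8 * 8                           # an 8 x 8 patch
static_patches = levels ** pixels        # = 2**384 distinct static patches
movie_sequences = static_patches ** 10   # sequences of 10 such patches

# A 10-hour white-noise recording session at 30 frames per second
# samples only a vanishing fraction of this space:
frames_shown = 10 * 60 * 60 * 30         # = 1,080,000 stimuli, roughly 10**6
```

Even the static-patch count exceeds a googol (10^100) and dwarfs the roughly 10^80 particles estimated to exist in the universe, while the 10-frame-sequence count exceeds 10^1000; the 10-hour session shows about 10^6 stimuli.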
Even an 8 × 8 pixel patch with 6 bits of gray level requires searching 2^384 > 10^100 possible combinations (a googol of combinations). If we allow for temporal sensitivity and include a sequence of 10 such patches, we are exceeding 10^1000. With the number of particles in the universe estimated to be in the range of 10^80, it should be clear that this is far beyond what any experimental method could explore. In theory, a nonlinear neuron could behave quite rationally for all but a handful of these stimuli, so unless this handful has been measured, there is no way to be certain the neuron has been adequately characterized. The use of independent white noise can theoretically present a neuron with all possible stimuli. However, 10 hours of recording from a single neuron with a patch like that above at 30 frames per second will present just 10^6 out of the 10^1000 possible stimuli. Using such a tiny fraction of the possible stimuli allows mapping of the nonlinearities only if the nonlinearities are quite smooth. The deeper question is whether one can predict the responses of neurons from some combinatorial rule of the responses derived from a reduced set of stimuli. The response of the system to any reduced set of stimuli cannot be guaranteed to provide the information needed to predict the response to an arbitrary combination of those stimuli. Of course, we will never know
this until it is tested, and that is precisely the problem: the central assumption of the elementwise, reductionist approach has yet to be thoroughly tested. We believe that the solution to these problems is to turn to natural scenes. Our intuitions for how to reduce stimuli should be guided by the sorts of structure that occur in natural scenes, not arbitrary (or even elegant) mathematical functions or stimuli that are conceptually simple or happen to be easy to generate on a monitor. Since it is impossible to map out the response to all possible stimuli, some assumptions about the nature of the nonlinearity and the stimulus space must be made. The assumption we believe is appropriate is that the nonlinearities relevant to visual processing are most likely to be revealed when the system is presented with ecologically relevant stimuli. Traditionally, experimentalists have been reluctant to use natural scenes as stimuli because they seem highly variable and uncontrolled. But in recent years there has been significant progress in modeling the structure of natural images (Simoncelli & Olshausen, 2001), and already a number of studies have used some of the basic properties of natural scenes (1/f^2 power spectrum, contrast distributions, texture statistics, etc.) to develop parametric descriptions of natural images that can be used to generate experimental stimuli (e.g., Knill, Field, & Kersten, 1990; Heeger & Bergen, 1995). In addition, there have been some recent attempts to map out the nonlinearities in response to natural images (Sharpee, Rust, & Bialek, 2004). And the development of several adaptive stimulus techniques looks to be a promising avenue for determining the relevant stimulus for sensory neurons (Foldiak, Xiao, Keysers, Edwards, & Perrett, 2004; Edin, Machens, Schutze, & Herz, 2004; O'Connor, Petkov, & Sutter, 2004). 
In summary, then, there are two main reasons for using natural scenes as stimuli: (1) by devoting resources to relevant ecological stimuli, the experimentalist has a greater chance of finding and mapping the nonlinearities relevant to the function of neurons, and (2) the responses to natural scenes provide an ecologically meaningful test of any neural model. Even if nonecological stimuli are used to map a neuron's behavior, the true test that the characterization is correct is to demonstrate that one can predict the neuron's behavior in ecological conditions.

2.3 Biased Theories. Currently in neuroscience, there is an emphasis on telling a simple story. This often encourages investigators to demonstrate when a theory explains data, not when a theory provides a poor model. In addition, editorial pressures can encourage one to make a tidy picture out of data that may actually be quite messy. This, of course, runs the risk of forcing a picture that does not actually exist. Theories then emerge that are centered around explaining a particular subset of published data, or which can be conveniently proven rather than being motivated by functional considerations: How does this help the brain to solve the real problems of vision? For instance, early work demonstrating the spatial frequency selectivity of neurons (e.g., Blakemore & Campbell, 1969) led a number of investigators
toward a Fourier view of the cortex. Such work led to thousands of studies devoted to questions regarding frequency tuning and the relevance of this tuning to the human detection and discrimination of sinusoidal gratings. This left us with complex theories for how we detect gratings, but with little understanding of how such a system would function in the natural world. Another example is the classification of V1 neurons into the categories of simple, complex, and hypercomplex or end-stopped. Simple cells are noted for having oriented receptive fields organized into explicit excitatory and inhibitory subfields, whereas complex cells are tuned for orientation but are relatively insensitive to position and the sign of contrast (black-white edge versus white-black edge). Hypercomplex cells display more complex shape selectivity, and some appear most responsive to short bars or the terminations of bars of light (so-called end-stopping). Are these categories real, or a result of the particular way neurons were stimulated and the data analyzed? A widely accepted theory that accounts for the distinction between simple and complex cells is that simple cells compute a (mostly linear) weighted sum of image pixels, whereas complex cells compute a sum of the squared and half-rectified outputs of simple cells of the same orientation—the so-called energy model (Adelson & Bergen, 1985). This theory is consistent with measurements of response modulation in response to drifting sine wave gratings, otherwise known as the F1/F0 ratio (Skottun et al., 1991). From this measure, one finds clear evidence for a bimodal distribution of neurons, with simple cells having ratios greater than one and complex cells having ratios less than one. Recently, however, it has been argued that this particular nonlinear measure tends to exaggerate or even introduce bimodality rather than reflecting an actual intrinsic property of the data (Mechler & Ringach, 2002). 
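The F1/F0 measure, and why the energy model pushes it below one, can be illustrated with a toy simulation. This is an idealization rather than a fit to data: the "simple cell" is just a half-rectified cosine response to a drifting grating, and the "complex cell" sums squared half-rectified subunits at four phases.

```python
import numpy as np

f_hz, dur_s, n_samp = 2.0, 4.0, 4000             # grating frequency, duration, samples
t = np.linspace(0.0, dur_s, n_samp, endpoint=False)
drive = np.cos(2.0 * np.pi * f_hz * t)           # linear response to a drifting grating

def f1_f0(rate):
    """Response modulation at the grating frequency (F1) over the mean rate (F0)."""
    f0 = rate.mean()
    f1 = 2.0 * np.abs(np.mean(rate * np.exp(-2j * np.pi * f_hz * t)))
    return f1 / f0

# "Simple cell": half-rectified linear response -> strongly modulated.
simple = np.maximum(drive, 0.0)

# "Complex cell" (energy model): squared half-rectified subunits at four
# phases sum to a nearly constant rate -> almost no modulation.
phases = (0.0, 0.5 * np.pi, np.pi, 1.5 * np.pi)
complex_cell = sum(np.maximum(np.cos(2.0 * np.pi * f_hz * t - p), 0.0) ** 2
                   for p in phases)
```

Here f1_f0(simple) comes out near π/2 ≈ 1.57 (a "simple" classification under the greater-than-one criterion), while f1_f0(complex_cell) is essentially zero; real neurons, of course, need not fall so cleanly on either side of one, which is the point at issue.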
When receptive fields are instead characterized by the degree of overlap between zones activated by increments or decrements in contrast, one obtains a continuous, unimodal distribution when the overlap is expressed as the normalized distance between the zones, but a bimodal distribution when expressed as an overlap index (sum of widths minus the separation divided by sum of widths plus the separation) (Mata & Ringach, 2005; Kagan, Gur, & Snodderly, 2002). In addition, the energy model of complex cells does a poor job accounting for complex cells with a partial overlap of activating zones. Thus, the way in which response properties are characterized can have a profound effect on the resulting theoretical framework that is adopted to explain the results. The notion of two classes of neurons, simple and complex, has been firmly planted in the minds of modelers and experimentalists alike, but a closer examination of the data reveals that this classification scheme may actually be an artifact of the lens through which we view the data. The notion of end-stopped neurons introduces even more questions when one considers the structure of natural images. Most natural scenes are not littered with line terminations or short bars (see Figure 3, middle). Indeed, at the scale of a V1 receptive field, the structures in this image are
Figure 3: A natural scene (left) and an expanded section of it (middle). Far right shows the information conveyed by an array of complex cells at four different orientations. The length of each line indicates the strength of response of a complex cell at that location and orientation. The solid black line shows the location of the boundary of the log in the original image. Note that very few complex cells of the appropriate orientation are responding along this contour.
quite complex and defy the simple line-drawing-like characterization of a “blocks world.” Where in such an image would one expect an end-stopped neuron to fire? By asking this question, one could possibly be led to a more ecologically relevant theory of these neurons than suggested by simple laboratory stimuli. Another theory bias often embedded in investigations of V1 function is the notion that simple cells, complex cells, and hypercomplex cells are actually coding for the presence of edges, corners, or other two-dimensional (2D) shape features in images. However, much of this thinking is derived from a rather cartoon view of images. Computer vision studies provide clear evidence of the fallacy of the purely bottom-up approach. One cannot compute the presence even of simple edges of an object purely from the luminance discontinuities (i.e., using a filter such as a simple or complex cell model). As an example, Figure 3 demonstrates the result of processing a natural scene with the standard energy model of a complex cell. Far from making contours explicit, this representation creates a cluttered array of orientation signals that make it difficult to discern what is actually going on in the scene. Our perception of crisp contours, corners, and junctions in images is largely a post hoc phenomenon that is the result of massive inferential computations performed by the cortex, which are heavily informed by context and high-level knowledge. It could well be that our initial introspections about scene structure are a poor guide as to the actual problems faced by the cortex. In order to properly understand V1 function, our theories will need to be guided by functional considerations and an appreciation for the ambiguities contained in natural images rather than being biased by simplistic notions of feature detection that are suggested by the responses of a select population
of neurons recorded using simplified stimuli. One of the most challenging problems facing the cortex is that of inferring a representation of 3D surfaces from the 2D image (Nakayama, He, & Shimojo, 1995; see also section 3.4). This is not an easy problem to solve and still lies beyond the abilities of modern computer vision. It seems quite likely that V1 plays a role in solving this problem, but understanding how it does so will require going beyond bottom-up filtering models to consider how top-down information is used in the interpretation of images (Olshausen, 2003; Lee & Mumford, 2003; see also section 3.5 below).

2.4 Interdependence and Contextual Effects. It has been estimated that roughly 5% of the excitatory input in layer 4 of V1 arises from the lateral geniculate nucleus (LGN), with the majority resulting from intracortical inputs (Peters & Payne, 1993; Peters, Payne, & Budd, 1994). Thalamocortical synapses have been found to be stronger, making them more likely to be effective physiologically (Ahmed, Anderson, Douglas, Martin, & Nelson, 1994). Nevertheless, based on visually evoked membrane potentials, Chung and Ferster (1998) have argued that the geniculate input is responsible for just 35% of a layer 4 neuron's response. This leaves 65% of the response determined by factors outside the direct feedforward input. Using optical imaging methods, Arieli, Sterkin, Grinvald, and Aertsen (1996) showed that the ongoing population activity can account for 80% of an individual V1 neuron's response variance, and recent work using multielectrode arrays has shown that the ongoing activity of V1 neurons is only slightly modified by visual input (Fiser, Chiu, & Weliky, 2004). Thus, we are left with the real possibility that somewhere between 60% and 80% of the response of a V1 neuron is a function of other V1 neurons, or inputs other than those arising from LGN. 
It should also be noted that recent evidence from the early blind has demonstrated that primary visual cortex has the potential for a wide range of multimodal input. Sadato et al. (1996) and Amedi, Raz, Pianka, Malach, and Zohary (2003) demonstrated that both tactile braille reading and verbal material can activate visual cortex in those who have been blind from an early age, even though no such activation occurs in those with normal sight. This implies that in the normal visual system, primary visual cortex has the potential for interactions with quite high-level sources of information. That V1 neurons are influenced by context—the spatiotemporal structure outside the classical receptive field (CRF)—is by now well known and has been the subject of many investigations over the past decade. Knierim and Van Essen (1992) showed that many V1 neurons are suppressed by a field of oriented bars outside the classical receptive field of the same orientation, and Sillito, Grieve, Jones, Cudeiro, and Davis (1995) have shown that one can introduce quite dramatic changes in orientation tuning based on the orientation of gratings outside the CRF. Other investigators have probed the spatial specificity of the surround using grating patches and demonstrated fairly specific zones of suppression (Walker, Ohzawa, & Freeman, 1999; Cavanaugh,
How Close Are We to Understanding V1?
1677
Bair, & Movshon, 2002b). And these studies, in addition to others (see Seriès, Lorenceau, & Frégnac, 2003, for a review), have likely tapped only a portion of the interdependencies and contextual effects that actually exist. The problem in teasing apart contextual effects in such a piecemeal fashion is that one faces a combinatorial explosion in the number of possible spatial and featural configurations of surrounding stimuli such as bars or gratings. What we really want to know is how neurons respond within the sorts of context encountered in natural scenes. For example, given the results of Knierim and Van Essen (1992) using bar stimuli, or Sillito et al. (1995) using gratings, what should we reasonably expect to result from the sorts of context seen in the natural scene of Figure 3? Indeed, it is not even clear whether one can answer the question, since the contextual structure here is so much richer and more diverse than that which has been explored experimentally. Some of the initial studies exploring the role of context in natural scenes have demonstrated pronounced nonlinear effects that tend to sparsify activity in a way that would have been hard to predict from the existing reductionist studies (Vinje & Gallant, 2000). More studies along these lines are needed, and most important, we need to understand how and why the context in natural scenes produces such effects.

Another striking form of interdependence exhibited by V1 neurons is in the synchrony of activity. Indeed, the fact that one can even measure large-scale signals such as the local field potential or electroencephalogram (EEG) implies that large numbers of neurons must be acting together. Gray, König, Engel, and Singer (1989) demonstrated gamma band synchronization between neurons in cat V1 when bars moved through their receptive fields in similar directions, suggesting that synchrony is connected to a binding or segmentation process. More recently, Wörgötter et al.
(1998) have shown that receptive field sizes change significantly with the degree of synchrony exhibited in the EEG, and Maldonado, Babul, Singer, Rodriguez, and Grün (2004) have shown that periods of synchronization preferentially occur during periods of fixation as opposed to during saccades or drifts. However, what role synchrony plays in the normal operation of V1 neurons is entirely unclear, and it is fair to say that this aspect of response variance remains a mystery.

2.5 Ecological Deviance. We have argued above for experiments that measure the responses of neurons in ecological conditions even when no model is capable of predicting the results—or, we should say, especially if no model can predict the results. Publishing findings only in conditions when a particular model works would be poor science. It is important to know not only where the current models can successfully predict neural behavior, but also under what conditions they break down and why. And as we have emphasized above, it is most important to know how they fare under ecological conditions. If the current models fail to predict neural responses under such conditions, then the literature should reflect this.
1678
B. Olshausen and D. Field
In the past few years, a number of labs have begun using natural scenes as stimuli when recording from neurons in the visual pathway (Dan, Atick, & Reid, 1996; Baddeley et al., 1996; Keysers, Xiao, Földiák, & Perrett, 2001; Vinje & Gallant, 2002; Ringach, Hawken, & Shapley, 2002; Smyth, Willmore, Baker, Thompson, & Tolhurst, 2003; David et al., 2004). In particular, the Gallant lab at UC Berkeley has taken the approach of attempting to determine how well one can predict the responses of V1 neurons to natural stimuli using a variety of different models. However, assessing how well these models fare, and what it implies about our current understanding of V1, is difficult for at least three reasons. First, one must make several assumptions (either implicitly or explicitly) regarding what aspects of the response are relevant to the model. Spike counts will show significant variability over repeated trials (Tolhurst, Movshon, & Dean, 1983). One can take the average over a number of presentations, but this implicitly assumes that the variability can be attributed to noise. This can be questioned, especially considering that in many cases, individual spikes have been shown to have relatively high reliability (Rieke, Warland, de Ruyter van Steveninck, & Bialek, 1997). The trial-to-trial variability could well be due to internally generated dynamics that play an important role in information processing that we do not yet understand (Arieli et al., 1996; Fiser et al., 2004; see also section 3.1). Furthermore, to take averages, one must make assumptions regarding the temporal window over which the average is computed. Second, these studies are best performed with an awake, behaving animal. In such conditions, there are limitations to the spatial and temporal accuracy with which the gaze can be measured. When averaging across presentations of stimuli, one must ascertain whether or not the same stimulus was actually presented.
Again, one must make an assumption as to what spatiotemporal window to use. The third problem is that whatever model is chosen, one is always subject to the criticism that the model is not sufficiently elaborate. Thus, any inability to predict the neuron’s response might be argued to be simply due to some missing element in the model. For example, David et al. (2004) have explored two different types of models: a linearized spatiotemporal receptive field model, in which the neuron’s response is essentially a weighted sum of the image pixels over space and time, and a phase-separated Fourier model, which allows one to capture the phase invariance nonlinearity of a complex cell. These models can typically explain between 20% and 40% of the response variance. Correcting for intertrial variability improves matters somewhat and it is possible that with more trials and the addition of other nonlinearities such as contrast normalization, adaptation, and response saturation, the fraction of variance explained could rise even more above these levels (and this is a current direction of these studies).
Figure 4: Activity of a V1 neuron in anesthetized cat in response to a natural movie. (A) The PSTH of the neuron's response (dashed line), together with the predicted response (solid line) generated from the model r(t) = α [h(Σ_x k(x, t) ∗ I(x, t) + θ)]^p + r_0. The function h(·) is a half-wave rectifying function, and the parameters α, p, θ, and r_0 are fit to minimize the squared error with the data. The resulting correlation coefficient in this case is 0.36. Average spike counts were obtained by averaging across 100 trials in 35 ms bins (corresponding to the frame rate). (B) The kernel k(x, t) was measured via reverse correlation with an m-sequence and is shown here as a series of frames in 35 ms intervals, with the center time of the interval displayed above each frame.
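The model in the caption can be sketched numerically as follows. This is a minimal discretized version with hypothetical array shapes, not the authors' fitting code: the spatial sum and temporal convolution are carried out over the last T frames of the movie at each time step, and the result is passed through the rectifying power-law nonlinearity.

```python
import numpy as np

def predict_response(k, I, alpha=1.0, p=2.0, theta=0.0, r0=0.0):
    """Sketch of r(t) = alpha * [h(sum_x k(x, t) * I(x, t) + theta)]^p + r0.

    k : (T, X) space-time kernel (e.g., a reverse-correlation estimate)
    I : (N, X) stimulus movie, one spatial vector per frame
    h() is half-wave rectification; alpha, p, theta, r0 are the
    fitted gain, exponent, threshold, and offset of the caption.
    """
    T, X = k.shape
    N = I.shape[0]
    r = np.zeros(N)
    for t in range(T - 1, N):
        # spatiotemporal inner product with the preceding T frames
        drive = np.sum(k[::-1] * I[t - T + 1 : t + 1])
        h = max(drive + theta, 0.0)      # half-wave rectification
        r[t] = alpha * h ** p + r0
    return r

# tiny smoke test with a hypothetical 1-pixel, 1-frame kernel:
k = np.array([[2.0]])
I = np.array([[1.0], [3.0], [-1.0]])
print(predict_response(k, I, alpha=1.0, p=1.0))  # 2, 6, 0 for these inputs
```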
We believe such reports are critically important for several reasons. First, such results create a benchmark for showing how well the standard or basic models actually predict ecologically relevant data. Second, these are well-established models that have been given a fair run for their money. One could imagine any number of improvements to these models, and it will be interesting to see if they fare better, but in the meantime, these results provide a useful baseline for comparison. Furthermore, these are the data that represent the ultimate goal of any computational model, and so they are crucial to presenting a complete picture of V1 function. Given the nature of the errors, we do not believe that the addition of simple response nonlinearities such as contrast normalization is likely to improve matters much. Given these results with both linear and Fourier power models, our conjecture is that, even in the best case, the percentage of variance explained is likely to asymptote at 30% to 40% with the standard model. One reason for our pessimism is the way in which these models fail. For example, Figure 4 shows data collected from the laboratory of Charles Gray at Montana State University, Bozeman, in which the activities of V1 neurons in anesthetized cat are recorded in response to repeated presentations of a natural movie (C. M. Gray, J. Baker, & S. C. Yen, personal
communication to the authors, 2004). Figure 4A shows the peristimulus time histogram (PSTH) of a typical V1 simple cell, whose receptive field as measured from an m-sequence kernel is similar to those found in the literature—a Gabor-like function that translates over time (i.e., space-time inseparable). Superimposed on this is the predicted response generated by convolving the neuron's space-time receptive field (see Figure 4B) with the movie, and putting the result through a point-wise nonlinearity (including a gain factor and offset term). The neuron tends to exhibit sparse, punctate responses, some of which are predicted by the receptive field model and others not. In most cases, the model response undershoots the PSTH, and this cannot simply be addressed by increasing the gain or narrowing the response of the model, because there are many other episodes where the model predicts responses of equal magnitude in which there is little or no response from the neuron. One could possibly obtain a better fit to the data by including additional terms modeling suppression (Rust et al., 2004) and temporal adaptation (Lesica, Boloori, & Stanley, 2003), or even a spiking mechanism (Paninski, Pillow, & Simoncelli, 2004), but we believe it is useful to see how well the linear, driving term of the model alone fares under these circumstances. Moreover, these additions are essentially single-neuron mechanisms. What seems to be suggested by our initial informal observations of multiple simultaneously recorded units is that a more complex network nonlinearity is at work here, and that describing any one neuron's behavior will require including the influence of other simultaneously recorded neurons. An important lesson of these findings is that simply mapping out receptive fields does not provide a complete understanding of V1 response properties. For example, Ringach et al.
(2002) have shown that it is possible to map out receptive fields using natural scenes, and they show that it is even possible to recover some nonlinear effects such as cross-orientation inhibition with this technique. However, the resulting receptive field models were not tested by comparing their predictions to the actual activity of neurons in response to natural movies. Without doing so, it is difficult to assess how well such models capture the function of the neuron. Unfortunately, journals are often unprepared to publish results when a study demonstrates the failure of a model, unless the study also presents a competing model that works well. Part of this may seem understandable since a model might fail for a variety of reasons. However, until a benchmark is placed in the literature, it is impossible to determine how good a model actually is. And given the magnitude of the task before us, it could take years before a good model emerges. In the meantime, what would be most helpful is to accumulate a database of single-unit or multiunit data (stimuli and neural responses) that would allow modelers to test their best theory under ecological conditions. Finally, it should be noted that better success has been obtained in using receptive field models to predict the responses of neurons to natural scenes
in the LGN (Dan et al., 1996), or the response of cortical neurons to purely static images (Smyth et al., 2003), although they are still far from making perfect predictions. This would seem to suggest that much of the difficulty in predicting responses in cortex has to do with the effects of the massive, recurrent intracortical circuitry that is engaged during natural vision.

2.6 Summary. Table 1 presents a summary of the five problems we have identified with the current view of V1 that has emerged from the data collected to date, along with some of the solutions that we have suggested could possibly help in obtaining a more complete picture of V1 function. Given the limitations described above, is it possible to quantify how well we currently understand V1 function? We attempt to estimate this as follows:

[Fraction understood] = [Fraction of variance explained] × [Fraction of population recorded].

If we consider that roughly 40% of the population of neurons in V1 has actually been recorded from and characterized, together with our conjecture that 30% to 40% of the response variance of these neurons can be explained under natural conditions using the currently established models, then we are left to conclude that we can currently account for 12% to 16% of V1 function. Thus, approximately 85% of V1 function has yet to be explained (see Figure 5).^3

3 New Theories

Given the above observations, it becomes clear that there is so much unexplored territory that it is very difficult to rule out theories at this point (although there are some obvious bounds dictated by neural architecture, such as fan-in/fan-out and the spatial extent of axonal and dendritic arbors). In the sections below, we discuss some of the theories that are plausible given our current data. The goal here is not to provide a detailed review of the theories currently in the literature. Rather, it is to provide a few examples of the range of theories that are consistent with the experimental data. It must be emphasized that, considering that there may exist a large family of neurons with unknown properties and given the low level of prediction for the neurons studied, there is still considerable room for theories dramatically different from those presented here.
^3 We have primarily drawn on the Gallant lab's data for obtaining the percentage of variance explained, and so we are assuming that their methods for isolating neurons are subject to the same biases in sampling discussed earlier.
Table 1: Five Problems with the Current View of V1 and Some Possible Solutions for Obtaining a More Complete Picture.

Problem | Description | Solution
Biased Sampling | Large neurons; visually responsive neurons; neurons with high firing rates | Use chronically implanted electrodes; parallel recording arrays
Biased Stimuli | Use of reduced stimuli such as bars, spots, and gratings | Use natural scenes, ecologically relevant stimuli
Biased Theories | Simple/complex cells; data-driven theories | Consider more functional/computational theories that solve problems of vision
Interdependence and Context | Influence of intracortical input; effect of context; synchrony | Examine how context affects responses in natural scenes
Ecological Deviance | Responses to natural scenes deviate from predictions of standard models | Develop models that can account for responses to natural images
Figure 5: 85% of V1 function remains to be understood. (The plot shows variance explained, from 0 to 1.0, against the proportion of cells studied, from 0 to 1.0; with roughly 0.4 of cells studied and 0.3 to 0.4 of their variance explained, about 85% of the area remains unaccounted for.)
3.1 Dynamical Systems and the Limits of Prediction. Imagine tracking a single molecule within a hot gas as it interacts with the surrounding molecules. The particular trajectory of one molecule will be erratic and fundamentally unpredictable without knowledge of all other molecules with potential influence. Even if we presumed that the trajectory of the particular molecule was completely deterministic and followed simple laws, in a gas with large numbers of interacting molecules, one could never predict the path of a single molecule except over very short distances. In theory, the behavior of single neurons may have similar limitations. To make predictions of what a single neuron will do in the presence of a natural scene may be fundamentally impossible without knowledge of the surrounding neurons. The nonlinear dynamics of interacting neurons may put bounds on how accurately the behavior of any neuron can be predicted. And at this time, we cannot say where that limit may be. What is fascinating in many ways, then, is that neurons are as predictable as they are. For example, work from the Gallant lab has shown that under conditions where a particular natural scene sequence is repeated to a fixating macaque monkey, a neuron's response from trial to trial is fairly reliable (e.g., Vinje & Gallant, 2000). This clearly suggests that the response is dependent in large part on the stimulus, certainly more than for a molecule in the gas model. So how do we treat the variability that is not explained by the stimulus? We may find that the reliability of a local group of neurons is more predictable than a single neuron, which would then require multielectrode recording to attempt to account for the remaining variance. For example, Arieli et al. (1996) have shown that much of the intertrial variability may be explained in terms of large-scale fluctuations in ongoing activity
of the surrounding population of neurons measured using optical recording, and Fiser et al. (2004) have similarly shown that ongoing population activity as measured with multielectrode arrays is only loosely modulated by visual input. However, what role these large-scale fluctuations play in the normal processing of natural scenes has yet to be investigated.

3.2 Sparse, Overcomplete Representations. One effort to explain many of the nonlinearities found in V1 is based on the idea that neurons are attempting to achieve some degree of gain control (Geisler & Albrecht, 1992). Because any single neuron lacks the dynamic range to handle the range of contrasts in natural scenes, it is argued, the contrast response must be normalized. Here we provide a different line of reasoning to explain the observed response nonlinearities of V1 neurons (further details are provided by Olshausen & Field, 1997, and Field & Wu, 2004). We argue that the spatial nonlinearities serve primarily to reduce the linear dependencies that exist in an overcomplete code, and as we shall see, this leads to a fundamentally different set of predictions about the population activity. Consider the number of vectors needed to represent a particular set of data with dimensionality D (e.g., an 8 × 8 pixel image patch would have D = 64). No matter what form the data take, such data never require more than D linearly independent vectors to represent them. A system where data with dimensionality D are spanned by D vectors is described as critically sampled. Such critically sampled systems (e.g., orthonormal bases) are popular in the image coding community as they allow any input pattern to be represented uniquely, and the transform and its inverse are easily computed. The wavelet code, for example, has seen widespread use, and wavelet-like codes similar to those of the visual system have been shown to provide very high efficiency in terms of sparsity when coding natural scenes (e.g., Field, 1987).
Some basic versions of independent component analysis (ICA) also attempt to find a critically sampled basis that minimizes the dependencies among the vectors, and the result is a wavelet-like code with tuning much like the neurons in V1 (Bell & Sejnowski, 1997; van Hateren & van der Schaaf, 1998). However, the visual system is not using a critically sampled code. In cat V1, for example, there are 25 times as many output fibers as there are input fibers from the LGN, and in macaque V1, the ratio is on the order of 50 to 1. Such overcomplete codes have one potential problem: the vectors are not linearly independent. Thus, if neurons were to compute their output simply from the inner product between their weight vector and the input, their responses would be correlated. Figure 6A shows an example of a two-dimensional data space represented by three neurons with linearly dependent weight vectors. Even assuming the outputs of these units are half-rectified so they produce only positive values, the data are redundantly represented by such a code. The only way to remove this linear dependence is through a nonlinear transform.
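The linear dependence of an overcomplete code can be seen in a small simulation. This is a hypothetical setup matching the geometry of Figure 6A: a 2D stimulus space represented by three half-rectified units whose weight vectors are 120 degrees apart, so the three vectors sum to zero and the raw drives are exactly linearly dependent.

```python
import numpy as np

rng = np.random.default_rng(0)

# Three unit-length weight vectors in a 2D stimulus space (D = 2).
# Three vectors in two dimensions are necessarily linearly dependent;
# at 120-degree spacing they sum to zero exactly.
angles = np.deg2rad([90.0, 210.0, 330.0])
W = np.stack([np.cos(angles), np.sin(angles)], axis=1)   # shape (3, 2)

X = rng.standard_normal((2, 10000))    # random 2D stimuli
drives = W @ X                         # raw linear drives, shape (3, 10000)
R = np.maximum(drives, 0.0)            # half-rectified responses

# Exact linear dependence: the three drives sum to zero for every stimulus.
print(np.abs(drives.sum(axis=0)).max())   # ~0 up to floating-point error

# Even after rectification, the responses remain statistically dependent:
# the off-diagonal entries of the correlation matrix are clearly nonzero.
print(np.corrcoef(R))
```

As the text argues, rectification alone does not remove this redundancy; only a nonlinear transform such as the curved iso-response contours of Figure 6B can.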
Figure 6: Overcomplete representation. (A) The iso-response contours of three linear neurons (with half-wave rectification) having linearly dependent weight vectors. A stimulus falling anywhere along a given contour will result in the same response from the neuron. A stimulus falling in the upper half-plane will result in responses on all three neurons, even though only two would be required to uniquely determine its position in the space. (B) Curving the response contours removes redundancy among these neurons. Now only two neurons will code for a stimulus anywhere in this space. (C) A full tiling of the 2D stimulus space now requires eight neurons, which would be overcomplete as a linear code, but critically sampled given this form of nonlinear response.
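The curved iso-response contours of Figure 6B can be produced by divisive normalization, one simple nonlinearity with this geometry. The sketch below is an assumed form for illustration, not the authors' specific model: a stimulus orthogonal to the weight vector evokes no response on its own, yet suppresses the response to a simultaneously presented preferred stimulus.

```python
import numpy as np

def response(x, w=np.array([1.0, 0.0]), sigma=0.5):
    """Half-rectified linear drive divided by stimulus magnitude.

    Dividing by ||x|| bends the iso-response contours away from
    straight lines, as in Figure 6B (illustrative parameters).
    """
    drive = max(w @ x, 0.0)
    return drive / (sigma + np.linalg.norm(x))

preferred  = np.array([1.0, 0.0])    # aligned with the weight vector
orthogonal = np.array([0.0, 1.0])    # orthogonal: no drive by itself

r_pref = response(preferred)
r_orth = response(orthogonal)
r_both = response(preferred + orthogonal)

print(r_orth)            # 0.0: the orthogonal stimulus alone does nothing
print(r_both < r_pref)   # True: yet it suppresses the driven response
```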
One of the nonlinear transforms that will serve this goal is shown in Figure 6B. Here, we show the iso-response curves for the same three neurons. This curvature represents an unusual nonlinearity. For example, consider the responses of a unit to two different stimuli: the first stimulus aligned with the neuron’s weight vector and a second stimulus separated by 90 degrees. The second stimulus will have no effect on the neuron on its own since its vector is orthogonal to that of the neuron. However, when added to the first vector, the combined stimulus will be on a lower iso-response curve (i.e., the neuron will have reduced its activity). In other words, the response curvature of the neuron results in a nonlinearity with the characteristic nonclassical, suppressive behavior: stimuli that on their own have no effect on the neuron (stimuli orthogonal to the principal direction of the neuron) can modulate the behavior of an active neuron. This general nonlinearity comes in several forms and includes end-stopping and cross-orientation inhibition, and is what is typically meant by the term nonclassical surround. Indeed, as Zetzsche, Krieger, and Wegmann (1999) note, this curvature is simply a geometrical interpretation of such behaviors. With the addition of a compressive nonlinearity, this curvature results in the behavior described as contrast normalization. In contrast to the gain control or divisive normalization theory, we argue that the nonlinearities observed in V1 neurons are present primarily to allow a large (overcomplete) population of neurons to represent data using a small number of active units, a process we refer to as sparsification. The goal is not to develop complete independence, as the activity of any neuron
partially predicts the lack of activity in neighboring neurons. However, the code allows for expanding the dimensionality of the representation without incurring the linear dependencies that would be present in a nonorthogonal code. Importantly, this model predicts that the nonlinearities are a function of the angle between the neuron's weight vector and those surrounding it. Future multielectrode recordings may make it possible to test this theory. From the computational end, we have found that our sparse coding network (Olshausen & Field, 1996, 1997) produces nonlinearities much like those proposed. Our hope, then, is that many of the nonlinearities that have been observed in V1 can eventually be explained within one general framework of efficient coding.

3.3 Contour Integration. There is now considerable physiological and anatomical evidence showing that V1 neurons have a rather selective connection pattern both within and between layers. For example, research investigating the lateral projections of pyramidal neurons in V1 has shown that the long-range lateral connections project primarily to regions of the cortex with similar orientation columns, as well as to similar ocular dominance columns and cytochrome oxidase blobs (Malach, Amir, Harel, & Grinvald, 1993; Yoshioka, Blasdel, Levitt, & Lund, 1996). Early studies exploring the horizontal connections in V1 discovered that selective long-range connections extend laterally for 2 to 5 mm parallel to the surface (Gilbert & Wiesel, 1979), and studies on the tree shrew (Rockland & Lund, 1983; Bosking, Zhang, Schofield, & Fitzpatrick, 1997), primate (e.g., Malach et al., 1993; Sincich & Blasdel, 2001), ferret (Ruthazer & Stryker, 1996), and cat (e.g., Gilbert & Wiesel, 1989) have all demonstrated significant specificity in the projection of these lateral connections.
A number of neurophysiological studies also show that collinearly oriented stimuli presented outside the classical receptive field have a facilitatory effect (Kapadia, Ito, Gilbert, & Westheimer, 1995; Kapadia, Westheimer, & Gilbert, 2000; Polat, Mizobe, Pettet, Kasamatsu, & Norcia, 1998). The results demonstrate that when a neuron is presented with an oriented stimulus within its receptive field, a second collinear stimulus will sometimes increase the response rate of the neuron, while the same oriented stimulus presented orthogonal to the main axis of orientation (displaced laterally) will produce inhibition, or at least less facilitation. These results suggest that V1 neurons have an orientation- and position-specific connectivity structure beyond what is usually included in the standard model. One line of research suggests that this connectivity helps resolve the ambiguity of contours in scenes and is involved in the process of contour integration (e.g., Field, Hayes, & Hess, 1993). This follows from work showing that the amplification of locally coaligned, oriented elements provides an effective means of identifying contours in natural scenes (Parent & Zucker, 1989; Sha'ashua & Ullman, 1988; Ben-Shahar & Zucker, 2004). This
type of mechanism could work in concert with the sparsification nonlinearities mentioned above, since the facilitatory interactions would primarily occur among elements that are nonoverlapping—that is, receptive fields whose weight vectors are orthogonal. An alternative theoretical perspective is that the effect of these orientation- and position-specific connections should be mainly suppressive, with the goal of removing dependencies among neurons that arise due to the structure in natural images (Schwartz & Simoncelli, 2001). In contrast to the contour integration hypothesis, which proposes that the role of horizontal connections is to amplify the structure of contours, this model would attempt to attenuate the presence of such structure in the V1 representation. Although this may be a desirable outcome in terms of redundancy reduction, we would argue that the cortex has objectives other than redundancy reduction per se (Barlow, 2001). Chief among these is to provide a meaningful representation of image structure that can be easily read out and interpreted by higher-level areas. Finally, it is important to note, with respect to the discussion in the previous section, that the type of redundancy we are talking about here is due to long-range structure in images beyond the size of a receptive field, not that which is simply due to the overlap among receptive fields. Thus, we propose that the latter should be removed via sparsification, while the former should be amplified by the long-range horizontal connections in V1.

3.4 Surface Representation. We live in a three-dimensional world, and the fundamental causes of images that are of behavioral relevance are surfaces, not two-dimensional features such as spots, bars, edges, or gratings. Moreover, we rarely see the surface of an object in its entirety. Occlusion is the rule, not the exception, in natural scenes.
It thus seems quite reasonable to think that the visual cortex has evolved effective means to parse images in terms of the three-dimensional structure of the environment: surface structure, foreground-background relationships, and so forth. Indeed, there is now a strong body of psychophysical evidence showing that 3D surfaces and figure-ground relationships constitute a fundamental aspect of intermediate-level representation in the visual system (Nakayama et al., 1995; see also Figure 7). Nevertheless, it is surprising how little V1 physiology has actually been devoted to the subject of three-dimensional surface representation. Some recent studies in extrastriate cortex have begun to yield interesting findings (Nguyenkim & DeAngelis, 2003; Zhou, Friedman, & von der Heydt, 2000; Bakin, Nakayama, & Gilbert, 2000), but V1’s involvement in surface representation remains a mystery. Although many V1 neurons are disparity selective, this by itself does not tell us how surface structure is represented or how figure-ground relationships of the sort depicted in Figure 7 are resolved.
Figure 7: The three line strokes at left are interpreted as different objects depending on the arrangement of occluders. Thus, pattern completion depends on resolving figure-ground relationships. At what level of processing is this form of completion taking place? Since it would seem to demand access to high-resolution detail in the image, it cannot simply be relegated to high-level areas.
At first sight, it may seem preposterous to suppose that V1 is involved in computing three-dimensional surface representations. But again, given how little we actually know about V1, combined with the importance of 3D surface representations for guiding behavior, it is a plausible hypothesis to consider. In addition, problems such as occlusion demand resolving figure-ground relationships in a relatively high-level representation where topography is preserved (Lee & Mumford, 2003). Physiological evidence supporting this idea is now beginning to emerge. Neurons in V1 have been shown to produce a differential response to the figure versus background in a scene of texture elements (Lamme, 1995; Zipser, Lamme, & Schiller, 1996), and a substantial fraction of neurons in V1 are selective for border ownership (Zhou et al., 2000). In addition, Lee, Mumford, Romero, and Lamme (1998) have demonstrated evidence for a medial axis representation of surfaces in which V1 neurons become most active along the skeletal axis of an object. It seems quite possible that such findings are just the tip of the iceberg.

3.5 Top-Down Feedback and Disambiguation. Although our perception of the visual world is usually quite clear and unambiguous, the raw image data that we start out with is not. Looking back at Figure 3, one can see that even the presence of a simple contour can be ambiguous in a natural scene. The problem is that information at the local level is insufficient to determine whether a change in luminance is due to an object boundary, simply part of a texture, or a change in reflectance. Although boundary junctions are also quite crucial to the interpretation of a scene, a number of studies have shown that human observers are poor judges of what constitutes a boundary or junction when these features are shown in isolation (Elder, Beniaminov, & Pintilie, 1999; McDermott, 2004). Thus, the calculation of what forms a boundary is dependent on the context, which provides
information about the assignment of figure and ground, surface layout, and so forth. Arriving at the correct interpretation of an image, then, constitutes something of a chicken-and-egg problem between lower and higher levels of image analysis. The low-level shape features that are useful for identifying an object—edges, contours, surface curvature, and the like—are typically ambiguous in natural scenes, so they cannot be computed directly based on a local analysis of the image. Rather, they must be inferred based on global context and higher-level knowledge. However, the global context itself will not be clear until there is some degree of certainty about the presence of low-level shape features. A number of theorists have thus argued that recognition depends on information circulating through corticocortical feedback loops in order to disambiguate representations at both lower and higher levels in parallel (Mumford, 1994; Ullman, 1995; Lewicki & Sejnowski, 1996; Rao & Ballard, 1999; Young, 2000; Lee & Mumford, 2003; Hawkins & Blakeslee, 2004). An example of disambiguation at work in the visual cortex can be seen in the resolution of the aperture problem in computing the direction of motion. Because receptive fields limit the field of view of a neuron to just a portion of an object, it is not possible for any one neuron to signal with certainty the true direction of the object in a purely bottom-up fashion. Pack, Berezovskii, and Born (2001) have shown that the initial phase of response of neurons in MT signals the direction of motion directly orthogonal to a contour and that the later phase of the response reflects the actual direction of the object that the contour is part of, presumably from the interaction with other neurons viewing other parts of the object. Interestingly, this effect does not occur under anesthesia. A similar delayed-response effect has been demonstrated in end-stopped V1 neurons as well (Pack, Livingstone, Duffy, & Born, 2003).
Recent evidence from fMRI points to a disambiguation process occurring in V1 during shape perception (Murray, Kersten, Olshausen, Schrater, & Woods, 2002). Subjects viewed a translating diamond that was partially occluded so that the vertices were invisible, resulting in a bistable percept in which the line segments forming the diamond are seen moving independently in one case and coherently in the direction of the object motion in the other case. When subjects experience the coherent motion and shape percept, activity in lateral occipital cortex (LOC) increases while activity in V1 decreases. This is consistent with the idea that when neurons in LOC are representing the diamond, they feed back this information to V1 so as to refine the otherwise ambiguous representations of contour motion. If the refinement of activity attenuates the many incorrect responses while amplifying the few that are consistent with the global percept, the net effect could be a reduction in the BOLD signal measured by fMRI. An alternative interpretation for the reduction in V1 is based on the idea of predictive coding (Rao & Ballard, 1999), in which higher areas actually subtract their predictions from lower areas.
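The predictive coding interpretation can be made concrete with a minimal numerical sketch. This is our illustration, not Rao and Ballard's model; the dimensions, weights, and noise level are arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy predictive-coding step (our illustration; the dimensions and weights
# are arbitrary): a higher area with generative weights W predicts lower-level
# activity x, and the lower area transmits only the residual.
W = rng.standard_normal((20, 5))             # 5 hidden causes -> 20 "V1" units
y = rng.standard_normal(5)                   # higher-level representation
x = W @ y + 0.01 * rng.standard_normal(20)   # input that y explains well

residual = x - W @ y                         # what remains after feedback subtraction

# When the higher area accounts for the stimulus, the residual carried by
# the lower area is far smaller than the raw input, one possible reading of
# the reduced V1 signal during coherent shape perception.
print(np.linalg.norm(residual) / np.linalg.norm(x))
```

If the higher-level explanation is removed (set `y` to zero when forming the prediction), the residual equals the full input, so on this reading the amount of V1 activity indexes how much of the stimulus remains unexplained.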
1690
B. Olshausen and D. Field
There exists a rich set of feedback connections from higher levels into V1, but little is known about the computational role of these connections. Recent experiments in which higher areas are cooled to look at the effect on activity in lower areas seem to suggest that these connections play a role in enhancing the salience of stimuli (Hupe et al., 1998), and Shapley (2004) has concluded that top-down feedback is necessary to account for the spatial extent of surround inhibition. But we would argue that feedback has a far more important role to play in disambiguation, and as far as we know, no one has yet investigated the effect of feedback using such cooling techniques under conditions that would require disambiguation (e.g., natural scenes).

3.6 Dynamic Routing. A challenging problem faced by any visual system is that of forming object representations that are invariant to position, scale, rotation, and other common deformations of the image data. The currently accepted, traditional view is that complex cells constitute the first stage of invariant representation by summing over the outputs of simple cells whose outputs are half-rectified and squared—the classical “energy model” (Adelson & Bergen, 1985). In this way, the neuron’s response changes only gradually as an edge is passed over its receptive field. This idea forms the basis of so-called Pandemonium models, in which a similar feature extraction and pooling process is essentially repeated at each stage of visual cortex (see Tarr, 1999, for a review). However, the Pandemonium model cannot provide a complete account of perception because it does not preserve information about relative phase or the spatial relationships among features. Clearly, though, we have conscious access to this information.
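The energy model itself is simple enough to write down. The sketch below is our illustration (the 1-D Gabor filters and grating stimulus are arbitrary choices), showing the phase invariance that makes the pooled response change only gradually as an edge passes over the receptive field:

```python
import numpy as np

# A 1-D sketch of the classical energy model (our illustration; filter and
# stimulus parameters are arbitrary). A quadrature pair of Gabor filters
# stands in for two simple cells; half-rectifying and squaring the pair and
# their sign-flipped twins, then summing, gives the complex-cell "energy".
t = np.linspace(-3, 3, 301)
sigma, f = 1.0, 1.0
envelope = np.exp(-t**2 / (2 * sigma**2))
g_even = envelope * np.cos(2 * np.pi * f * t)   # even-symmetric simple cell
g_odd = envelope * np.sin(2 * np.pi * f * t)    # odd-symmetric simple cell

def complex_cell(stimulus):
    # Sum of half-rectified, squared simple-cell outputs.
    responses = (g @ stimulus for g in (g_even, -g_even, g_odd, -g_odd))
    return sum(np.maximum(r, 0.0) ** 2 for r in responses)

# Slide a grating past the receptive field: the energy response is nearly
# constant across phase, i.e., invariant to the exact position of the edge.
energies = [complex_cell(np.cos(2 * np.pi * f * t + phase))
            for phase in np.linspace(0, 2 * np.pi, 8, endpoint=False)]
print(max(energies) / min(energies))  # very close to 1
```

Because [r]+² + [−r]+² = r², the four half-rectified units implement the sum of squares of the quadrature pair, which is what removes the phase dependence.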
The ability to navigate, grasp, and interact with unfamiliar objects implies that we have the ability to perceive spatial relationships among features without ever doing “object recognition.” In addition, resolving figure-ground relationships and occlusion demands that higher levels of analysis have access to information about spatial relationships as well. One of us has proposed a model for forming invariant representations that preserves relative spatial relationships by explicitly routing information at each stage of processing (Olshausen, Anderson, & Van Essen, 1993). Rather than being passively pooled, information is dynamically linked from one stage to the next by a set of control neurons that progressively remap it into an object-centered reference frame. It is thus proposed that there are two distinct classes of neurons: those conveying image and feature information and those controlling the flow of information. The former class corresponds to the invariant part, the latter to the variant part. The two are combined multiplicatively, so that mathematically the model is equivalent to a bilinear model (e.g., Tenenbaum & Freeman, 2000; Grimes & Rao, 2005). Is it possible that dynamic routing occurs in V1 and underlies the observed shift-invariant properties of complex cells? If so, there are at least
two things we would expect to see: (1) that at any given moment, a complex cell is effectively connected to only one or a small fraction of the simple cells to which it is physically connected, and (2) that there are control neurons that dynamically gate these connections. Interestingly, the observed invariance properties of complex cells are just as consistent with the idea of routing as they are with pooling. What could distinguish between these models is the population activity: if the complex cell outputs are the result of passive pooling, then one would expect a dense, distributed representation of contours among the population of complex cells. If information is dynamically routed, though, the representation at the complex cell level would remain sparse. The control neurons, on the other hand, would look something like contrast-normalized simple cells, which represent phase independent of magnitude (Zetzsche & Rohrbein, 2001). One of the main predictions of the dynamic routing model is that the receptive fields of the invariant neurons would be expected to shift depending on the state of the control neurons. Such effects have been seen in V4, where some neurons shift their receptive fields depending on where the animal is directing its attention (Moran & Desimone, 1985; Connor, Preddie, Gallant, & Van Essen, 1997). And in V1, Brad Motter has shown that neurons appear to shift their receptive fields in order to compensate for the small eye movements that occur during fixation (Motter & Poggio, 1990; Motter, 1995), although Gur and Snodderly (1997) provide evidence to the contrary. Thus, there exists some evidence for dynamic routing in visual cortex, but further experiments are needed in order to characterize how and to what extent this occurs in V1 under normal viewing conditions.
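The multiplicative combination of feature and control signals described above can be sketched as a bilinear model in a few lines. This is a toy version of the idea rather than the 1993 circuit; the circular shift matrices and one-hot control vectors are our simplifying assumptions:

```python
import numpy as np

# Toy bilinear routing model (our sketch of the idea, not the 1993 circuit):
# the output is y = sum_k c_k (W_k x), where each W_k is a pure shift matrix
# and the control neurons c select which shift is currently routed.
n = 8
# W_k shifts a vector circularly by k positions: W_k @ v == np.roll(v, k).
shifts = [np.roll(np.eye(n), k, axis=0) for k in range(n)]

def route(x, c):
    # Multiplicative (bilinear) combination of features x and control c.
    return sum(ck * Wk @ x for ck, Wk in zip(c, shifts))

x = np.zeros(n)
x[2] = 1.0                                 # a feature in object-centered coordinates
for k in range(n):
    retinal = np.roll(x, -k)               # the same feature at a shifted retinal position
    c = np.zeros(n)
    c[k] = 1.0                             # control neurons compensate for the shift
    assert np.allclose(route(retinal, c), x)   # routed output is position invariant
```

With a one-hot control vector, the bilinear sum reduces to a single gated pathway, which is the sense in which a complex cell would be effectively connected to only a fraction of its physical inputs at any given moment.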
4 Conclusion

Our goal in this review has been to point out that there are still substantial gaps in our knowledge of V1 function and, more important, that there is more room for new theories to be considered than the current conventional wisdom might allow. We have identified five specific problems with the current view of V1, emphasizing the need for using natural scenes in experiments, in addition to multiunit recording methods, in order to obtain a more representative picture of V1 function. While the single-unit, structuralist approach has been a useful enterprise for getting a handle on basic response properties, we feel that its usefulness as a tool for investigating V1 function has been nearly exhausted. It is now time to dig deeper, using richer, ecologically relevant experimental paradigms, and developing theories that can help to elucidate how the cortex performs the computationally challenging problems of vision. As we explore the response properties of V1 neurons using natural scenes, we are likely to uncover some interesting new phenomena that defy explanation with current models. It is at this point that we should be prepared to revisit the structuralist approach in order to tease apart what is going on.
Reductionism does have its place, but it needs to be motivated by functionally and ecologically relevant questions, similar to the European tradition in ethology (Tinbergen, 1972). At what point will we actually understand V1? This is obviously a difficult question to answer, but we believe at least three ingredients are required: (1) an unbiased sample of neurons of all types, firing rates, and layers of V1; (2) the ability to observe simultaneously the activities of hundreds of neurons in a local population; and (3) the ability to predict, or at least qualitatively model, the responses of the population under natural viewing conditions. Given the extensive feedback connections into V1, in addition to the projections from pulvinar and other sources, it seems unlikely that we will ever understand V1 in isolation. Thus, our investigations must also be guided by how V1 fits into the bigger picture of thalamo-cortical function.

Acknowledgments

We thank Bill Skaggs for discussions on hippocampal physiology, Charlie Gray and Jonathan Baker for sharing preliminary data, Jack Gallant for clarifying the issues involved in predicting neural responses, and Jeff Johnson and Issac Trotts for comments on the manuscript. We also thank the two anonymous reviewers for providing many useful suggestions and pointers to relevant literature. This work was supported by NGIA contract HM 1582-05-C-0007 to D.J.F. Many of the ideas in this review were first developed in Olshausen, B. A., & Field, D. J. (2005). In T. J. Sejnowski & L. van Hemmen (Eds.), Twenty-three problems in systems neuroscience. New York: Oxford University Press.

References

Adelson, E. H., & Bergen, J. R. (1985). Spatiotemporal energy models for the perception of motion. Journal of the Optical Society of America, A, 2, 284–299. Ahmed, B., Anderson, J. C., Douglas, R. J., Martin, K. A., & Nelson, J. C. (1994). Polyneuronal innervation of spiny stellate neurons in cat visual cortex. J. Comp. Neurol., 341, 39–49.
Amedi, A., Raz, N., Pianka, P., Malach, R., & Zohary, E. (2003). Early ‘visual’ cortex activation correlates with superior verbal memory performance in the blind. Nat. Neurosci., 6, 758–766. Anderson, S. J., Burr, D. C., & Morrone, M. C. (1991). Two-dimensional spatial and spatial-frequency selectivity of motion-sensitive mechanisms in human vision. Journal of the Optical Society of America A, 8, 1340–1351. Arieli, A., Sterkin, A., Grinvald, A., & Aertsen, A. (1996). Dynamics of ongoing activity: Explanation of the large variability in evoked cortical responses. Science, 273, 1868–1871. Attwell, D., & Laughlin, S. B. (2001). An energy budget for signaling in the grey matter of the brain. J. Cereb. Blood Flow Metab., 21, 1133–1145.
Azouz, R., & Gray, C. M. (2000). Dynamic spike threshold reveals a mechanism for synaptic coincidence detection in cortical neurons in vivo. Proc. Natl. Acad. Sci. U.S.A., 97, 8110–8115. Azouz, R., & Gray, C. M. (2003). Adaptive coincidence detection and dynamic gain control in visual cortical neurons in vivo. Neuron, 37, 513–523. Baddeley, R., Abbott, L. F., Booth, M. C. A., Sengpiel, F., Freeman, T., Wakeman, E. A., & Rolls, E. T. (1997). Responses of neurons in primary and inferior temporal visual cortices to natural scenes. Proc. R. Soc. Lond. B, 264, 1775–1783. Bakin, J. S., Nakayama, K., & Gilbert, C. D. (2000). Visual responses in monkey areas V1 and V2 to three-dimensional surface configurations. J. Neurosci., 20, 8188–8198. Barlow, H. B. (2001). Redundancy reduction revisited. Network: Computation in Neural Systems, 12, 241–253. Barnes, C. A., Skaggs, W. E., McNaughton, B. L., Haworth, M. L., Permenter, M., Archibeque, M., & Erickson, C. A. (2003). Chronic recording of neuronal populations in the temporal lobe of awake young adult and geriatric primates. Program No. 518.8. Abstract Viewer/Itinerary Planner. Washington, DC: Society for Neuroscience. Bell, A. J., & Sejnowski, T. J. (1997). The independent components of natural images are edge filters. Vision Research, 37, 3327–3338. Ben-Shahar, O., & Zucker, S. W. (2004). Geometrical computations explain projection patterns of long-range horizontal connections in visual cortex. Neural Computation, 16, 445–476. Blakemore, C., & Campbell, F. W. (1969). On the existence of neurones in the human visual system selectively sensitive to the orientation and size of retinal images. J. Physiol., 203, 237–260. Bosking, W. H., Zhang, Y., Schofield, B., & Fitzpatrick, D. (1997). Orientation selectivity and the arrangement of horizontal connections in tree shrew striate cortex. Journal of Neuroscience, 17, 2112–2127. Carandini, M., Heeger, D. J., & Movshon, J. A. (1997). 
Linearity and normalization in simple cells of the macaque primary visual cortex. Journal of Neuroscience, 17, 8621–8644. Cavanaugh, J. R., Bair, W., & Movshon, J. A. (2002a). Nature and interaction of signals from the receptive field center and surround in macaque V1 neurons. J. Neurophys., 88, 2530–2546. Cavanaugh, J. R., Bair, W., & Movshon, J. A. (2002b). Selectivity and spatial distribution of signals from the receptive field surround in macaque V1 neurons. J. Neurophys., 88, 2547–2556. Chung, S., & Ferster, D. (1998). Strength and orientation tuning of the thalamic input to simple cells revealed by electrically evoked cortical suppression. Neuron, 20, 1177–1189. Connor, C. E., Preddie, D. G., Gallant, J. L., & Van Essen, D. C. (1997). Spatial attention effects in macaque area V4. J. Neurosci., 17, 3201–3214. Dan, Y., Atick, J. J., & Reid, R. C. (1996). Efficient coding of natural scenes in the lateral geniculate nucleus: Experimental test of a computational theory. Journal of Neuroscience, 16, 3351–3362. David, S. V., Vinje, W. E., & Gallant, J. L. (2004). Natural stimulus statistics alter the receptive field structure of V1 neurons. J. Neurosci., 24, 6991–7006.
De Valois, R. L., Albrecht, D. G., & Thorell, L. G. (1982). Spatial frequency selectivity of cells in macaque visual cortex. Vision Res., 22, 545–559. Edin, F., Machens, C. K., Schutze, H., & Herz, A. V. (2004). Searching for optimal sensory signals: Iterative stimulus reconstruction in closed-loop experiments. J. Comput. Neurosci., 17, 47–56. Elder, J. H., Beniaminov, D., & Pintilie, G. (1999). Edge classification in natural images. Investigative Ophthalmology and Visual Science, 40, S357. Field, D. J. (1987). Relations between the statistics of natural images and the response properties of cortical cells. J. Opt. Soc. Am. A, 4, 2379–2394. Field, D. J., Hayes, A., & Hess, R. F. (1993). Contour integration by the human visual system: Evidence for a local “association field.” Vision Research, 33, 173–193. Field, D. J., & Tolhurst, D. (1986). The structure and symmetry of simple-cell receptive-field profiles in the cat’s visual cortex. Proc. R. Soc. Lond. B. Biol. Sci., 228, 379–400. Field, D. J., & Wu, M. (2004). An attempt towards a unified account of non-linearities in visual neurons. Journal of Vision, 4, 283a. Fiser, J., Chiu, C., & Weliky, M. (2004). Small modulation of ongoing cortical dynamics by sensory input during natural vision. Nature, 431, 573–578. Foldiak, P., Xiao, D., Keysers, C., Edwards, R., & Perrett, D. I. (2004). Rapid serial visual presentation for the determination of neural selectivity in area STSa. Prog. Brain Res., 144, 107–116. Geisler, W. S., & Albrecht, D. G. (1992). Cortical neurons: Isolation of contrast gain control. Vision Research, 32, 1409–1410. Geisler, W. S., & Albrecht, D. G. (1997). Visual cortex neurons in monkeys and cats: Detection, discrimination and identification. Visual Neuroscience, 14, 897–919. Gilbert, C. D., & Wiesel, T. N. (1979). Morphology and intracortical projections of functionally characterised neurones in the cat visual cortex. Nature, 280, 120–125. Gilbert, C. D., & Wiesel, T. N. (1989).
Columnar specificity of intrinsic horizontal and corticocortical connections in cat visual cortex. J. Neurosci., 9, 2432–2442. Graham, N., & Nachmias, J. (1971). Detection of grating patterns containing two spatial frequencies: A test of single-channel and multiple-channels models. Vision Research, 11, 251–259. Gray, C. M., König, P., Engel, A. K., & Singer, W. (1989). Oscillatory responses in cat visual cortex exhibit inter-columnar synchronization which reflects global stimulus properties. Nature, 338, 334–337. Grimes, D. B., & Rao, R. P. (2005). Bilinear sparse coding for invariant vision. Neural Comput., 17, 47–73. Gur, M., & Snodderly, D. M. (1997). Visual receptive fields of neurons in primary visual cortex (V1) move in space with the eye movements of fixation. Vision Res., 37, 257–265. Häusser, M., & Mel, B. (2003). Dendrites: Bug or feature? Current Opinion in Neurobiology, 13, 372–383. Hawkins, J., & Blakeslee, S. (2004). On Intelligence. New York: Holt. Heeger, D. J. (1991). Computational model of cat striate physiology. In M. S. Landy & J. A. Movshon (Eds.), Computational models of visual processing (pp. 119–133). Cambridge, MA: MIT Press. Heeger, D. J., & Bergen, J. R. (1995, August). Pyramid based texture analysis/synthesis. Computer Graphics Proceedings, 229–238.
Hubel, D. H., & Wiesel, T. N. (1959). Receptive fields of single neurones in the cat’s striate cortex. J. Physiol., 148, 574–591. Hubel, D. H., & Wiesel, T. N. (1968). Receptive fields and functional architecture of monkey striate cortex. J. Physiol., 195, 215–243. Hupe, J. M., James, A. C., Payne, B. R., Lomber, S. G., Girard, P., & Bullier, J. (1998). Cortical feedback improves discrimination between figure and background by V1, V2 and V3 neurons. Nature, 394, 784–787. Jones, J. P., & Palmer, L. A. (1987). An evaluation of the two-dimensional Gabor filter model of simple receptive fields in cat striate cortex. J. Neurophysiol., 58, 1233– 1258. Jung, M. W., & McNaughton, B. L. (1993). Spatial selectivity of unit activity in the hippocampal granular layer. Hippocampus, 3, 165–182. Kagan, I., Gur, M., & Snodderly, D. M. (2002). Spatial organization of receptive fields of V1 neurons of alert monkeys: Comparison with responses to gratings. J. Neurophysiol., 88, 2557–2574. Kapadia, M. K., Ito, M., Gilbert, C. D., & Westheimer, G. (1995). Improvement in visual sensitivity by changes in local context: Parallel studies in human observers and in V1 of alert monkeys. Neuron, 15, 843–856. Kapadia, M. K., Westheimer, G., & Gilbert, C. D. (2000). Spatial distribution of contextual interactions in primary visual cortex and in visual perception. Journal of Neurophysiology, 84, 2048–2062. Keysers, C., Xiao, D. K., Foldiak, P., & Perrett, D. I. (2001). The speed of sight. J. Cogn. Neurosci., 13, 90–101. Knierim, J. J., & Van Essen, D. C. (1992). Neuronal responses to static texture patterns in area V1 of the alert macaque monkey. J. Neurophys., 67, 961–980. Knill, D. C., Field, D., & Kersten, D. (1990). Human discrimination of fractal images. J. Opt. Soc. Am. A, 7, 1113–1123. Lamme, V. A. (1995). The neurophysiology of figure-ground segregation in primary visual cortex. J. Neurosci., 15, 1605–1615. Lee, T. S., & Mumford, D. (2003). 
Hierarchical Bayesian inference in the visual cortex. J. Opt. Soc. Am. A, 20, 1434–1448. Lee, T. S., Mumford, D., Romero, R., & Lamme, V. A. (1998). The role of the primary visual cortex in higher level vision. Vision Res., 38, 2429–2454. Legendy, C. R., & Salcman, M. (1985). Bursts and recurrences of bursts in spike trains of spontaneously active striate cortex neurons. J. Neurophysiol., 53, 926– 939. Lennie, P. (2003a). Receptive fields. Curr. Biol., 13, R216–219. Lennie, P. (2003b). The cost of cortical computation. Curr. Biol., 13, 493–497. Lesica, N. A., Boloori, A. S., & Stanley, G. B. (2003). Adaptive encoding in the visual pathway. Network, 14, 119–135. Lewicki, M. S., & Sejnowski, T. J. (1996). Bayesian unsupervised learning of higher order structure. In M. Mozer, M. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9. Cambridge, MA: MIT Press. Malach, R., Amir, Y., Harel, M., & Grinvald, A. (1993). Relationship between intrinsic connections and functional architecture revealed by optical imaging and in vivo targeted biocytin injections in primate striate cortex. Proc. Natl. Acad. Sci. USA, 90, 10469–10473.
Maldonado, P., Babul, C., Singer, W., Rodriguez, E., & Grün, S. (2004). Synchrony and oscillations in primary visual cortex of monkeys viewing natural images. Manuscript submitted for publication. Mata, M. L., & Ringach, D. L. (2005). Spatial overlap of “on” and “off” subregions and its relation to response modulation ratio in macaque primary visual cortex. J. Neurophys., 93, 919–928. McDermott, J. (2004). Psychophysics with junctions in real images. Perception, 33, 1101–1127. Mechler, F., & Ringach, D. L. (2002). On the classification of simple and complex cells. Vision Res., 42, 1017–1033. Moran, J., & Desimone, R. (1985). Selective attention gates visual processing in the extrastriate cortex. Science, 229, 782–784. Motter, B. C. (1995). Receptive field border stabilization during visual fixation. Investigative Ophthalmology and Visual Science, 36, S691. Motter, B. C., & Poggio, G. F. (1990). Dynamic stabilization of receptive fields of cortical neurons (V1) during fixation of gaze in the macaque. Exp. Brain Res., 83, 37–43. Mumford, D. (1994). Neuronal architectures for pattern-theoretic problems. In C. Koch & J. L. Davis (Eds.), Large scale neuronal theories of the brain (pp. 125–152). Cambridge, MA: MIT Press. Murray, S. O., Kersten, D., Olshausen, B. A., Schrater, P., & Woods, D. L. (2002). Shape perception reduces activity in human primary visual cortex. Proceedings of the National Academy of Sciences, USA, 99(23), 15164–15169. Nakayama, K., He, Z. J., & Shimojo, S. (1995). Visual surface representation: A critical link between lower-level and higher-level vision. In S. M. Kosslyn & D. N. Osherson (Eds.), An invitation to cognitive science (pp. 1–70). Cambridge, MA: MIT Press. Nguyenkim, J. D., & DeAngelis, G. C. (2003). Disparity-based coding of three-dimensional surface orientation by macaque middle temporal neurons. J. Neurosci., 23(18), 7117–7128. Nykamp, D. Q., & Ringach, D. L. (2002).
Full identification of a linear-nonlinear system via cross-correlation analysis. Journal of Vision, 2, 1–11. O’Connor, K. N., Petkov, C. I., & Sutter, M. L. (2004). Stimulus optimization for auditory cortical neurons. Society for Neuroscience Abstracts, 529.14. Olshausen, B. A. (2003). Principles of image representation in visual cortex. In L. M. Chalupa & J. S. Werner (Eds.), The visual neurosciences (pp. 1603–1615). Cambridge, MA: MIT Press. Olshausen, B. A., & Anderson, C. H. (1995). A model of the spatial-frequency organization in primate striate cortex. In J. M. Bower (Ed.), The neurobiology of computation: Proceedings of the Third Annual Computation and Neural Systems Conference (pp. 275–280). Norwell, MA: Kluwer. Olshausen, B. A., Anderson, C. H., & Van Essen, D. C. (1993). A neurobiological model of visual attention and invariant pattern recognition based on dynamic routing of information. Journal of Neuroscience, 13, 4700–4719. Olshausen, B. A., & Field, D. J. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381, 607– 609.
Olshausen, B. A., & Field, D. J. (1997). Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37, 3311–3325. Pack, C. C., Berezovskii, V. K., & Born, R. T. (2001). Dynamic properties of neurons in cortical area MT in alert and anaesthetized macaque monkeys. Nature, 414, 905–908. Pack, C. C., Livingstone, M. S., Duffy, K. R., & Born, R. T. (2003). End-stopping and the aperture problem: Two-dimensional motion signals in macaque V1. Neuron, 39, 671–680. Paninski, L., Pillow, J. W., & Simoncelli, E. P. (2004). Maximum likelihood estimation of a stochastic integrate-and-fire neural encoding model. Neural Comput., 16, 2533– 2561. Parent, P., & Zucker, S. (1989). Trace inference, curvature consistency and curve detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11, 823–839. Parker, A. J., & Hawken, M. J. (1988). Two-dimensional spatial structure of receptive fields in monkey striate cortex. Journal of the Optical Society of America A, 5, 598–605. Peters, A., & Payne, B. R. (1993). Numerical relationships between geniculocortical afferents and pyramidal cell modules in cat primary visual cortex. Cereb. Cortex, 3, 69–78. Peters, A., Payne, B. R., & Budd, J. (1994). A numerical analysis of the geniculocortical input to striate cortex in the monkey. Cereb. Cortex., 4, 215–229. Polat, U., Mizobe, K., Pettet, M. W., Kasamatsu, T., & Norcia, A. M. (1998). Collinear stimuli regulate visual responses depending on cell’s contrast threshold. Nature, 391, 580–584. Polsky, A., Mel, B. W., & Schiller, J. (2004). Computational subunits in thin dendrites of pyramidal cells. Nat. Neurosci., 7, 621–627. Rao, R. P., & Ballard, D. H. (1999). Predictive coding in the visual cortex: A functional interpretation of some extra-classical receptive-field effects. Nat. Neurosci., 2, 79– 87. Rieke, F., Warland, D., de Ruyter van Steveninck, R., & Bialek, W. (1997). Spikes: Exploring the neural code. Cambridge, MA: MIT Press. 
Ringach, D., Hawken, M., & Shapley, R. (2002). Receptive field structure of neurons in monkey primary visual cortex revealed by stimulation with natural image sequences. Journal of Vision, 2, 12–24. Rockland, K. S., & Lund, J. S. (1983). Intrinsic laminar lattice connections in primate visual cortex. Journal of Comparative Neurology, 216, 303–318. Rose, G., Diamond, D., & Lynch, G. S. (1983). Dentate granule cells in the rat hippocampal formation have the behavioral characteristics of theta neurons. Brain Res., 266, 29–37. Rust, N. C., Schwartz, O., Movshon, J. A., & Simoncelli, E. P. (2004). Spike-triggered characterization of excitatory and suppressive stimulus dimensions in monkey V1. Neurocomputing, 58-60C, 793–799. Ruthazer, E. S., & Stryker, M. P. (1996). The role of activity in the development of long-range horizontal connections in area 17 of the ferret. Journal of Neuroscience, 16, 7253–7269. Sadato, N., Pascual-Leone, A., Grafman, J., Ibanez, V., Deiber, M. P., Dold, G., & Hallett, M. (1996). Activation of the primary visual cortex by braille reading in blind subjects. Nature, 380, 526–528.
Schwartz, O., & Simoncelli, E. P. (2001). Natural signal statistics and sensory gain control. Nat. Neurosci., 4, 819–825. Seriès, P., Lorenceau, J., & Frégnac, Y. (2003). The “silent” surround of V1 receptive fields: Theory and experiments. J. Physiol. Paris, 97, 453–474. Sha’ashua, A., & Ullman, S. (1988). Structural saliency: The detection of globally salient structures using a locally connected network. In Proceedings of the International Conference on Computer Vision, Tampa, Florida (pp. 321–327). Washington, DC: IEEE Computer Society Press. Shapley, R. (2004). A new view of the primary visual cortex. Neural Networks, 17, 615–623. Sharpee, T., Rust, N. C., & Bialek, W. (2004). Analyzing neural responses to natural signals: Maximally informative dimensions. Neural Computation, 16, 223–250. Sillito, A. M., Grieve, K. L., Jones, H. E., Cudeiro, J., & Davis, J. (1995). Visual cortical mechanisms detecting focal orientation discontinuities. Nature, 378, 492–496. Simoncelli, E. P., & Olshausen, B. A. (2001). Natural image statistics and neural representation. Annu. Rev. Neurosci., 24, 1193–1216. Sincich, L. C., & Blasdel, G. G. (2001). Oriented axon projections in primary visual cortex of the monkey. Journal of Neuroscience, 21, 4416–4426. Skottun, B. C., De Valois, R. L., Grosof, D. H., Movshon, J. A., Albrecht, D. G., & Bonds, A. B. (1991). Classifying simple and complex cells on the basis of response modulation. Vision Res., 31, 1079–1086. Smyth, D., Willmore, B., Baker, G. E., Thompson, I. D., & Tolhurst, D. J. (2003). The receptive-field organization of simple cells in primary visual cortex of ferrets under natural scene stimulation. J. Neurosci., 23, 4746–4759. Softky, W. R., & Koch, C. (1993). The highly irregular firing of cortical cells is inconsistent with temporal integration of random EPSPs. J. Neurosci., 13(1), 334–350. Tarr, M. J. (1999). News on views: Pandemonium revisited. Nat. Neurosci., 2, 932–935. Tenenbaum, J. B., & Freeman, W. T. (2000).
Separating style and content with bilinear models. Neural Computation, 12, 1247–1283. Thompson, L. T., & Best, P. J. (1989). Place cells and silent cells in the hippocampus of freely-behaving rats. J. Neurosci., 9, 2382–2390. Tinbergen, N. (1972). The animal in its world: Explorations of an ethologist. Cambridge, MA: Harvard University Press. Tolhurst, D. J., Movshon, J. A., & Dean, A. F. (1983). The statistical reliability of signals in single neurons in cat and monkey visual cortex. Vision Res., 23, 775–785. Touryan, J., Lau, B., & Dan, Y. (2002). Isolation of relevant visual features from random stimuli for cortical complex cells. J. Neurosci., 22, 10811–10818. Ullman, S. (1995). Sequence seeking and counter streams: A computational model for bidirectional information flow in the visual cortex. Cereb. Cortex., 5, 1–11. van Hateren, J. H., & van der Schaaf, A. (1998). Independent component filters of natural images compared with simple cells in primary visual cortex. Proc. R. Soc. Lond. B, 265, 359–366. Vinje, W. E., & Gallant, J. L. (2000). Sparse coding and decorrelation in primary visual cortex during natural vision. Science, 287, 1273–1276. Vinje, W. E., & Gallant, J. L. (2002). Natural stimulation of the nonclassical receptive field increases information transmission efficiency in V1. J. Neurosci., 22, 2904– 2915.
Walker, G. A., Ohzawa, I., & Freeman, R. D. (1999). Asymmetric suppression outside the classical receptive field of the visual cortex. Journal of Neuroscience, 19, 10536–10553. Watson, A. B., Barlow, H. B., & Robson, J. G. (1983). What does the eye see best? Nature, 302, 419–422. Wirth, S., Yanike, M., Frank, L. M., Smith, A. C., Brown, E. N., & Suzuki, W. A. (2003). Single neurons in the monkey hippocampus and learning of new associations. Science, 300, 1578–1581. Wörgötter, F., Suder, K., Zhao, Y., Kerscher, N., Eysel, U. T., & Funke, K. (1998). State-dependent receptive-field restructuring in the visual cortex. Nature, 396, 165–168. Yoshioka, T., Blasdel, G. G., Levitt, J. B., & Lund, J. S. (1996). Relation between patterns of intrinsic lateral connectivity, ocular dominance, and cytochrome oxidase-reactive regions in macaque monkey striate cortex. Cerebral Cortex, 6, 297–310. Young, M. P. (2000). The architecture of visual cortex and inferential processes in vision. Spatial Vision, 13, 137–146. Zetzsche, C., Krieger, G., & Wegmann, B. (1999). The atoms of vision: Cartesian or polar? J. Opt. Soc. Am. A, 16, 1554–1565. Zetzsche, C., & Rohrbein, F. (2001). Nonlinear and extra-classical receptive field properties and the statistics of natural scenes. Network, 12, 331–350. Zhou, H., Friedman, H. S., & von der Heydt, R. (2000). Coding of border ownership in monkey visual cortex. Journal of Neuroscience, 20, 6594–6611. Zipser, K., Lamme, V. A., & Schiller, P. H. (1996). Contextual modulation in primary visual cortex. J. Neurosci., 16, 7376–7389.
Received May 6, 2004; accepted January 31, 2005.
NOTE
Communicated by Rajesh Rao
Motion Contrast Classification Is a Linearly Nonseparable Problem Alireza S. Mahani
[email protected] Ralf Wessel
[email protected] Physics Department, Washington University, St. Louis, MO 63130, U.S.A.
Sensitivity to image motion contrast, that is, the relative motion between different parts of the visual field, is a common and computationally important property of many neurons in the visual pathways of vertebrates. Here we illustrate that, as a classification problem, motion contrast detection is linearly nonseparable. In order to do so, we prove a theorem stating a sufficient condition for linear nonseparability. We argue that nonlinear combinations of local measurements of velocity at different locations and times are needed in order to solve the motion contrast problem.
1 Introduction Many neurons in the visual pathways of vertebrates, such as cells in the middle temporal area of primates (Allman, Miezin, & McGuinness, 1985) and cells in the deep layers of the avian optic tectum (Frost & Nakayama, 1983), are sensitive to image motion contrast. Motion contrast sensitivity is believed to play a major role in important perceptual functions such as figure-ground segregation (Nakayama, 1985). Here we illustrate that as a classification problem, motion contrast detection is linearly nonseparable. Section 2 defines the concept of linear separability for a binary classification problem. Section 3 offers a sufficient condition for a problem to be linearly nonseparable. Section 4 uses the result of section 3 to prove that motion contrast detection is linearly nonseparable. Section 5 generalizes the nonseparability of the motion contrast problem to include a certain type of nonlinear preprocessing. In section 6 we discuss what happens when image intensity is used as input for calculating motion contrast. Section 7 offers concluding remarks. Neural Computation 17, 1700–1705 (2005)
© 2005 Massachusetts Institute of Technology
Motion Contrast Classification
1701
2 Defining Linear Separability

Suppose that we have a set of n d-dimensional samples x_1, …, x_n, n_1 of which are in the subset Ω_1 and n_2 = n − n_1 in the subset Ω_2. If we form a linear combination of the components of x_i, we obtain the scalar

y_i = wᵀ · x_i.   (2.1)
This operation produces a set of n samples y_1, …, y_n from the input samples x_1, …, x_n. The two subsets Ω_1 and Ω_2 are linearly separable if there exists a vector w and a scalar γ such that

∀x_i ∈ Ω_1: y_i > γ   and   ∀x_i ∈ Ω_2: y_i < γ,   (2.2)
with the y_i's defined in equation 2.1. In other words, two classes are linearly separable if they can be completely separated by a hyperplane in the feature space. If such a hyperplane cannot be found, the two classes are called linearly nonseparable. For more on linear separability, see Duda, Hart, and Stork (2001).

3 A Sufficient Condition for Linear Nonseparability

Given the above definition of linear separability, it is easy to prove the following theorem.

Theorem 1. Two classes are linearly nonseparable if we can find two samples from each class, x_1, x_2 ∈ Ω_1 and x_3, x_4 ∈ Ω_2, and two numbers α and β, such that

α x_1 + (1 − α) x_2 = β x_3 + (1 − β) x_4,   (3.1)

0 ≤ α, β ≤ 1.   (3.2)
Proof. If the two classes are separable, there are w and γ such that wᵀ · x_1 > γ, wᵀ · x_2 > γ, wᵀ · x_3 < γ, and wᵀ · x_4 < γ. From these four inequalities, we conclude that wᵀ · (α x_1 + (1 − α) x_2) > (1 − α + α)γ = γ and wᵀ · (β x_3 + (1 − β) x_4) < (1 − β + β)γ = γ. These in turn indicate that α x_1 + (1 − α) x_2 ≠ β x_3 + (1 − β) x_4, which leads to a contradiction.

As a special case, we can have α = β = 0.5. Equation 3.1 will now read x_1 + x_2 = x_3 + x_4; the sums of the two samples in each class are the same.
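The sufficient condition of theorem 1 is easy to check numerically. A minimal sketch (the function name and the XOR example are our own, not from the note):

```python
import numpy as np

def nonseparable_by_theorem1(x1, x2, x3, x4, alpha, beta):
    """Sufficient condition of theorem 1: if a convex combination of two
    samples of class 1 equals a convex combination of two samples of
    class 2, no separating hyperplane can exist."""
    assert 0.0 <= alpha <= 1.0 and 0.0 <= beta <= 1.0
    lhs = alpha * np.asarray(x1, float) + (1.0 - alpha) * np.asarray(x2, float)
    rhs = beta * np.asarray(x3, float) + (1.0 - beta) * np.asarray(x4, float)
    return bool(np.allclose(lhs, rhs))

# Special case alpha = beta = 0.5 on the XOR problem: the class sums
# coincide, so XOR is linearly nonseparable.
print(nonseparable_by_theorem1((0, 0), (1, 1), (0, 1), (1, 0), 0.5, 0.5))  # True
```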
1702
A. Mahani and R. Wessel
Figure 1: Pairs of points in feature space for motion contrast versus no motion contrast (top two rows). Arrows indicate the velocity of each dot, which is either v or −v. Both pairs have the same sum. This, according to theorem 1, is sufficient to prove the nonseparability of this classification problem.
4 Linear Nonseparability of the Motion Contrast Classification Problem

Using theorem 1, we can easily demonstrate the nonseparability of the motion contrast detection problem for two dots. Figure 1 illustrates a case where the sums of two samples from the two classes (with and without motion contrast) are equal. In terms of equation 3.1, we have x_1 = (v, −v), x_2 = (−v, v) and x_3 = (v, v), x_4 = (−v, −v), and therefore x_1 + x_2 = x_3 + x_4 = (0, 0). According to theorem 1, this proves that the problem is linearly nonseparable. The same argument can be applied to an object against a background, ignoring occlusion and edge effects.

5 Point-Wise Static Preprocessing

We have observed that a simple linear transformation of the input cannot separate the coherent and noncoherent cases. It can easily be shown that even if this linear transformation is preceded by a point-wise static nonlinear transformation, separation of the two classes of inputs still cannot be achieved. A point-wise static transformation, if considered in isolation, is nothing more than a scalar function of a scalar variable: g(v). The qualifiers "static" and "point-wise" are meant to emphasize that, as a transformation applied to a dynamic scalar field such as a time-varying image intensity field, it does not integrate (in the general sense) the values of the function over time or space to produce the output. Again, consider the two-dot scenario with x_1 = (v, −v), x_2 = (−v, v) (motion contrast) and x_3 = (v, v), x_4 = (−v, −v) (no motion contrast). If we apply g(·) to each of these samples, we find

x′_1 = (g(v), g(−v)),   x′_2 = (g(−v), g(v)),
x′_3 = (g(v), g(v)),    x′_4 = (g(−v), g(−v)).   (5.1)
Obviously, we still have x′_1 + x′_2 = x′_3 + x′_4. Therefore, the problem remains linearly nonseparable.
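That equal class sums survive any point-wise g can be checked numerically; a minimal sketch (the choice g = exp and the helper names are ours):

```python
import math

# Any point-wise static nonlinearity g works here; exp is an arbitrary choice.
g = math.exp

v = 1.0
x1, x2 = (v, -v), (-v, v)        # motion contrast pair
x3, x4 = (v, v), (-v, -v)        # no-motion-contrast pair

def pointwise(x):
    """Apply g to every component: a point-wise static transformation."""
    return tuple(g(c) for c in x)

sum_contrast = tuple(a + b for a, b in zip(pointwise(x1), pointwise(x2)))
sum_no_contrast = tuple(a + b for a, b in zip(pointwise(x3), pointwise(x4)))
print(sum_contrast == sum_no_contrast)  # True: the class sums still coincide
```

Each component of both sums is g(v) + g(−v), so the equality holds for every scalar g, as claimed.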
6 Image Intensity Field as Input

So far we have assumed that motion contrast detection uses the velocity field as input. It is indeed possible to detect motion contrast by directly using the image intensity. For example, consider two dots moving at velocities v_1 and v_2. A reasonable definition of the motion contrast (MC) of this arrangement is MC = |v_1 − v_2|. Clearly, v_1 = dr_1/dt and v_2 = dr_2/dt, with r_1 and r_2 being the displacement vectors for the two objects. We can now rephrase motion contrast as MC = |v_1 − v_2| = |dr_1/dt − dr_2/dt| = |d(r_1 − r_2)/dt|. The last expression shows that we can detect motion contrast between two moving dots by monitoring the change in their distance rather than calculating their velocities first. Dellen, Clark, and Wessel (2004) have suggested a more general method for detecting motion contrast that does not need velocity information.

Is motion contrast detection still linearly nonseparable if we use intensity instead of velocity information? The answer is yes. An argument very similar to that presented here can be made for the intensity-based approach. A point-wise static nonlinearity applied to the intensity field does not change the linear nonseparability either. This is because if we consider two widely separated objects in motion and apply the point-wise static transformation to the resulting intensity field, we arrive at two objects with possibly distorted profiles moving at the same velocities as in the original intensity field, and the distortion is independent of the velocities. Therefore, if the original problem is linearly nonseparable, so is the new one.
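The identity MC = |d(r_1 − r_2)/dt| can be checked numerically; a small sketch with hypothetical constant velocities v_1 = 2 and v_2 = −1 (the names and values are ours):

```python
import numpy as np

# Two dots on a line with hypothetical constant velocities v1 = 2, v2 = -1.
t = np.linspace(0.0, 1.0, 101)
dt = t[1] - t[0]
r1 = 2.0 * t
r2 = -1.0 * t

# Route 1: estimate each velocity first, then take the absolute difference.
mc_from_velocities = np.abs(np.gradient(r1, dt) - np.gradient(r2, dt))

# Route 2: differentiate the relative displacement r1 - r2 directly.
mc_from_distance = np.abs(np.gradient(r1 - r2, dt))

print(np.allclose(mc_from_velocities, mc_from_distance))  # True; both ≈ 3
```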
7 Discussion

In summary, we showed that motion contrast detection cannot happen through a linear integration of local inputs, even if these local inputs have been preprocessed by a static point-wise transformation (linear or nonlinear). In the primate visual cortex, the earliest area with motion-contrast-sensitive neurons is the middle temporal cortex (area MT). The main pathway leading from the retina to area MT passes through the lateral geniculate nucleus (LGN) and primary visual cortex (area V1). The presence of multiple stages between where the image intensity map is represented and where motion contrast is represented has naturally led many researchers to assume that in primates the velocity field is calculated prior to the computation of motion contrast. As a result, in building cortical models of image motion processing, most of the attention and effort has been devoted to estimating local image velocity (Poggio & Reichardt, 1973; Adelson & Bergen, 1985;
Simoncelli & Heeger, 1998). All these models require nonlinear processing of image intensity in order to estimate local image velocity. The highly nonlinear (phase-invariant) response of complex cells in area V1 (Skottun et al., 1991) is an important example of such nonlinearities. In this article, however, we began by assuming that the velocity field was provided as input to the motion contrast detection problem. Therefore, our conclusion that nonlinearities other than the point-wise static type are needed to perform the motion contrast detection task is independent of the well-known fact that estimation of the velocity field requires nonlinear processing. Simoncelli and Heeger (1998) use a nonlinear operation called divisive normalization to reduce the dependency of the response of their model MT neurons on image contrast. In its original form, this nonlinearity is restricted to the classical (center) receptive field of the neuron. The authors have suggested, however, that a similar normalization that pools neurons with opposite direction preferences in the center and surround areas may result in motion contrast sensitivity. While our result cannot confirm this hypothesis regarding the origin of motion contrast sensitivity in MT neurons, the two are certainly consistent with each other, as divisive normalization involves nonlinear interactions across space and is therefore not point-wise. Physiologically plausible ways of implementing such nonlinearities, using inhibitory feedback, have been suggested and (indirectly) tested (see, e.g., Carandini, Heeger, & Movshon, 1997). In the nonmammalian visual system, deep tectal neurons with motion contrast sensitivity are suspected to be only one synapse away from the retina (Luksch, 2003).
While this observation does not exclude the possibility of other inputs to such cells carrying the local velocity information, it certainly motivates research on possible ways to extract motion contrast information directly from the image intensity (Dellen et al., 2004). In this article, we argued that such a direct solution would still require nonlinear processing of a similar nature to that required while using local velocity as input. Anatomical (Luksch, 2003) and electrophysiological (Wang, 2003) studies suggest the existence of a number of other neural structures and pathways, both within and outside the optic tectum, with a potentially nonlinear role in shaping the complex response properties of deep tectal neurons, including their motion contrast sensitivity. Examples are the network of horizontal neurons in the optic tectum and the feedback loop between the optic tectum and its midbrain satellite, nucleus isthmi.
Acknowledgments

We thank Anders E. Carlsson and the anonymous reviewers for their critical reading of the manuscript. This work was supported by grants from the Whitehall Foundation and the McDonnell Center for Higher Brain Function to R.W.
References

Adelson, E. H., & Bergen, J. R. (1985). Spatiotemporal models for the perception of motion. Journal of the Optical Society of America A, 2, 284–299.
Allman, J. M., Miezin, F. M., & McGuinness, E. (1985). Stimulus specific responses from beyond the classical receptive field: Neurophysiological mechanisms for local-global comparisons in visual neurons. Annual Review of Neuroscience, 8, 407–430.
Carandini, M., Heeger, D. J., & Movshon, J. A. (1997). Linearity and normalization in simple cells of the macaque primary visual cortex. Journal of Neuroscience, 17, 8621–8644.
Dellen, B. K., Clark, J. W., & Wessel, R. (2004). Motion contrast computation without directionally selective motion sensors. Physical Review E, 70, 031907.
Duda, R. O., Hart, P. E., & Stork, D. G. (2001). Pattern classification. New York: Wiley.
Frost, B. J., & Nakayama, K. (1983). Single visual neurons code opposing motion independent of direction. Science, 220, 744–745.
Luksch, H. (2003). Cytoarchitecture of the avian optic tectum: Neuronal substrate for cellular computation. Reviews in the Neurosciences, 14, 85–106.
Nakayama, K. (1985). Biological image motion processing: A review. Vision Research, 25, 625–660.
Poggio, T., & Reichardt, W. (1973). Considerations on models of movement detection. Kybernetik, 13, 223–227.
Simoncelli, E. P., & Heeger, D. J. (1998). A model of neuronal responses in visual area MT. Vision Research, 38, 743–761.
Skottun, B. C., De Valois, R. L., Grosof, D. H., Movshon, J. A., Albrecht, D. G., & Bonds, A. B. (1991). Classifying simple and complex cells on the basis of response modulations. Vision Research, 31, 1079–1086.
Wang, S. R. (2003). The nucleus isthmi and dual modulation of the receptive field of tectal neurons in non-mammals. Brain Research Reviews, 41, 13–25.
Received April 23, 2004; accepted January 27, 2005.
NOTE
Communicated by Aapo Hyvärinen

Edgeworth-Expanded Gaussian Mixture Density Modeling
Marc M. Van Hulle
[email protected]
K. U. Leuven, Laboratorium voor Neuro- en Psychofysiologie, B-3000 Leuven, Belgium
Instead of increasing the order of the Edgeworth expansion of a single gaussian kernel, we suggest using mixtures of Edgeworth-expanded gaussian kernels of moderate order. We introduce a simple closed-form solution for estimating the kernel parameters based on weighted moment matching. Furthermore, we formulate the extension to the multivariate case, which is not always feasible with algebraic density approximation procedures.
1 Introduction

The Edgeworth expansion of the one-dimensional gaussian density function has been popular ever since it was introduced in independent component analysis (ICA) (Comon, 1994; Amari, Cichocki, & Yang, 1996) and projection pursuit (Jones & Sibson, 1987). However, objections have been raised to the Edgeworth expansion: that it would not estimate well the structure near the centroid of the density (Friedman, 1987), that it would be sensitive to outliers (Huber, 1985), and that it would not always result in positive definite approximations (Draper & Tierney, 1972). Other algebraic density approximations have been introduced, often as alternatives to the Edgeworth expansion, such as the normal inverse gaussian density (Barndorff-Nielsen, 1978), the generalized lambda distribution (Ramberg & Schmeiser, 1974), and the approximative maximum entropy approach (Hyvärinen, 1998). But they have their limitations as well. The obvious way to improve the density estimation accuracy is to increase the order of the Edgeworth expansion, but this does not lead to an immediate improvement; more seriously, it potentially renders the estimate even more sensitive to outliers. Rather than increasing the order of the Edgeworth expansion of a single gaussian kernel, we suggest using mixtures of Edgeworth-expanded gaussian kernels of moderate order. This leads to a simple closed-form solution for estimating the kernel parameters. Furthermore, we formulate the extension to the multivariate case.

Neural Computation 17, 1706–1714 (2005)
© 2005 Massachusetts Institute of Technology
Edgeworth-Expanded Gaussian Mixture Density Modeling
1707
2 Gaussian Mixture

Let v be a random scalar in V ⊆ ℝ generated from the probability density p(v). The homogeneous, heteroscedastic gaussian mixture density estimate of the unknown probability density is then

p(v) ≈ p̃(v|W, σ) = (1/N) Σ_i p(v|i, w_i, σ_i),   (2.1)

with N the number of gaussian kernels used, and with

p(v|i, w_i, σ_i) = (1/(√(2π) σ_i)) exp(−(v − w_i)²/(2σ_i²)).   (2.2)
A standard procedure to estimate the parameters W = (w_1, …, w_N) and σ = (σ_1, …, σ_N) is by minimizing the (average) negative log likelihood for the sample S = {vⁿ | n = 1, …, M} (Redner & Walker, 1984),

F = −log L = −(1/M) Σ_n log p̃(vⁿ|W, σ),   (2.3)

through an expectation-maximization (EM) approach (Dempster, Laird, & Rubin, 1977), which leads to the following fixed-point update rules:

w_i = Σ_n P(i|vⁿ) vⁿ / Σ_n P(i|vⁿ),
σ_i² = Σ_n P(i|vⁿ)(vⁿ − w_i)² / Σ_n P(i|vⁿ),   (2.4)

with P(i|v) the posterior probabilities, P(i|v) = p(v|i) / Σ_j p(v|j), and where we have omitted the kernel parameters to simplify the notation. The w_i and σ_i² can also be regarded as weighted first and second moments, weighted by the ith kernel's posterior probability.
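The fixed-point updates of equation 2.4 translate directly into code. A minimal sketch (the initialization, kernel count, and iteration count are our own choices, not the paper's):

```python
import numpy as np

def em_gaussian_mixture(v, N=3, iters=100):
    """Fixed-point EM updates of equation 2.4 for a uniform-weight (1/N)
    heteroscedastic gaussian mixture over a 1-D sample v."""
    w = np.quantile(v, np.linspace(0.1, 0.9, N))   # initial means
    s2 = np.full(N, np.var(v))                     # initial variances
    for _ in range(iters):
        # E-step: posteriors P(i|v^n) proportional to p(v^n|i)
        d2 = (v[:, None] - w[None, :]) ** 2
        lik = np.exp(-d2 / (2.0 * s2)) / np.sqrt(2.0 * np.pi * s2)
        P = lik / lik.sum(axis=1, keepdims=True)
        # M-step: weighted first and second moments (equation 2.4)
        w = (P * v[:, None]).sum(axis=0) / P.sum(axis=0)
        s2 = (P * (v[:, None] - w) ** 2).sum(axis=0) / P.sum(axis=0)
    return w, s2

rng = np.random.default_rng(0)
v = rng.uniform(-1.0, 1.0, 5000)
w, s2 = em_gaussian_mixture(v)
print(np.round(np.sort(w), 2), np.round(s2, 3))
```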
3 Edgeworth Expansion

The Edgeworth series expansion of the density p(v) around its best gaussian estimate φ_p (i.e., with the same mean µ and standard deviation σ as p) is
1708
M. Van Hulle
Table 1: Mean Squared Error (MSE) and Kullback-Leibler Divergence (KL) Density Estimation Performance, MSE/KL [×10⁻⁴].

                     N = 1 kernel                                    N = 3 kernels
pdf                  k = 1      k = 3      k = 4      k = 5          k = 1      k = 3
Cauchy λ = 1         50.5/97.8  30.4/80.3  30.4/80.3  30.4/80.3      21.0/75.5  20.1/67.9
Gaussian σ = 1       (0)/0.676  (0)/0.676  (0)/0.676  (0)/0.675      3.83/2.89  1.36/1.23
Laplacian Ø = 1      28.3/23.5  12.7/17.3  12.7/17.3  12.7/17.3      8.69/20.3  3.75/15.6
Triangular           8.53/2.49  4.38/1.55  4.38/1.55  4.39/1.55      6.70/1.12  1.67/0.980
Uniform [−1, 1]      115/11.9   69.5/11.4  69.5/11.4  69.5/11.4      45.9/12.4  32.0/5.35

Notes: One hundred thousand data points were drawn from each density (pdf), using N = 1 kernel and the first k = 1, 3, 4, and 5 terms in the Edgeworth expansion, equation 3.1, and using N = 3 gaussian (k = 1) and N = 3 Edgeworth-expanded (k = 3) kernels. By (0), we mean smaller than 10⁻⁵. The triangular pdf is isosceles with unit height.
given by Barndorff-Nielsen and Cox (1989):

p(v) ≈ φ_p(v) [1 + (1/3!) κ_3 H_3(v) + (1/4!) κ_4 H_4(v) + (10/6!) κ_3² H_6(v) + (1/5!) κ_5 H_5(v) + …],   (3.1)

with κ_i the ith standardized cumulant and H_i the Hermite polynomial of order i,

H_3(v) = z³ − 3z,
H_4(v) = z⁴ − 6z² + 3,
H_5(v) = z⁵ − 10z³ + 15z,
H_6(v) = z⁶ − 15z⁴ + 45z² − 15,

with z the standardized scalar z = (v − µ)/σ. The disadvantage of the Edgeworth expansion is that it does not estimate well the structure of the density near the centroid (Friedman, 1987) and that it is sensitive to outliers (Huber, 1985). Furthermore, the Edgeworth expansion cannot be made arbitrarily good by including terms of higher and higher orders (see, e.g., Table 1 for N = 1 kernel and five different densities). In an attempt to remedy this, we propose using kernel mixtures consisting of low-order Edgeworth-expanded gaussian kernels.

4 Edgeworth-Expanded Gaussian Mixture

We take for the p(v|i) in equation 2.1 the Edgeworth-expanded kernels in equation 3.1, up to the first three terms between the brackets (thus including the third- and fourth-order Hermite polynomials),

p(v) ≈ (1/N) Σ_i φ_pi(v) [1 + (1/3!) κ_3i H_3(v) + (1/4!) κ_4i H_4(v)],   (4.1)

and estimate the cumulants through the following weighted moments, weighted by the posterior probabilities P(i|v) (for the derivation, see the appendix):

µ_1i = Σ_n P(i|vⁿ) vⁿ / Σ_n P(i|vⁿ),
µ_2i = Σ_n P(i|vⁿ)(vⁿ − w_i)² / Σ_n P(i|vⁿ),
µ_3i = Σ_n P(i|vⁿ)(vⁿ − w_i)³ / Σ_n P(i|vⁿ),
µ_4i = Σ_n P(i|vⁿ)(vⁿ − w_i)⁴ / Σ_n P(i|vⁿ),   (4.2)

with µ_1i ≡ w_i and µ_2i ≡ σ_i², and note that the standardized third and fourth cumulants are κ_3i = µ_3i/σ_i³ and κ_4i = (µ_4i − 3µ_2i²)/σ_i⁴. We determine the kernel parameters using the traditional EM scheme.

When reconsidering the densities, we obtain for N = 3 kernels, without and with Edgeworth expansion, the results listed in the last two columns of Table 1, respectively. We observe that the Edgeworth-expanded mixtures yield better results compared to gaussian mixtures and to the case of one (Edgeworth-expanded or not) gaussian kernel (except when the input density is gaussian, since there a single gaussian kernel is obviously optimal). The result is depicted graphically for the uniform distribution in Figure 1. Note the smaller ripple in the Edgeworth-expanded case. Another practical observation is that the risk of negative densities is reduced with Edgeworth-expanded mixtures compared to the case of a single Edgeworth-expanded kernel.
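Equations 4.1 and 4.2 can be sketched directly in code. The helper names are ours, and for simplicity the posteriors are all set to 1 (a single kernel over the whole sample):

```python
import numpy as np

def weighted_cumulants(v, P_i, w_i):
    """Weighted moments of equation 4.2 for one kernel i, and the
    standardized cumulants k3 = mu3/sigma^3, k4 = (mu4 - 3*mu2^2)/sigma^4."""
    Z = P_i.sum()
    mu2 = (P_i * (v - w_i) ** 2).sum() / Z
    mu3 = (P_i * (v - w_i) ** 3).sum() / Z
    mu4 = (P_i * (v - w_i) ** 4).sum() / Z
    return mu3 / mu2 ** 1.5, (mu4 - 3.0 * mu2 ** 2) / mu2 ** 2

def edgeworth_kernel(v, w_i, s2_i, k3, k4):
    """Edgeworth-expanded gaussian kernel: the first three terms of eq. 4.1."""
    z = (v - w_i) / np.sqrt(s2_i)
    H3 = z ** 3 - 3 * z
    H4 = z ** 4 - 6 * z ** 2 + 3
    phi = np.exp(-z ** 2 / 2.0) / np.sqrt(2.0 * np.pi * s2_i)
    return phi * (1.0 + k3 * H3 / 6.0 + k4 * H4 / 24.0)

rng = np.random.default_rng(1)
v = rng.normal(0.0, 1.0, 20000)
P = np.ones_like(v)            # single kernel: every posterior equals 1
k3, k4 = weighted_cumulants(v, P, v.mean())
print(round(float(k3), 2), round(float(k4), 2))  # both near zero for gaussian data
```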
Figure 1: Gaussian mixture density estimate (dashed line) (A) and Edgeworth-expanded mixture density estimate (dashed line) (B) for the case of three kernels (thin solid lines) and the uniform input distribution [−1, 1] (thick solid line).
1710
M. Van Hulle
5 Multivariate Case

In case v = [v_1, …, v_d] ∈ V ⊆ ℝᵈ is a random vector drawn from the probability density p(v), we need to consider multivariate Edgeworth expansions. The Edgeworth expansion of p(v), up to order five about its best normal estimate, is given by (Barndorff-Nielsen & Cox, 1989)

p(v) ≈ φ_p(v) [1 + (1/3!) Σ_{i,j,k} κ_{i,j,k} H_{ijk}(v) + (1/4!) Σ_{i,j,k,l} κ_{i,j,k,l} H_{ijkl}(v) + (1/72) Σ_{i,j,k,l,p,q} κ_{i,j,k} κ_{l,p,q} H_{ijklpq}(v)],   (5.1)

with H_{ijk} the ijkth Hermite polynomial, with i, j, k the corresponding input dimensions, i, j, k ∈ {1, …, d}, and κ_{i,j,k} the corresponding standardized cumulant, κ_{i,j,k} = κ_{ijk}/√(σ_i² σ_j² σ_k²), with κ_{ijk} the third cumulant over input dimensions i, j, k, and where the sum over all combinations i, j, k is considered. Similarly, H_{ijkl} is the ijklth Hermite polynomial over input dimensions i, j, k, l, and κ_{i,j,k,l} = κ_{ijkl}/√(σ_i² σ_j² σ_k² σ_l²) the corresponding standardized cumulant, with κ_{ijkl} the fourth cumulant over input dimensions i, j, k, l (for the connection between moments and cumulants in the multivariate case, see McCullagh, 1987). We can then consider mixtures of the kernels in equation 5.1, as done previously for the one-dimensional case. As a simplification, one could consider only the third- and fourth-order Hermite polynomials for the largest second moment(s) of each kernel.

6 Other Expansions

Several methods exist for approximating algebraic functions of random variables. Following the tradition of adopting flexible transforms for densities combined with moment matching is the class of the normal inverse gaussian density (NIG) (Barndorff-Nielsen, 1978). In the case of a single flexible "kernel," the parameters can be obtained by solving a nonlinear system of equations, for example, one for the mean, the variance, the skewness, and the kurtosis of the data set (thus, unlike the Edgeworth expansion, no closed-form solution). The NIG kernel leads to a slightly improved performance with respect to (the tails of) the Edgeworth expansion (Eriksson, Forsberg, & Ghysels, 2004), but at the expense of requiring a parameter estimation algorithm. Kernel mixture density modeling is not feasible, since then we would have more parameters than equations. Also, the inverse gaussian density transform is not extendable to the multivariate case.
Another class is the generalized lambda distribution (GλD) (Ramberg & Schmeiser, 1974). Finding its parameters from moments requires minimizing a system of two bivariate nonlinear equations, which can be a challenging problem (for a review, see Lakhany & Mausser, 2000). Again, since no closed-form solution is available, kernel mixture density modeling is not feasible. Also, since the GλD works with quantile functions, it is in essence limited to the one-dimensional case.

Finally, we mention the approximative maximum entropy approach (Hyvärinen, 1998), which yields a density expansion of a gaussian similar to Edgeworth's, but in terms of a series of functions rather than a series of Hermite polynomials of increasing order. These functions can be selected so that the estimation of their coefficients becomes less sensitive to outliers. Kernel mixture density modeling is in principle feasible, since the first- and higher-order weighted moments with respect to these functions can be defined (similar to our procedure). However, the technique is not readily extendable to the multivariate case, since the multivariate orthonormal functions need to be supplied by the user.

7 Conclusion

Rather than increasing the order of the Edgeworth expansion of a single gaussian kernel, we introduced the Edgeworth-expanded gaussian mixture of moderate order and applied it to density estimation in an EM format. We showed the improved density estimation accuracy for the univariate case and formulated the extension to the multivariate case. We acknowledge that the practical value of using Edgeworth-expanded kernels decreases with increasing numbers of kernels in the mixture. Hence, for example, our approach could be applied to density-based clustering and classification, where the class conditionals are preferably modeled with small numbers of kernels. The Edgeworth-based density estimate could also serve as a starting point to derive differential entropy estimates, which could then be used in independent component analysis and projection pursuit. Finally, it is clear that instead of expanding gaussian kernels, one could equally well choose another type of reference kernel. However, using the gaussian mixture itself as the reference "kernel" leads to unwieldy Hermite polynomials with no guarantee of orthogonality and no immediate clue as to how to compute the coefficients in the expansion.

Appendix: Derivation of Equation 4.2

We have the following parameters to optimize: the first and second moments of the gaussian kernels, µ_1i and µ_2i, ∀i, and the standardized third and fourth cumulants, κ_3i and κ_4i, ∀i, since these appear in the truncated Edgeworth expansion. Since the third and fourth cumulants cannot be made explicit from the zero derivatives of the log likelihood, we consider the following
constrained optimization problem:

min {F}  subject to:
(κ_3i)² − (µ_3i/σ_i³)² ≤ 0   and   (κ_4i)² − ((µ_4i − 3σ_i⁴)/σ_i⁴)² ≤ 0,   ∀i,   (A.1)

with F the average negative log likelihood, and solve it with the Lagrange multipliers technique:

F_constr = −(1/M) Σ_n log (1/N) Σ_i p(vⁿ|i, w_i, σ_i, κ_3i, κ_4i)
         + Σ_i λ_3i [(κ_3i)² − (µ_3i/σ_i³)²] + Σ_i λ_4i [(κ_4i)² − ((µ_4i − 3σ_i⁴)/σ_i⁴)²],   (A.2)

with the λ_3i and λ_4i, ∀i, the Lagrange multipliers. Note that the constraints are the differences between the squared third and fourth cumulants and the squared expressions of the moments from which they are computed. The solution for µ_1i and µ_2i is identical to that of the traditional gaussian mixture, equation 2.4, when assuming that φ_pi(vⁿ)/p̃(vⁿ) ≈ P(i|vⁿ), with φ_pi representing the best gaussian estimate of p(v|i, w_i, σ_i, κ_3i, κ_4i). Consider now the derivative with respect to κ_3i:

∂F_constr/∂κ_3i = −(1/(3!NM)) Σ_n (φ_pi(vⁿ)/p̃(vⁿ)) H_3i(vⁿ) + λ_3i 2κ_3i = 0,

with H_3i the third-order Hermite polynomial of kernel i, H_3i(v) = z³ − 3z, z = (v − w_i)/σ_i. If we assume that φ_pi(vⁿ)/p̃(vⁿ) ≈ P(i|vⁿ), divide the equation by Σ_n P(i|vⁿ), and note that Σ_n P(i|vⁿ) zⁿ / Σ_n P(i|vⁿ) = 0, we obtain

0 = −(1/(3!NM)) Σ_n P(i|vⁿ)(zⁿ)³ / Σ_n P(i|vⁿ) + λ_3i 2κ_3i / Σ_n P(i|vⁿ)
  = −(1/(3!NM)) µ_3i/σ_i³ + λ_3i 2κ_3i / Σ_n P(i|vⁿ).   (A.3)

Assuming that the slack variable in the inequality constraint is zero, we have an equality constraint. The equality constraint tells us that κ_3i = µ_3i/σ_i³, so that from equation A.3 it follows that λ_3i = Σ_n P(i|vⁿ)/(2 · 3!NM). Assuming instead that the slack variable is nonzero, λ_3i = 0, we obtain from equation
A.3 the trivial solution µ_3i/σ_i³ = 0, a solution we need to reject when the sample has a nonzero weighted third moment. Reversing the inequality sign in the constraint leads to the same conclusion. Analogously, from ∂F_constr/∂κ_4i = 0, we obtain

0 = −(1/(4!NM)) (µ_4i/σ_i⁴ − 3) + λ_4i 2κ_4i / Σ_n P(i|vⁿ),   (A.4)

so that for a zero slack variable, we obtain κ_4i = (µ_4i − 3σ_i⁴)/σ_i⁴. The nonzero slack variable solution needs to be rejected since the sample has a nonzero weighted fourth moment. Reversing the inequality sign leads to the same conclusion.

Acknowledgments

I am supported by research grants received from the Belgian Fund for Scientific Research–Flanders (G.0248.03 and G.0234.04), the Interuniversity Attraction Poles Programme–Belgian Science Policy (IUAP P5/04), the Flemish Regional Ministry of Education (Belgium) (GOA 2000/11), and the European Commission (IST-2001-32114, IST-2002-001917, and NEST-2003-012963).

References

Amari, S.-I., Cichocki, A., & Yang, H. (1996). A new learning algorithm for blind signal separation. In D. S. Touretzky, M. Mozer, & M. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 757–763). Cambridge, MA: MIT Press.
Barndorff-Nielsen, O. E. (1978). Hyperbolic distributions and distributions on hyperbolae. Scandinavian Journal of Statistics, 5, 151–157.
Barndorff-Nielsen, O. E., & Cox, D. R. (1989). Inference and asymptotics. London: Chapman and Hall.
Comon, P. (1994). Independent component analysis—a new concept? Signal Processing, 36(3), 287–314.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39, 1–38.
Draper, N. R., & Tierney, D. E. (1972). Regions of positive and unimodal series expansion of the Edgeworth and Gram-Charlier approximations. Biometrika, 59, 463–465.
Eriksson, A., Forsberg, L., & Ghysels, E. (2004). Approximating the probability functions of random variables: A new approach. Presented at the 2004 Far Eastern Meeting of the Econometric Society, Yonsei, Korea. Available online at: http://www.unc.edu/~eghysels/papers/APR4.pdf.
Friedman, J. (1987). Exploratory projection pursuit. Journal of the American Statistical Association, 82(397), 249–266.
Huber, P. (1985). Projection pursuit. Annals of Statistics, 13(2), 435–475.
Hyvärinen, A. (1998). New approximations of differential entropy for independent component analysis and projection pursuit. In M. Jordan, M. Kearns, & S. Solla (Eds.), Advances in neural information processing systems, 10 (pp. 273–279). Cambridge, MA: MIT Press.
Jones, M., & Sibson, R. (1987). What is projection pursuit? Journal of the Royal Statistical Society A, 150, 1–36.
Lakhany, A., & Mausser, H. (2000). Estimating the parameters of the generalized lambda distribution. ALGO Research Quarterly, 3(3), 47–58.
McCullagh, P. (1987). Tensor methods in statistics. London: Chapman and Hall.
Ramberg, J., & Schmeiser, B. (1974). An approximate method for generating asymmetric random variables. Communications of the ACM, 17(2), 78–82.
Redner, R. A., & Walker, H. F. (1984). Mixture densities, maximum likelihood and the EM algorithm. SIAM Review, 26, 195–239.
Received November 23, 2004; accepted January 27, 2005.
LETTER
Communicated by Herbert Jaeger
Movement Generation with Circuits of Spiking Neurons
Prashant Joshi
[email protected]
Wolfgang Maass
[email protected]
Institute for Theoretical Computer Science, Technische Universität Graz, A-8010 Graz, Austria
How can complex movements that take hundreds of milliseconds be generated by stereotypical neural microcircuits consisting of spiking neurons with a much faster dynamics? We show that linear readouts from generic neural microcircuit models can be trained to generate basic arm movements. Such movement generation is independent of the arm model used and the type of feedback that the circuit receives. We demonstrate this by considering two different models of a two-jointed arm, a standard model from robotics and a standard model from biology, that each generates different kinds of feedback. Feedback that arrives with biologically realistic delays of 50 to 280 ms turns out to give rise to the best performance. If a feedback with such desirable delay is not available, the neural microcircuit model also achieves good performance if it uses internally generated estimates of such feedback. Existing methods for movement generation in robotics that take the particular dynamics of sensors and actuators into account (embodiment of motor systems) are taken one step further with this approach, which provides methods for also using the embodiment of motion generation circuitry, that is, the inherent dynamics and spatial structure of neural circuits, for the generation of movement.
Neural Computation 17, 1715–1738 (2005)

1 Introduction

Using biologically realistic neural circuit models to generate movements is not so easy, since these models are made of spiking neurons and dynamic synapses, which exhibit a rich inherent dynamics on several temporal scales. This tends to be in conflict with movement tasks that require sequences of precise motor commands on a relatively slow timescale. However, we show that without the construction of any particular circuit, training a linear readout to take a suitable weighted sum (with fixed weights after training) of the output activity of a fairly large number of neurons in a generic neural microcircuit model provides a very general paradigm for movement generation. It is obviously reminiscent of a number of experimental results (see, e.g.,
© 2005 Massachusetts Institute of Technology
1716
P. Joshi and W. Maass
Wessberg et al., 2000) that show that a suitable weighted sum of the activity from a fairly large number of cortical neurons in monkeys predicts quite well the trajectory of hand positions for a variety of arm movements. Obviously, the neural microcircuit model assumes here a role similar to that of a kernel for support vector machines in machine learning (for details, see Maass, Legenstein, & Bertschinger, 2004, and Maass, Natschläger, & Markram, 2004). This letter demonstrates that controllers made from generic neural microcircuits are functionally generic in the sense that readouts from such circuits can learn to control the arm regardless of the model used to describe the arm dynamics, the type of feedback used (visual or proprioceptive), and the type of movements that are generated. This is shown here by teaching the same generic neural circuit to generate reaching movements for two different models with different kinds of feedback. The first model used (model 1) is the standard model of a two-joint robot arm described in Slotine and Li (1991). The other model (Todorov, 2000, 2003) comes from biology and relates the activity of neurons in the cortical motor area M1 to the kinematics of the arm (model 2). It turns out that both the spatial organization of information streams, especially the population coding of slowly varying input variables, and the inherent dynamics of the generic neural microcircuit model have a significant impact on its capability to generate movements. In particular, it is shown that the inherent dynamics of neural microcircuits allows these circuits to cope with rather large delays for proprioceptive and sensory feedback. In fact, it turns out that the performance of this generic neurocontroller is optimal for feedback delays that lie in the biologically realistic range of 50 to 280 ms.
Furthermore, it is shown that other readout neurons from the same neural microcircuit model can be trained simultaneously to estimate results of such feedback, and that in the absence of real feedback, the precision of reaching movements can be improved significantly if the circuit gets access to these estimated feedbacks. This work complements preceding work where generic neural microcircuit models were used in an open loop for a variety of sensory processing tasks (Buonomano & Merzenich, 1995; Maass, Natschläger, & Markram, 2002; Maass, Natschläger, & Markram, 2004). It turns out that the demands on the precision of real-time computations carried out by such circuit models are substantially higher for closed-loop applications such as those considered in this article. The paradigm for movement generation discussed in this letter is somewhat related to preceding work (Ijspeert, Nakanishi, & Schaal, 2003), where a fixed parameterized system of differential equations was used instead of neural circuits, and to the melody generation and prediction of chaotic time series with artificial neural networks in discrete time of Jäger (2002) and Jäger and Haas (2004). In these other models, no effort is made to choose a movement generator whose inherent dynamics has a similarity to that of biological neural circuits. It has not yet been sufficiently
investigated whether feedback, especially feedback with a realistic delay, can have similarly beneficial consequences in these other models. No effort was made in this article to make the process by which the neural circuit model (more specifically, the readouts from this circuit) learns to generate specific movement primitives biologically realistic. Hence, the results of this article provide evidence only that a generic neural microcircuit can hold the information needed to generate certain movement primitives and that it can generate a suitable slow dynamics with high precision. The structure of this letter is as follows. Section 2 describes the neural microcircuit model. This is followed by the description of the robot arm model (model 1) in section 3. Sections 4, 5, and 6 present results of computer simulations for model 1. Section 7 repeats the experiment described in section 4 for the biologically motivated arm model (model 2). Finally, we discuss robustness issues related to our new paradigm for movement generation in section 8. A preliminary version of some results from this letter (for movements of just one fixed temporal duration, and without model 2) was presented at a conference (Joshi & Maass, 2004).

2 Generic Neural Microcircuit Models

In contrast to common artificial neural network models, neural microcircuits in biological organisms consist of diverse components, such as different types of spiking neurons and dynamic synapses, each endowed with an inherently complex dynamics of its own. This makes it difficult to construct neural circuits out of biologically realistic computational units that solve specific computational problems, such as generating arm movements to various given targets.
In fact, the generation of a smooth arm movement appears to be particularly difficult for a circuit of spiking neurons, since the dynamics of arm movements takes place on a timescale of hundreds of milliseconds, whereas the inherent dynamics of spiking neurons takes place on a much faster timescale. We show that this problem can be solved, even with a generic neural microcircuit model whose internal dynamics has not been adjusted or specialized for the task of creating arm movements, by taking as activation command for a muscle at any time t a weighted sum w · z(t) of the vector z(t) that describes the current firing activity of all neurons in the circuit.1 The weight vector w, which remains fixed after training, is the only part that needs to be specialized for the generation of a particular movement task. Each component of z(t) models the impact that a particular neuron v may have on the membrane potential of a generic readout neuron. Thus, each spike of neuron v is replaced by a pulse of amplitude 1 that decays exponentially with a time constant of 30 ms. In other words, z(t) is obtained by applying a low-pass filter to the spike trains emitted by the neurons in the generic neural microcircuit model. Note that it is already known that hand trajectories of monkeys can be recovered from the current firing activity z(t) of neurons in motor cortex through the same types of weighted sums as considered in this article (Wessberg et al., 2000). In principle, one can also view various parameters within the circuit as being subject to learning or adaptation, for example, in order to optimize the dynamics of the circuit for a particular range of control tasks. However, this has turned out to be not necessary for the applications described in this article, although it remains an interesting open research problem how unsupervised learning could optimize a circuit for motor control tasks. One advantage of viewing the weight vector w as being plastic is that learning is quite simple and robust, since it amounts to linear regression, in spite of the highly nonlinear nature of the control tasks to which this setup is applied. Another advantage is that the same neural microcircuit could potentially be used for various other information processing tasks (e.g., prediction of sensory feedback; see section 6) that may be desirable for the same or other tasks. The generic microcircuit models used for the closed-loop control tasks described in this article were similar in structure to those used earlier for various sensory processing tasks in an open loop. More precisely, we considered circuits consisting of 600 leaky integrate-and-fire neurons arranged on the grid points of a 20 × 5 × 6 cube in 3D (see Figure 1). Twenty percent of these neurons were randomly chosen to be inhibitory. Synaptic connections were chosen according to a biologically realistic probability distribution that favored local connections but also allowed some long-range connections.

1 As usual, a constant component is formally included in z(t) so that the term w · z(t) may contain some fixed bias.
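The low-pass filtering that turns spike trains into the state vector z(t), and the linear readout w · z(t), can be sketched in a few lines. The circuit itself is replaced here by random spike trains, and all sizes and values are illustrative assumptions rather than the authors' simulator:

```python
import numpy as np

TAU = 0.030   # 30 ms decay constant of the exponential filter (as in the text)
DT = 0.002    # 2 ms simulation time step (as in the text)

def liquid_state(spikes, tau=TAU, dt=DT):
    """Low-pass filter a (T, N) binary spike array: each spike contributes a
    pulse of amplitude 1 that decays exponentially with time constant tau."""
    decay = np.exp(-dt / tau)
    z = np.zeros_like(spikes, dtype=float)
    for t in range(1, spikes.shape[0]):
        z[t] = decay * z[t - 1] + spikes[t]
    return z

def readout(z, w, bias=0.0):
    """Linear readout w . z(t) (plus a constant bias term) at every time step."""
    return z @ w + bias

rng = np.random.default_rng(0)
spikes = (rng.random((250, 600)) < 0.01).astype(float)  # 500 ms, 600 neurons
z = liquid_state(spikes)
w = rng.normal(size=600) * 0.01          # stands in for trained readout weights
torque = readout(z, w)                   # one motor command per 2 ms time step
```

After training, only `w` would be specialized for a particular movement; the filter and the circuit stay fixed.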
Biologically realistic models for dynamic synapses were employed instead of the usual static synapses of artificial neural network models. Parameters of neurons and synapses were chosen to fit data from microcircuits in rat somatosensory cortex (based on Gupta, Wang, & Markram, 2000, and Markram, Wang, & Tsodyks, 1998; see the appendix). In order to test the noise robustness of movement generation by the neural microcircuit model, the initial condition of the circuit was randomly drawn (initial membrane potential for each neuron drawn uniformly from the interval [13.5 mV, 14.9 mV], where 15 mV was the firing threshold). In addition a substantial amount of noise was added to the input current of each neuron throughout the simulation at each time step; a new value for the noise input current with mean 0 and SD of 1 nA was drawn for each neuron and added (subtracted) to its input current. In the case of the arm model that is considered in sections 3 through 6 (model 1), the neural circuit receives analog input streams from six sources (from eight sources in the experiment with internal predictions discussed
Figure 1: Spatial arrangement of neurons in the neural microcircuit models considered in this letter. The neurons in the six layers on the left-hand side encode the values of the six input-feedback variables xdest, ydest, θ1(t − Δ), θ2(t − Δ), τ1(t), τ2(t) in a standard population code. Connections from these six input layers (shown for a few selected neurons), as well as connections between neurons in the subsequent six processing layers, are chosen randomly according to a probability distribution discussed in the text (a typical example is shown).
in Figures 8 and 9). A critical factor for the performance of these neurocontrollers is the way in which these time-varying analog input streams are fed into the circuit. The outcomes of the experiments discussed in this article would all have been negative if these analog input streams had been fed into the circuit as time-varying input currents. Apparently the variance of the resulting spike trains was too large to make the information about the slowly varying values of these input streams readily accessible to the circuit. Therefore, we employed instead a standard form of population coding (Pouget & Latham, 2003). Each of the six time-varying input variables was mapped onto an array of 50 symbolic input neurons with bell-shaped tuning curves (see the appendix). Thus, the value of each of the six input variables is encoded at any time by the output values of the associated 50 symbolic input neurons (of which at least 43 output the value 0 at any time). The neurons in each of these six input arrays are connected2 with one of the six layers, each consisting of 100 neurons, of the circuit of (20 × 5) × 6 integrate-and-fire neurons, providing a time-varying input current to a randomly selected subset of integrate-and-fire neurons on that layer (see Figure 1).
2 With a value of 3.3 for λ in the formula for the connection probability given in the appendix.
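The population code just described can be sketched as follows. The input range and tuning width are illustrative assumptions (the actual tuning curves are specified in the appendix, which is not part of this excerpt); the width is chosen so that most of the 50 neurons are silent at any time, as stated in the text:

```python
import numpy as np

N_NEURONS = 50   # one array of 50 symbolic input neurons per input variable

def population_code(x, lo=0.0, hi=1.0, n=N_NEURONS, width=0.02):
    """Encode scalar x in [lo, hi] as n gaussian (bell-shaped) tuning-curve
    activations; activations below a small cutoff are treated as silent."""
    centers = np.linspace(lo, hi, n)
    act = np.exp(-((x - centers) ** 2) / (2.0 * width ** 2))
    act[act < 1e-3] = 0.0   # most neurons output 0 at any given time
    return act

code = population_code(0.37)
```

Only the few neurons whose tuning-curve centers lie near the current input value are active, so a slowly varying input produces a slowly moving bump of activity rather than a high-variance spike train.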
Figure 2: Closed-loop application of a generic neural microcircuit. The weight vectors of the linear readouts from this circuit that produce the next motor commands τ1(t + 1), τ2(t + 1) are the only parameters that are adjusted during training. After training, the neural circuit receives in this closed loop as inputs a target position xdest, ydest for the tip of the robot arm (in Cartesian coordinates; these inputs remain constant during the subsequent arm movement), feedback θ1(t − Δ), θ2(t − Δ) from the arm representing previous values of the joint angles delayed by an amount Δ, as well as "efferent copies" τ1(t), τ2(t) of its preceding motor commands. All the dynamics needed to generate the movement is then provided by the inherent dynamics of the neural circuit in response to the switching on of the constant inputs (and in response to the dynamics of the feedbacks). During training of the readouts from the generic neural circuit, the proprioceptive feedbacks θ1(t − Δ), θ2(t − Δ) and the efferent copies of previous motor commands τ1(t), τ2(t) are replaced by corresponding values for a target movement, which are given as external inputs to the circuit ("imitation learning").
3 A Two-Joint Robot Arm as a Benchmark Nonlinear Control Task

We first trained a generic neural microcircuit model (see Figures 1 and 2) to control a standard model for a two-joint robot arm (model 1; see Figure 3). This model is used in Slotine and Li (1991) as a standard reference model for a complex nonlinear control task (see in particular sections 6 and 9). It is assumed that the arm is moving in a horizontal plane, so that gravitational forces can be ignored.
Figure 3: Standard model of a two-joint robot arm.
Using the well-known Lagrangian equation in classical dynamics, the dynamic equations for this arm model are given by equation 3.1:

$$
\begin{bmatrix} H_{11} & H_{12} \\ H_{21} & H_{22} \end{bmatrix}
\begin{bmatrix} \ddot{\theta}_1 \\ \ddot{\theta}_2 \end{bmatrix}
+
\begin{bmatrix} -h\dot{\theta}_2 & -h(\dot{\theta}_1 + \dot{\theta}_2) \\ h\dot{\theta}_1 & 0 \end{bmatrix}
\begin{bmatrix} \dot{\theta}_1 \\ \dot{\theta}_2 \end{bmatrix}
=
\begin{bmatrix} \tau_1 \\ \tau_2 \end{bmatrix},
\tag{3.1}
$$

with $\theta = [\theta_1\ \theta_2]^T$ being the two joint angles, $\tau = [\tau_1\ \tau_2]^T$ being the input torques to the two joints, and

$$
\begin{aligned}
H_{11} &= m_1 l_{c1}^2 + I_1 + m_2 \left( l_1^2 + l_{c2}^2 + 2 l_1 l_{c2} \cos\theta_2 \right) + I_2 \\
H_{12} &= H_{21} = m_2 l_1 l_{c2} \cos\theta_2 + m_2 l_{c2}^2 + I_2 \\
H_{22} &= m_2 l_{c2}^2 + I_2 \\
h &= m_2 l_1 l_{c2} \sin\theta_2 .
\end{aligned}
$$

Equation 3.1 can be compactly written as

$$
H(\theta)\,\ddot{\theta} + C(\theta, \dot{\theta})\,\dot{\theta} = \tau,
$$

where $H$ represents the inertia matrix and $C$ represents the matrix of Coriolis and centripetal terms. $I_1, I_2$ are the moments of inertia of the two joints. The
values of the parameters that were used in our simulations were m1 = 1, m2 = 1, lc1 = 0.25, lc2 = 0.25, I1 = 0.03, and I2 = 0.03. The closed-loop control system that we used is shown in Figure 2. During training of the weights of the linear readouts from the generic neural microcircuit model, the circuit was used in an open loop, with target values for the output torques provided by equation 3.1 (for a given target trajectory {θ1(t), θ2(t)}) and feedbacks from the plant replaced by the target values of these feedbacks for the target trajectory. The delay Δ of the proprioceptive or sensory feedback is assumed to have a fixed value of 200 ms, except for section 6, where we study the impact of this value on the precision of the movement. For each such target trajectory, 20 variations of the training samples were generated, for which at each time step3 t a different noise value of 10⁻⁵ × ρ was added to each of the input channels, where ρ is a random number drawn from a gaussian distribution with mean 0 and SD 1, multiplied by the current value of that input channel. The purpose of this extended training procedure was to make the readout robust with regard to deviations from the target trajectory caused by faulty earlier torque outputs given by the readouts from the neural circuit (see section 8). Each target trajectory had a time duration of 500 ms.

4 Teaching a Generic Neural Microcircuit Model to Generate Basic Movements

As a first task, the generic neural microcircuit model described in section 2 was taught to generate with the two-joint arm described in section 3 the four movements indicated in Figure 4. In each case, the task was to move the tip of the arm from point A to point B on a straight line, with a biologically realistic bell-shaped velocity profile.
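For concreteness, the arm dynamics of equation 3.1 can be integrated forward in time with the parameter values above. The link length l1 and the simple Euler scheme are assumptions of this sketch (the text gives only the distances lc1, lc2 to the centers of mass):

```python
import numpy as np

# Parameter values from the text; l1 is an assumed link length.
m1, m2 = 1.0, 1.0
lc1, lc2 = 0.25, 0.25
I1, I2 = 0.03, 0.03
l1 = 0.5          # assumed; not stated in this excerpt

def arm_step(theta, dtheta, tau, dt=0.002):
    """One forward-Euler step of H(theta) theta'' + C(theta, theta') theta' = tau."""
    t2 = theta[1]
    H11 = m1 * lc1**2 + I1 + m2 * (l1**2 + lc2**2 + 2 * l1 * lc2 * np.cos(t2)) + I2
    H12 = m2 * l1 * lc2 * np.cos(t2) + m2 * lc2**2 + I2
    H22 = m2 * lc2**2 + I2
    H = np.array([[H11, H12], [H12, H22]])
    h = m2 * l1 * lc2 * np.sin(t2)
    C = np.array([[-h * dtheta[1], -h * (dtheta[0] + dtheta[1])],
                  [ h * dtheta[0], 0.0]])
    ddtheta = np.linalg.solve(H, tau - C @ dtheta)
    return theta + dt * dtheta, dtheta + dt * ddtheta

theta, dtheta = np.array([0.5, 0.5]), np.zeros(2)
for _ in range(250):          # 500 ms movement at the 2 ms time step
    theta, dtheta = arm_step(theta, dtheta, tau=np.array([0.1, 0.0]))
```

In the article this plant is driven by the torques produced by the trained readouts; here a constant torque stands in for them.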
The two readouts from the neural microcircuit model were trained by linear regression to output the joint torques required for each of these movements.4

3 All time steps were chosen to have a length of 2 ms, except for the experiment reported in Figure 6, where a step size of 1 ms was used to achieve a higher precision.

4 Training data were generated as follows. For a given start point xstart, ystart and target end point xdest, ydest of a movement (both given in Cartesian coordinates), an interpolating trajectory of the tip of the arm was generated according to the following equation given in Flash and Hogan (1985):
$$
\begin{aligned}
x(t) &= x_{start} + (x_{start} - x_{dest}) \cdot (15\tau^4 - 6\tau^5 - 10\tau^3) \\
y(t) &= y_{start} + (y_{start} - y_{dest}) \cdot (15\tau^4 - 6\tau^5 - 10\tau^3),
\end{aligned}
$$

where τ = t/MT and MT is the target movement time (in this case, MT = 500 ms). From this target trajectory for the end point of the robot arm, we generated target trajectories of the angles θ1, θ2 of the robot arm by applying standard equations from geometry (see, e.g., Craig, 1955). From these, the target trajectories of the torques were generated according to equation 3.1.
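This minimum-jerk interpolation can be sketched directly from the formula above; the start and end points are arbitrary example values, and MT = 0.5 s matches the 500 ms movements in the text:

```python
import numpy as np

def min_jerk(start, dest, mt=0.5, dt=0.002):
    """Positions along the minimum-jerk path from `start` to `dest`."""
    s = np.linspace(0.0, 1.0, int(round(mt / dt)) + 1)   # s = t / MT
    blend = 15 * s**4 - 6 * s**5 - 10 * s**3             # runs from 0 to -1
    return start + (start - dest) * blend[:, None]

xy = min_jerk(np.array([0.5, 0.3]), np.array([1.2, 1.5]))
# xy starts at the start point and ends at the destination; the speed
# profile np.linalg.norm(np.diff(xy, axis=0), axis=1) is bell-shaped.
```

The polynomial blend is 0 at τ = 0 and −1 at τ = 1, so the trajectory moves from the start point to the destination with zero velocity and acceleration at both ends, which is exactly the bell-shaped velocity profile the readouts are trained to reproduce.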
Figure 4: Initial position A and end position B of the robot arm (model 1) for four target movements, scaled in meters. The target trajectories of the tip of the robot arm and of the elbow are indicated by dashed and dashed-dotted lines. One sees clearly that even simple linear movements of the tip of the arm require quite nonlinear movements of the elbow.
Twenty noisy variations of each of the four target movements were used for the training of the two readouts by linear regression, as specified in section 3. Note that each readout is simply modeled as a linear gate with weight vector w applied to the liquid state x(t) of the neural circuit. This weight vector is fixed after training, and during validation all four movements are generated with this fixed-weight vector at the readout. The performance of the trained neural microcircuit model during validation in the closed loop (see Figure 2) is demonstrated in Figure 5. When the circuit receives as input the coordinates xdest , ydest of the end point B of one of the target movements shown in Figure 4, the circuit autonomously generates in a closed loop the torques needed to move the tip of the two-joint arm from the corresponding initial point A to this end point B.5
5 In these experiments, no effort was made to stabilize the end point of the arm at or near the target position. Rather, the movement was externally halted at the end of the allotted time period of 500 ms. Hence, the neural circuit model acts as a movement generator rather than as a controller. However, we are not aware of a fundamental obstacle that would make it impossible to teach a circuit to stabilize the arm once it has reached the target position (which the circuit receives as an extra input).
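Training a readout by linear regression, as described above, amounts to ordinary least squares on the recorded liquid states. This sketch uses synthetic random states and a synthetic target torque trace in place of data recorded from the circuit; all sizes and names are assumptions:

```python
import numpy as np

# Z: liquid states x(t) of the 600-neuron circuit, one row per 2 ms time step;
# here they are synthetic random stand-ins for recorded data.
rng = np.random.default_rng(2)
Z = rng.normal(size=(250 * 20, 600))          # 20 noisy runs of 500 ms each
w_true = rng.normal(size=600) * 0.1           # pretend "correct" readout
tau_target = Z @ w_true + 0.01 * rng.normal(size=Z.shape[0])  # noisy targets

# Least-squares fit of the readout weights (fixed after training).
w, *_ = np.linalg.lstsq(Z, tau_target, rcond=None)

tau_out = Z @ w          # torque commands the trained readout would emit
```

Pooling the 20 noisy variations of each target movement into one regression problem is what makes the readout robust to small deviations from the target trajectory, as discussed in section 3.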
Figure 5: Target trajectories of the tip of the robot arm as in Figure 4 (solid) and resulting trajectories of the tip of the robot arm in a closed loop for one of the test runs (dashed line). The dots around the target end points show the end points of the tip of the robot arm for 10 test runs for each of the movements (enlarged inserts show a 20 cm × 20 cm area with target end point B marked by a black open triangle). Differences are due to varying initial conditions and simulated inherent noise of the neural circuit. Nevertheless, all movement trajectories converged to the target, with an average deviation from the target end point of 4.72 cm and an SD of 0.85 cm (scale of figures in m).
Obviously temporal integration capabilities of the controller are needed for the control of many types of movements. The next experiment was designed to test explicitly this capability of neurocontrollers constructed from generic circuits of spiking neurons. Figure 6 shows results for the case where the readouts from the neural microcircuit have been trained to generate an arm movement with an intermediate stop of all movement from 225 to 275 ms (see the velocity profile at the bottom of Figure 6). The initiation of the continuation of the movement at time t = 275 ms has to take place without any external cue, just on the basis of the inherent temporal integration capability of the neural circuit. For demonstration purposes, we chose for the experiment reported in Figure 6 a feedback delay of just 1 ms, so that all circuit inputs are constant during 49 ms of the 50 ms while the controller has to wait, forcing the readouts to decide only on the basis of the inherent circuit dynamics when to move on. Nevertheless, the average deviation of the tip of the robot arm for 20 test runs (with noisy initial conditions and noise
Figure 6: Demonstration of the temporal integration capability of the neural controller. The data shown are for a validation run for a circuit that has been trained to generate a movement that requires an intermediate stop and then autonomous continuation of the movement after 50 ms. (A) Spike raster of the 600 neurons on the right-hand side of Figure 1. Note that the readout neurons receive at time t only information about the last few spikes before time t (more precisely, they receive at time t the liquid state x(t) of the circuit as their only input). (B) Target time courses of the joint angles θ1 , θ2 , joint torques τ1 , τ2 , and the velocity of the tip of the robot arm are shown as a solid line; actual time courses of these variables during a validation run in closed loop are shown as dashed lines.
on feedbacks as before) was just 6.86 cm, and the bottom part of Figure 6 shows (for a sample test run) that the tip of the robot arm came to a halt during the period from 225 to 275 ms, and then autonomously continued to move.
Figure 7: Generation of reaching movements to new target end points that lie between end points used for training. (A) Generalization of movement generation to five target end points (small circles) that were not among the eight target end points (small squares) that occurred during training. Movement to a new target end point was initiated by giving its Cartesian coordinates as constant inputs to the circuit. Average deviation for 15 runs with new target end points: 10.3 cm (4.8 cm for target end points that occurred during training). (B) The velocity profile for one of the movements to a new target end point: the solid line is the ideal bell-shaped velocity profile; the actual profile is shown as a dashed line.
5 Generalization Capabilities

The trained neurocontroller (with the weights of the linear readouts being the only parameters that were adjusted during training) had some limited capabilities to generate arm-reaching movements to new targets. For the experiment reported in Figure 7, the circuit was trained to generate, from a common initial position, reaching movements to eight different target positions, given in terms of their Cartesian coordinates as constant inputs xdest, ydest to the circuit. After training, the circuit was able to generate, with fairly high precision, reaching movements to other target points never used during training, provided that they were located between target points used
for training. The autonomously generated reaching movements moved the tip of the robot arm on a rather straight line with a bell-shaped velocity profile, just as for those reaching movements to targets that were used for training.

6 On the Role of Feedback Delays and Autonomously Generated Feedback Estimates

Our model assumes that the neural circuit receives as inputs, in addition to the constant target end points and efferent copies τ1(t), τ2(t) of its movement commands with very little delay, proprioceptive or visual feedback that provides at time t information about the values of the angles of the joints at time t − Δ. Whereas it is quite difficult to construct circuits or other artificial controllers for imitating movements that can benefit significantly from feedback (e.g., with the approach of Ijspeert et al., 2003), especially if this feedback is significantly delayed, we show in Figure 8 that neurocontrollers built from generic neural microcircuit models are able to generate and control movements for feedback with a wide range of delays. In fact, Figure 8 shows that the smallest deviation between the target end point xdest, ydest and the actual end point of the tip of the robot arm is achieved not when this delay Δ has a value of 0, but for a range of delays between 50 and 280 ms. In order to make sure that this surprising result is not an artifact of some particular randomly drawn neural microcircuit model or a particular arm movement, it has been tested on each of 10 randomly drawn neural microcircuits with 12 different movements (four trajectories as shown in Figure 4, each created at three different speeds, resulting in movement times of 300, 500, and 700 ms). The results of these statistical experiments are reported in Figure 8. The right-most point on each of the three curves shows the performance achieved without any feedback (since for this point, the delay of the feedback is as large as the duration of the whole movement).
Compared with that, feedback with a suitable delay reduces the imprecision of the movement by at least 50%. Altogether, these data show that the best values for the feedback delay lie in the range of 50 to 280 ms. The upper bound for this interval depends somewhat on the duration of the movement. A possible explanation for the fact that feedback with a delay of less than 50 ms is less helpful is that in this case, the current target circuit output is very similar to the currently arriving feedback, and hence it is more difficult for the circuit to learn the map from current feedback to current target output in a noise-robust fashion. In addition, a delayed feedback complements the inherent temporal integration property of the neural microcircuit model (see Maass, Natschläger, & Markram, 2004), and therefore tends to enlarge the time constant for the fading of memory in the closed-loop system. Hence these neurocontrollers perform best for a range of feedback delays that contain typical values of actual delays for proprioceptive and visual feedback measured in a variety of species (e.g., 120 ms for proprioceptive feedback
Figure 8: Influence of feedback delay on movement error. Error is defined as the distance between the desired and observed end points of the movement. The delay Δ is for the proprioceptive feedbacks θ1(t − Δ), θ2(t − Δ). The curves show the averages, and the vertical bars show the SD of the data achieved for 400 movements for each value of Δ (four different movements as shown in Figure 4, repeated 10 different times with different random initial conditions of the circuit and different online noise, for each of 10 randomly drawn generic neural microcircuit models). (A–C) Data for three movement durations: 300, 500, and 700 ms. (D) The upper curve shows results for a slightly larger neural circuit (consisting of 800 instead of 600 leaky integrate-and-fire neurons). The lower (dashed) curve in D shows the performance of the same circuits when internally generated estimates of proprioceptive feedbacks (for a delay of 200 ms) were fed back as additional inputs to the neural circuit. Note that the use of such internally estimated feedbacks not only improves the movement precision for all values of the actual feedback delay Δ (except for Δ = 200 ms), but also reduces the SD of the precision achieved for different circuits considerably.
and 200 ms for visual feedback are reported in van Beers, Baraduc, & Wolpert, 2002). In another computer experiment, we examined the potential benefit of using estimated feedback for the neurocontroller under consideration. Estimation of feedback is very easy for such a neural architecture, since the generic
Figure 9: Information flow for the case of autonomously generated estimates θ̂(t − 200 ms) of the delayed feedback θ(t − 200 ms). The rest of the circuit is as in Figure 2.
neural microcircuit model that generates (via suitable readouts) the movement commands has not been specialized in any way for this movement generation task and can simultaneously be used as an information reservoir for estimating feedback. More precisely, two additional readouts were added and trained to estimate at any time t the values of the joint angles θ1 and θ2 at time t − 200 ms, that is, 200 ms earlier. These delayed values were chosen as targets for these two additional readouts during training, since the previously reported results (see, in particular, Figure 8B) show that feedback of the actual values of θ1 and θ2 with a delay of 200 ms is quite beneficial for the precision of the movement that is generated. After training, the weights of these two additional readouts were frozen (as for the first two readouts, which produce the movement commands).6 The outputs of these two additional readouts were also fed back into the circuit (without delay; see Figure 9). Compared with the architecture shown in Figure 2, the neural circuit now receives two additional time-varying inputs. These were fed into the circuit in the same way as the other six inputs (described in section 2). Thus, two additional arrays consisting of 50 neurons each were used for a population coding of these time-varying input variables, and 2 "columns" consisting of

6 Since neither the training of the readouts for movement commands nor the training of the readouts for retrodiction of sensory feedback changes the neural circuit itself, it does not matter whether these readouts are trained sequentially or in parallel. In our experiments both types of readouts were trained simultaneously, while the target values of both θ1(t − Δ), θ2(t − Δ) and θ1(t − 200 ms), θ2(t − 200 ms) were given to the circuits as additional inputs during training (where Δ is the assumed actual feedback delay plotted on the x-axis in Figure 8D).
100 neurons each were added to the neural circuit and received the outputs of these two additional input arrays. The top solid line in Figure 8D shows the result, computed in the same way as in the other panels of Figure 8, for the case when the values of the estimates of θ1(t − 200 ms) and θ2(t − 200 ms) produced by the two additional readouts were not fed back into the circuit. The bottom dashed line shows the result when these estimates were available to the circuit via feedback. Although this additional feedback does not provide any new information to the circuit, but only collects and redistributes information within the neural circuit, it significantly improved the performance of the neurocontroller for all values of the actual delay of feedback about the values of θ1 and θ2 (except for Δ = 200 ms). The value at the right-most point of the lower curve, for Δ = 500 ms, shows the improvement achieved by using estimated sensory feedback when no feedback arrives at all, since the total movements lasted 500 ms. Altogether, the use of internally estimated feedback improved the precision of the movement by almost 50% for most values of the delay of the actual feedback.

7 Application to a Biological Model for Arm Control

Whereas in the preceding section we focused on a model for a robot arm as a standard example for a highly nonlinear control task, we demonstrate in this section that the same paradigm for movement generation can also be applied to a well-known model for cortical control of arm movements in primates (Todorov, 2000, 2003). This model proposes a direct relationship between the firing rate $c_j$ of an individual neuron j in primary motor cortex M1 (relative to some baseline firing rate C) and the kinematics (in Cartesian coordinates) and end point force $f_{ext}$ of the hand, which is viewed here simply as the tip of a two-joint arm:

$$
c_j(t - d) = \frac{u_j^T}{2} \left( F^{-1} f_{ext}(t) + m \ddot{x}(t) + k x(t) \right) + b\, u_j^T \dot{x}(t) .
\tag{7.1}
$$
The vector $u_j$ denotes the direction in which end point force is generated due to activation of muscles by neuron j (assuming cosine tuning of the neurons). In our simulations we simply took four unit vectors $u_j$ pointing up, down, left, and right. $f_{ext}(t)$ is the end point force that the hand applies against external objects. $x$, $\dot{x}$, and $\ddot{x}$ are the position, velocity, and acceleration of the hand, respectively (we usually write x, y for the hand position x in two-dimensional space). Although the precise relationship between the activity of neurons in motor cortex and the activation of individual muscles is extremely complicated and highly nonlinear, a derivation given in Todorov (2000) suggested that equation 7.1 provides a quite good (almost linear) local approximation to
multijoint kinematics over a small work space. As a consequence, we have applied this model only for arm movements in which the hand moves on the boundary of a 28.28 cm × 28.28 cm square. Since we are concerned only with the movement of the hand in its work space and do not require the hand to exert an end point force on external objects, $f_{ext}(t)$ can be set to 0. For simplicity, we have also set the transmission delay d from cortex to muscles to 0 (but our model would work just as well for other values of d). This simplifies the model to

$$
c_j(t) = \frac{u_j^T}{2} \left( m \ddot{x}(t) + k x(t) \right) + b\, u_j^T \dot{x}(t) .
\tag{7.2}
$$
In our computer experiments, we applied this model with the parameter values b = 10 Ns/m, k = 50 N/m, and m = 1 kg suggested in Todorov (2000). In order to produce a paradigm for arm control by cortical circuits, we took a generic cortical microcircuit model consisting of 800 neurons as described in section 2. Four readouts that received inputs from all neurons in this microcircuit model were trained by linear regression to assume the role of those four neurons in motor cortex that directly control arm muscles, resulting in hand movements according to equation 7.2 (note 7). We trained these readout neurons to produce four different hand movements along the edges of a 28.28 cm × 28.28 cm square whose diagonals were parallel to the x- and y-axis, respectively. The inputs to the neural microcircuit were the coordinates x_dest, y_dest of the desired target end point of the hand, efferent copies of the outputs c_1(t), . . . , c_4(t) of motor neurons 1, . . . , 4, and feedback x(t − 200 ms), y(t − 200 ms) about preceding hand positions with a delay of 200 ms, which is biologically realistic for visual feedback into motor cortex. The values of these eight inputs were fed into the 800-neuron microcircuit model in the same way as for the eight-input circuit discussed at the end of the preceding section. The results for this experiment are shown in Figure 10. The average deviation over 40 runs of the tip of the arm from the desired end point was 0.13 cm, with an SD of 7.2295 × 10⁻² cm. It is interesting to note that the generic neural microcircuit can also learn to generate movements for this quite different arm model. Another point of interest is that the control performance of the generic neural microcircuit is independent of the kind of feedback that it is receiving (cf. the angles in the earlier model and position coordinates in this model).
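As a concrete reading of equation 7.2, the firing rates of the four model motor neurons can be computed directly from the hand kinematics. The sketch below uses the parameter values b = 10 Ns/m, k = 50 N/m, m = 1 kg from the text; the assignment of the four unit vectors u_j to up/down/left/right rows is our own arbitrary choice for illustration.

```python
import numpy as np

# Parameters from Todorov (2000), as used in section 7.
b, k, m = 10.0, 50.0, 1.0  # Ns/m, N/m, kg

# Four preferred force directions (cosine tuning). The row order
# (up, down, left, right) is an assumption for this sketch.
U = np.array([[0.0, 1.0],    # up
              [0.0, -1.0],   # down
              [-1.0, 0.0],   # left
              [1.0, 0.0]])   # right

def motor_rates(x, x_dot, x_ddot):
    """Firing rates c_1..c_4 (relative to baseline) for hand position x,
    velocity x_dot, and acceleration x_ddot, per equation 7.2 (f_ext = 0)."""
    x, x_dot, x_ddot = map(np.asarray, (x, x_dot, x_ddot))
    return 0.5 * (U @ (m * x_ddot + k * x)) + b * (U @ x_dot)
```

For a hand held still at x = (0.1 m, 0), only the stiffness term k x contributes, exciting the "right" neuron (c_4 = 2.5) and suppressing the "left" one (c_3 = −2.5), as cosine tuning would suggest.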
7 Target trajectories of the end point of the arm were generated for training as described in note 4. Target outputs c_j(t) for the readouts were generated from these trajectories by equation 7.2.
1732
P. Joshi and W. Maass
Figure 10: Generation of an arm movement for the biological model of cortical control of muscle activations. (A) Spike raster analogous to Figure 6. (B) Solid lines denote target values, and dashed lines show the performance of simulated readouts c_1, . . . , c_4 from a simulated microcircuit in motor cortex that receives significantly delayed information about earlier hand positions as feedback (simulating visual feedback to motor cortex). Scales for c_1, . . . , c_4 in N, for x, y in m, and for the velocity of the hand in m/s.
8 How to Make the Movement Generation Noise Robust

Mathematical results from approximation theory (see the appendix of Maass et al., 2002, and Maass & Markram, 2004, for details) imply that a sufficiently large neural microcircuit model (which contains sufficiently diverse dynamic components to satisfy the separation property) can in principle, if combined with suitable static readouts, uniformly approximate any given time-invariant fading memory filter F. Additional conditions have to be met for successful applications of neural microcircuit models in closed-loop movement generation tasks such as those considered in this article. First, one has to assume that the approximation target for the neural microcircuit, some movement generator F for a plant P, is a time-invariant fading memory filter (if considered in an open loop). But without additional constraints on the plant or the target movement generator F, one cannot guarantee that neural microcircuits L that uniformly approximate F in an open loop can successfully generate similar movements of the plant P. Assume that F can be uniformly approximated by neural microcircuit models L, that is, that for every ε > 0 there exists some neural microcircuit model L so that ‖(F u)(t) − (L u)(t)‖ ≤ ε for all times t and all input functions u(·) that may enter the movement generator. Note that the feedback f from the plant has to be subsumed by these functions u(·), so that u(t) is in general of the form u(t) = ⟨u_0(t), f(t)⟩, where u_0(t) are external movement commands (note 8) and f(t) is the feedback (both u_0(t) and f(t) are in general multidimensional). Assume that such a microcircuit model L has been chosen for some extremely small ε > 0.
Even if the plant P has the common bounded input–bounded output (BIBO) property, it may magnify the differences ≤ ε between outputs from F and outputs from L (which may occur even if F and L initially receive the same input u) and produce for these two cases feedback functions f_F(s), f_L(s) whose difference is fairly large. The difference between the outputs of F and L for these different feedbacks f_F(s), f_L(s) as inputs may become much larger than ε, and hence the outputs of F and L with plant P may eventually diverge in this closed loop. This situation does in fact occur in the case of a two-joint arm as plant P. Hence, the assumption that L approximates F uniformly within ε cannot guarantee that ‖(F u_F)(t) − (L u_L)(t)‖ ≤ ε for all t (where u_F(t) := ⟨u_0(t), f_F(t)⟩ and u_L(t) := ⟨u_0(t), f_L(t)⟩), since even ‖(F u_F)(t) − (F u_L)(t)‖ may already become much larger than ε for sufficiently large t. This instability problem can be solved by training the readout from the neural circuit L to create an "attractor" around the trajectory generated by F in the noise-free case. This is possible because the current liquid state of the circuit depends not just on the most recent feedback to the circuit, but also on the preceding stream of feedbacks (therefore, the liquid state also contains information about which particular part of the movement currently has to be carried out), as well as on the target end position ⟨x_dest, y_dest⟩. If one trains the readout from circuit L to ensure that ‖(L u_F)(t) − (L u_L)(t)‖ stays small when u_F(·) and u_L(·) did not differ too much at preceding time steps, one can bound ‖(F u_F)(t) − (L u_L)(t)‖ by

‖(F u_F)(t) − (L u_F)(t)‖ + ‖(L u_F)(t) − (L u_L)(t)‖ ≤ ε + ‖(L u_F)(t) − (L u_L)(t)‖

and thereby avoid divergence of the trajectories caused by F and L in the closed-loop system. This makes clear why it was necessary to train the readouts of the neural microcircuit models L to produce the desired trajectory not just for the ideal feedback u_F(t), but also for noisy variations of u_F(t) = ⟨u_0(t), f_F(t)⟩ that represent possible functions u_L(t) arising when the approximating circuit L is used in the closed loop.

8 In our experiments, u_0(t) was a very simple two-dimensional function with value ⟨0, 0⟩ for t < 0 and value ⟨x_dest, y_dest⟩ for t ≥ 0. All other external inputs to the circuit were given only during training.

9 Discussion

Whereas traditional models for neural computation focused on constructions of neural implementations of Turing machines or other off-line computational models, more recent results have demonstrated that biologically more realistic neural microcircuit models consisting of spiking neurons and dynamic synapses are well suited for real-time computational tasks (Buonomano & Merzenich, 1995; Maass et al., 2002; Maass, Natschläger, & Markram, 2004; Natschläger & Maass, 2004). Previously, only sensory processing tasks such as speech recognition or visual movement analysis (Buonomano & Merzenich, 1995; Maass et al., 2002; Legenstein, Markram, & Maass, 2003) were considered in this context as benchmark tests for real-time computing.
In this letter, we have applied such generic neural microcircuit models for the first time in a biologically more realistic closed-loop setting, where the output of the neural microcircuit model directly influences its future inputs. Obviously, closed-loop applications of neural microcircuit models pose a harder computational challenge than open-loop sensory processing, since small imprecisions in their output are likely to be amplified by the plant to yield even larger deviations in the feedback, which is likely to further increase the imprecision of subsequent movement commands. This problem can be solved by teaching the readout from the neural microcircuit during training to ignore smaller recent deviations reported by feedback, thereby making the target trajectory of output torques an attractor in the resulting closed-loop dynamical system. After training, the learned reaching movements are generated completely autonomously by the neural circuit once it is given the target end position of the tip of the robot arm as (static) input. We have demonstrated that the capability of the neural circuit to generate reaching movements generalizes to novel target end positions of the tip of
the arm that lie between those that occurred during training (see Figure 7). These autonomously generated new reaching movements exhibit a bell-shaped velocity profile, as for the previously taught movements. We propose to view the basic arm movements that are generated in this way as possible implementations of muscle synergies, that is, of rather stereotypical movement templates (d'Avella, Saltiel, & Bizzi, 2003). In this interpretation, the learning of a larger variety of arm movements requires the superposition of time-shifted versions of several different basic movement templates of the type considered in this letter. Such learning on a higher level is a topic of current research. Surprisingly, the performance of the neural microcircuit model for generating movements deteriorates not only if the (simulated) proprioceptive feedback is delayed by more than 280 ms or if no feedback is given at all, but also if this feedback arrives without any delay. Our computer simulations suggest that the best performance of such neurocontrollers is achieved if the feedback arrives with a biologically realistic delay in the range of 50 to 280 ms. If the delay assumes other values or is missing altogether, a significant improvement in the precision of the generated reaching movements can be achieved if additional readouts from the same neural microcircuit model that generates the movements are taught to estimate the values of the feedback with an optimal delay of 200 ms, and if the results of these internally generated feedback estimates are provided as additional inputs to the circuit (see Figure 8D). Apart from these effects, which result from the interaction of the inherent circuit dynamics with the dynamics of externally or internally generated feedback, the spatial organization of information streams in the simulated neural microcircuit also plays a significant role.
The capability of such a circuit to generate movement is quite poor if information about slowly varying input variables (such as externally or internally generated feedback) is provided to the circuit in the form of the firing rate of a single neuron (not shown) rather than through population coding (see the description in section 2), as implemented for the experiments reported in this article. Another interesting point is that our model for motor control can learn to control the arm regardless of the model used to describe the dynamics of the arm movement and of the types of feedback that the circuit receives. One of the two arm models that were tested (see section 7) is a model for cortical control of muscle activation. Hence, our model also provides a new hypothesis for the computational function of neural circuits in the motor cortex. The results presented in this letter may be viewed as a first step toward an exploration of the role of the embodiment of motion-generation circuitry, that is, of concrete spatial neural circuits and their inherent temporal dynamics, in motor control. This complements existing work on the relevance of the embodiment of actuators to motor control (Pfeifer, 2002).
Appendix: Specification of Generic Neural Microcircuit Models

Neuron parameters: membrane time constant 30 ms; absolute refractory period 3 ms (excitatory neurons), 2 ms (inhibitory neurons); threshold 15 mV (for a resting membrane potential assumed to be 0); reset voltage drawn uniformly from the interval [13.8 mV, 14.5 mV] for each neuron; constant nonspecific background current I_b drawn uniformly from the interval [13.5 nA, 14.5 nA] for each neuron; noise at each time step I_noise drawn from a gaussian distribution with mean 0 and SD 1 nA; input resistance 1 MΩ. For each simulation, the initial condition of each integrate-and-fire neuron, that is, its membrane voltage at time t = 0, was drawn randomly (uniform distribution) from the interval [13.5 mV, 14.9 mV]. The probability of a synaptic connection from neuron a to neuron b (as well as that of a synaptic connection from neuron b to neuron a) was defined as C · exp(−D²(a, b)/λ²), where D(a, b) is the Euclidean distance between neurons a and b, and λ is a parameter that controls both the average number of connections and the average distance between synaptically connected neurons (we set λ = 1.2). Depending on whether the pre- and postsynaptic neurons were excitatory (E) or inhibitory (I), the value of C was set according to Gupta et al. (2000) to 0.3 (EE), 0.2 (EI), 0.4 (IE), 0.1 (II). We modeled the (short-term) dynamics of synapses according to the model proposed in Markram et al. (1998), with the synaptic parameters U (use), D (time constant for depression), and F (time constant for facilitation) randomly chosen from gaussian distributions that model empirically found data for such connections. Depending on whether a and b were excitatory (E) or inhibitory (I), the mean values of these three parameters (with D, F expressed in seconds) were chosen according to Gupta et al. (2000) to be .5, 1.1, .05 (EE); .05, .125, 1.2 (EI); .25, .7, .02 (IE); .32, .144, .06 (II).
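The distance-dependent connectivity rule above can be sketched as follows. The keying of the C values as (presynaptic type, postsynaptic type) is our assumed reading of the (EE), (EI), (IE), (II) labels, and the neuron positions are left abstract (the letter's circuits place neurons on a 3D grid following Maass et al., 2002).

```python
import numpy as np

LAMBDA = 1.2
# C values from Gupta et al. (2000); keyed as (pre_type, post_type),
# which is an assumed reading of the (EE), (EI), (IE), (II) labels.
C = {("E", "E"): 0.3, ("E", "I"): 0.2, ("I", "E"): 0.4, ("I", "I"): 0.1}

def connection_prob(pos_a, pos_b, type_a, type_b):
    """Probability C * exp(-D^2(a, b) / lambda^2) of a synapse from a to b,
    where D(a, b) is the Euclidean distance between the two neurons."""
    d2 = float(np.sum((np.asarray(pos_a) - np.asarray(pos_b)) ** 2))
    return C[(type_a, type_b)] * np.exp(-d2 / LAMBDA ** 2)

def sample_connection(rng, pos_a, pos_b, type_a, type_b):
    """Draw the existence of the synapse a -> b as a Bernoulli sample."""
    return rng.random() < connection_prob(pos_a, pos_b, type_a, type_b)
```

With λ = 1.2, two excitatory neurons at unit distance connect with probability 0.3 · e^(−1/1.44) ≈ 0.15, so connectivity stays predominantly local.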
The SD of each of these parameters was chosen to be 50% of its mean. The mean of the scaling parameter A (in nA) was chosen to be 70 (EE), 150 (EI), −47 (IE), −47 (II). In the case of input synapses, the parameter A had a value of 70 nA if projecting onto an excitatory neuron and −47 nA if projecting onto an inhibitory neuron. The SD of the A parameter was chosen to be 70% of its mean, and A was drawn from a gamma distribution. The postsynaptic current was modeled as an exponential decay exp(−t/τ_s), with τ_s = 3 ms (τ_s = 6 ms) for excitatory (inhibitory) synapses. The transmission delays between neurons were chosen uniformly to be 1.5 ms (EE) and 0.8 ms for the other connections. We applied the following input convention. Each input variable is first scaled into the range [0, 1]. This range is linearly mapped onto an array of 50 symbolic input neurons. At each time step, one of these 50 neurons, whose number n(t) ∈ {1, . . . , 50} reflects the current normalized value i_n(t) ∈ [0, 1] of the input variable i(t) (e.g., n(t) = 1 if i_n(t) = 0, and n(t) = 50 if i_n(t) = 1), outputs at time t the value of i(t). In addition, the three closest neighbors on both sides of neuron n(t) in this
linear array get activated at time t by a scaled-down amount according to a gaussian function (the neuron with number n outputs at time step t the value i(t) · (1/(σ√2π)) e^(−(n − n(t))²/(2σ²)), where σ = 0.8).

Acknowledgments

This work was partially supported by the Austrian Science Fund FWF, project P15386, and PASCAL, project IST2002-506778, of the European Union.

References

Buonomano, D. V., & Merzenich, M. M. (1995). Temporal information transformed into a spatial code by a neural network with realistic properties. Science, 267, 1028–1030.

Craig, J. J. (1989). Introduction to robotics: Mechanics and control (2nd ed.). Reading, MA: Addison-Wesley.

d'Avella, A., Saltiel, P., & Bizzi, E. (2003). Combination of muscle synergies in the construction of a natural motor behavior. Nature Neuroscience, 6(3), 300–308.

Flash, T., & Hogan, N. (1985). The coordination of arm movements: An experimentally confirmed mathematical model. Journal of Neuroscience, 5(7), 1688–1703.

Gupta, A., Wang, Y., & Markram, H. (2000). Organizing principles for a diversity of GABAergic interneurons and synapses in the neocortex. Science, 287, 273–278.

Ijspeert, A. J., Nakanishi, J., & Schaal, S. (2003). Learning attractor landscapes for learning motor primitives. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15 (pp. 1547–1554). Cambridge, MA: MIT Press.

Jäger, H. (2002). Tutorial on training recurrent neural networks, covering BPTT, RTRL, EKF and the "echo state network" approach (GMD Report 159). German National Research Center for Information Technology.

Jäger, H., & Haas, H. (2004). Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication. Science, 304, 78–80.

Joshi, P., & Maass, W. (2004). Movement generation and control with generic neural microcircuits. In A. J. Ijspeert, M. Murata, & N. Wakamiya (Eds.), Biologically inspired approaches to advanced information technology (pp. 258–273).

Legenstein, R.
A., Markram, H., & Maass, W. (2003). Input prediction and autonomous movement analysis in recurrent circuits of spiking neurons. Reviews in the Neurosciences (Special Issue on Neuroinformatics of Neural and Artificial Computation), 14(1–2), 5–19. Available online: http://www.igi.tugraz.at/maass/publications.html.

Maass, W., Legenstein, R. A., & Bertschinger, N. (2004). Methods for estimating the computational power and generalization capability of neural microcircuits. Submitted for publication. Available online: http://www.igi.tugraz.at/maass/publications.html.
Maass, W., & Markram, H. (2004). On the computational power of recurrent circuits of spiking neurons. Journal of Computer and System Sciences, 69, 593–616. Available online: http://www.igi.tugraz.at/maass/publications.html.

Maass, W., Natschläger, T., & Markram, H. (2002). Real-time computing without stable states: A new framework for neural computation based on perturbations. Neural Computation, 14(11), 2531–2560. Available online: http://www.igi.tugraz.at/maass/publications.html.

Maass, W., Natschläger, T., & Markram, H. (2004). Computational models for generic cortical microcircuits. In J. Feng (Ed.), Computational neuroscience: A comprehensive approach (pp. 575–605). Boca Raton, FL: Chapman & Hall/CRC. Available online: http://www.igi.tugraz.at/maass/publications.html.

Markram, H., Wang, Y., & Tsodyks, M. (1998). Differential signaling via the same axon of neocortical pyramidal neurons. Proc. Natl. Acad. Sci., 95, 5323–5328.

Natschläger, T., & Maass, W. (2004). Information dynamics and emergent computation in recurrent circuits of spiking neurons. In S. Thrun, L. Saul, & B. Schölkopf (Eds.), Advances in neural information processing systems, 16 (pp. 1255–1262). Cambridge, MA: MIT Press. Available online: http://www.igi.tugraz.at/maass/publications.html.

Pfeifer, R. (2002). On the role of embodiment in the emergence of cognition: Grey Walter's turtles and beyond. In Proc. of the Workshop "The Legacy of Grey Walter." Bristol. Available online: http://www.ifi.unizh.ch/groups/ailab/.

Pouget, A., & Latham, P. E. (2003). Population codes. In M. A. Arbib (Ed.), The handbook of brain theory and neural networks (pp. 893–897). Cambridge, MA: MIT Press.

Slotine, J. J. E., & Li, W. (1991). Applied nonlinear control. Englewood Cliffs, NJ: Prentice Hall.

Todorov, E. (2000). Direct control of muscle activation in voluntary arm movements: A model. Nature Neuroscience, 3(4), 391–398.

Todorov, E. (2003). On the role of primary motor cortex in arm movement control. In M.
L. Latash & M. Levin (Eds.), Progress in motor control III (pp. 125–166). Champaign, IL: Human Kinetics.

van Beers, R. J., Baraduc, P., & Wolpert, D. M. (2002). Role of uncertainty in motor control. Phil. Trans. R. Soc. Lond. B, 357, 1137–1145.

Wessberg, J., Stambaugh, C. R., Kralik, J. D., Beck, P. D., Laubach, M., Chapin, J. K., Kim, J., Biggs, S. J., Srinivasan, M. A., & Nicolelis, M. A. (2000). Real-time prediction of hand trajectory by ensembles of cortical neurons in primates. Nature, 408(6810), 361–365.
Received March 10, 2004; accepted December 8, 2004.
LETTER
Communicated by Angela Yu
Cognitive Enhancement Mediated Through Postsynaptic Actions of Norepinephrine on Ongoing Cortical Activity

Osamu Hoshino
[email protected] Department of Intelligent Systems Engineering, Ibaraki University, Nakanarusawa 4-12-1, Hitachi-shi, Ibaraki 316-8511, Japan
We propose two distinct types of norepinephrine (NE)-neuromodulatory systems: an enhanced-excitatory and enhanced-inhibitory (E-E/E-I) system and a depressed-excitatory and enhanced-inhibitory (D-E/E-I) system. In both systems, inhibitory synaptic efficacies are enhanced, but excitatory ones are modified in contradictory ways: the E-E/E-I system enhances excitatory synaptic efficacies, whereas the D-E/E-I system depresses them. The E-E/E-I and D-E/E-I systems altered the dynamic property of ongoing (background) neuronal activity and greatly influenced the cognitive performance (S/N ratio) of a cortical neural network. The E-E/E-I system effectively enhanced the S/N ratio for weaker stimuli with lower doses of NE, whereas the D-E/E-I system enhanced it for stronger stimuli with higher doses of NE. The neural network responded effectively to weaker stimuli if brief γ-bursts were involved in the ongoing neuronal activity controlled by the E-E/E-I neuromodulatory system. If the E-E/E-I and the D-E/E-I systems interact within the neural network, depressed neurons, whose activity is depressed by NE application, have a bimodal property: the S/N ratio can be enhanced not only for stronger stimuli, as in their original property, but also for weaker stimuli, for which coincidental firings among enhanced neurons, whose activity is enhanced by NE application, are essential. We suggest that the recruitment of the depressed neurons for the detection of weaker (subthreshold) stimuli might be advantageous for the brain in coping with a variety of sensory stimuli.
1 Introduction

It is widely accepted that the release of norepinephrine (NE) into target brain areas through noradrenergic (e.g., locus coeruleus, LC) pathways facilitates the efficacies of both excitatory and inhibitory synaptic transmissions within the targeted neuronal circuits (Woodward, Moises, Waterhouse, Hoffer, & Freedman, 1979; Woodward, Moises, Waterhouse, Heh, & Cheun, 1991; Waterhouse, Mouradian, Sessler, & Lin, 2000; Usher & Davelaar, 2002). NE binds to α and β adrenoceptors of neurons, activates second messenger

Neural Computation 17, 1739–1775 (2005)
© 2005 Massachusetts Institute of Technology
1740
O. Hoshino
systems, and augments the efficacies of excitatory (e.g., glutamatergic) and inhibitory (e.g., GABAergic) synaptic transmissions (Waterhouse, Moises, & Woodward, 1981; Waterhouse, Moises, Yeh, & Woodward, 1982; Mouradian, Sessler, & Waterhouse, 1991). Such NE-induced neuromodulation has been well demonstrated for cortical pyramidal cells (Sessler et al., 1995; Waterhouse et al., 2000). Concerning NE-induced modulation of excitatory synaptic transmissions, recent studies (Waterhouse, Devilbiss, et al., 1998; Waterhouse, Moises, & Woodward, 1998; Devilbiss & Waterhouse, 2000) have investigated dose-dependent actions of NE on glutamate-evoked discharges of layer II/III and V neurons of the somatosensory cortex. They found two distinct profiles of actions on the responsiveness of neurons to excitatory synaptic input. One was enhanced synaptic modulation, in which glutamate-evoked discharges of neurons were progressively enhanced as the dose level of NE increased. The other was depressed synaptic modulation, in which glutamate-evoked discharges of neurons were monotonically depressed as the dose level of NE increased. Although the augmentation of the efficacies of excitatory synaptic transmissions may provide cortical neurons with enhanced detectability of sensory stimuli, less is known about what roles the depression of the excitatory synaptic efficacies plays in processing sensory information. It also seems important to ask how the two contradictory (enhanced and depressed) NE-induced excitatory synaptic modulations operate in the cortex and influence its cognitive performance. There has been ample evidence that cortical neurons are not silent but active, that is, emit action potentials spontaneously even without external stimulation. Until recently, this self-generated ongoing neuronal activity had been regarded as noise or meaningless.
However, recent experiments have demonstrated that ongoing cortical activity contains important information and greatly influences subsequent cognitive processes (Engel, Fries, & Singer, 2001; Arieli, Shoham, Hildesheim, & Grinvald, 1995; Arieli, Sterkin, Grinvald, & Aertsen, 1996). Researchers have suggested that ongoing neuronal activity might play important roles in processing relevant sensory information. Recent neurophysiological experiments (Kasamatsu & Heggelund, 1982; Waterhouse et al., 1988; Waterhouse et al., 2000) have shown that NE application affects the ongoing property of cortical neurons, reducing their activity. Researchers have suggested that the reduction in ongoing neuronal activity is essential for the enhancement of evoked-to-background neuronal activity ratio, or signal-to-noise (S/N) ratio. A variety of S/N enhancements has been reported, for example, NE suppresses ongoing (background) neuronal activity more than stimulus-induced (evoked) neuronal activity (Moises, Woodward, Hoffer, & Freedman, 1979; Waterhouse & Woodward, 1980; Sessler, Cheng, & Waterhouse, 1988) or NE enhances stimulus-induced neuronal activity, keeping its ongoing activity almost unchanged (Waterhouse
Cognitive Enhancement Mediated Through Norepinephrine
& Woodward, 1980). It seems likely that although NE-enhanced inhibitory synaptic efficacies are important for reducing ongoing neuronal activity (or noise) (Leventhal, Wang, Pu, Zhou, & Ma, 2003) and therefore for enhancing the S/N ratio, a balance between excitatory and inhibitory synaptic modulations might be crucial for improving the S/N ratio. The purpose of this study is to clarify how the two contradictory (enhanced and depressed) NE-induced modulations of excitatory synaptic efficacies operate in processing sensory information in the cortex. We investigate how the doses of NE into the cortex affect the ongoing neuronal activity and subsequent cortical information processing. We construct a neural network model that is simple but based on the essential neuronal architecture of the cortex. The network consists of neuron units, each composed of a pyramidal cell (PYC), a small basket cell (SBC), and a large basket cell (LBC). The PYC and the SBC are reciprocally connected via a positive (PYC-to-SBC) and a negative (SBC-to-PYC) synapse. The PYC positively synapses on the LBC. A group of PYCs forms a cell assembly whose collective activation expresses information about a specific sensory feature. The PYCs within cell assemblies are connected with each other via positive synapses, and the LBCs negatively synapse on the PYCs of other cell assemblies, by which the activities of the cell assemblies tend to be mutually inhibited. This inhibitory mechanism might be supported by the notion that neurons of primate sensory areas are often excited by mnemonic traces of preferred targets and inhibited by those of nonpreferred targets (Funahashi, Bruce, & Goldman-Rakic, 1989; Wilson, O'Scalaidhe, & Goldman-Rakic, 1994; Rao, Williams, & Goldman-Rakic, 1999). The neural network model has ongoing neuronal activity, in which these cell assemblies are dynamically and temporarily activated or deactivated in a self-generated manner.
The duration and sequential order of the emergence of each dynamic cell assembly are completely random. When a given cell assembly is temporarily active, its PYCs fire action potentials in weak synchrony. When the network is stimulated with a relevant sensory feature, the dynamic cell assembly corresponding to the stimulus tends to be selectively activated. During the stimulation period, the PYCs of the assembly continuously emit action potentials in strong synchrony, while the other cell assemblies tend to be suppressed through lateral inhibitory connections (LBC-to-PYC). We propose here two distinct NE-neuromodulatory systems: (1) an enhanced-excitatory and enhanced-inhibitory (E-E/E-I) system and (2) a depressed-excitatory and enhanced-inhibitory (D-E/E-I) system. In the simulations, either the E-E/E-I or the D-E/E-I neuromodulatory system operates, through which the efficacies of excitatory and inhibitory synapses are modified depending on the dose level of NE. Injecting NE into the network at various dose levels, we record individual neuronal activities under both ongoing (background) and stimulus-induced (evoked) neuronal states and calculate the S/N ratio. By analyzing these data, we try to understand how the
two contradictory NE-neuromodulatory systems operate in controlling the ongoing cortical state and in subsequent sensory information processing. Two distinct types of NE-mediated neuromodulation, that is, enhanced and depressed excitatory glutamate-evoked responses, have been observed within the same cortical area, the rat barrel cortex (Devilbiss & Waterhouse, 2000). This led us to hypothesize that the two neuromodulatory systems (E-E/E-I and D-E/E-I) may interact. A simple hypothetical neural network model is proposed, into which the E-E/E-I and D-E/E-I systems are integrated. The PYCs of the E-E/E-I system receive enhanced synaptic inputs, and those of the D-E/E-I system receive depressed synaptic inputs from other PYCs. We investigate whether the two neuromodulatory systems cooperate in controlling the ongoing cortical state and in processing sensory information. Cortical fast oscillations such as γ-oscillations (40–70 Hz) are believed to play important roles in cognitive processing in the brain. For example, Pulvermüller and colleagues (Pulvermüller, Keil, & Elbert, 1999) have suggested that cortical oscillations at higher frequencies reflect the presence of learned associative representations. We show that the E-E/E-I neuromodulatory system contributes to the generation of such γ-oscillations in ongoing neuronal activity. The ongoing γ-oscillatory state may be regarded as a ready neuronal state preparing for a subsequent sensory input, in which information about the learned sensory features is frequently accessed as γ-bursts. We investigate whether the ongoing γ-oscillations play a significant role in processing sensory information in the cortex.

2 Neural Network Model

2.1 Model Structure. In the cortex, inhibitory interneurons are extremely diverse in morphology and electrophysiology (Fairen, DeFelipe, & Regidor, 1984; Gupta, Wang, & Markram, 2000; Krimer & Goldman-Rakic, 2001).
Inhibitory interneurons synapse on other neurons and exert inhibitory effects on their dynamic behaviors. Among interneurons, basket cells are known to selectively synapse on pyramidal cells (PYCs) (Somogyi, Kisvarday, Martin, & Whitteridge, 1983), which are believed to play major roles in neural information processing in the brain. Basket cells are diverse especially in their axonal arborizations and are classified roughly into two subclasses: large basket cells (LBCs) and small basket cells (SBCs) (Wang, Gupta, Toledo-Rodriguez, Wu, & Markram, 2002), or wide arbor cells and local arbor cells (Krimer & Goldman-Rakic, 2001). The LBCs and SBCs have, respectively, aspiny wide (up to ∼1000 µm) and short (up to ∼300 µm) axonal arbors that tend to synapse on the somata of target cells. Nest basket cells (Wang et al., 2002) or medium arbor cells (Krimer & Goldman-Rakic, 2001) could be placed into a third subclass, whose axonal arborizations are in a medium range, ∼500 µm. The differential range of influence due to such a difference in length of
Cognitive Enhancement Mediated Through Norepinephrine
1743
axonal arbors has a great impact on cortical neuronal information processing (Krimer & Goldman-Rakic, 2001). Researchers (Sillito, 1984; Eysel, 1992) have suggested that lateral inhibition between PYCs, which is mediated by interneurons through their axonal arbors, plays an important role in various cognitive processes, such as tuning of excitatory neurons to sound frequency in the primary auditory cortex (Yost, 1994) and orientation or direction tuning (Eysel, Shevelev, Lazareva, & Sharaev, 1998) and cross-orientation inhibition (Kisvarday, Kim, Eysel, & Bonhoeffer, 1994) in the primary visual cortex. Goldman-Rakic and colleagues (Wilson et al., 1994; Rao et al., 1999; Rao, Williams, & Goldman-Rakic, 2000) have demonstrated that even in higher cortical regions (e.g., prefrontal cortex), lateral inhibitory mechanisms could operate in shaping memory fields of neurons when monkeys are engaged in working memory tasks. The LBCs may mediate such a lateral inhibitory effect through their horizontal axonal arbors. Based on the above observations, I construct a neural network model for the cortex (see Figure 1A). The model consists of an input (IP) and an output (OP) network. Feature stimuli Fn (n = 1, 2, 3, 4, 5) activate their corresponding groups of IP neurons (ellipses), whose action potentials are sent to the OP network via divergent and convergent feedforward projections (the solid lines) and activate corresponding cell assemblies (the circles). As shown in Figure 1B, the OP network consists of neuron units, each of which contains a PYC (the large triangle), an SBC (the small circle), and an LBC (the large circle). In each unit, the PYC and the SBC are reciprocally connected by a positive (PYC-to-SBC) and a negative (SBC-to-PYC) synapse. The PYC positively synapses on the LBC. Groups of PYCs form cell assemblies. PYCs within cell assemblies are connected with each other via positive synapses, and there is no connection between PYCs across cell assemblies.
LBCs negatively synapse on the PYCs of the other cell assemblies through lateral (LBC-to-PYC) connections. I assume here a primary cortical area whose neurons have tuning properties to specific sensory features. To make the PYCs feature selective, I create in the OP network multiple dynamic cell assemblies that are spatially separated from each other (see Figure 1A). Due to this separability, the dynamics of the OP network allow a given cell assembly to be selectively activated against the others when its corresponding feature stimulus is presented to the IP network. For simplicity, the IP network contains only projection neurons (PNs), between which there is no connection. That is, the IP network works exclusively as an input layer.

2.2 Model Description. The dynamic evolution of the membrane potentials of PNs, PYCs, LBCs, and SBCs is defined, respectively, by

$$\tau_{PN}\frac{du_i^{PN}(t)}{dt} = -\left(u_i^{PN}(t) - u_{rest}^{PN}\right) + I_i^{PN}(t), \qquad (2.1)$$
1744
O. Hoshino
Figure 1: (A–B) Structure of the neural network model. (A) The model consists of an input (IP) and an output (OP) network. Feature stimuli Fn (n = 1, 2, 3, 4, 5) are applied to corresponding groups of IP neurons (the ellipses), whose action potentials are sent to the OP network via divergent and convergent feedforward projections (the solid lines) and activate corresponding cell assemblies (the circles). NE (norepinephrine) is dosed into the OP network. (B) The OP network consists of neuron units, each composed of a set of a PYC (the large triangle), an SBC (the small circle), and an LBC (the large circle). Open and filled small triangles denote excitatory and inhibitory synapses, respectively. (C–D) Schematic drawings of synaptic modulation rates as a function of the dose level of NE ([NE]). (C) Enhanced modulations for excitatory (the solid line) and inhibitory (the dotted line) synaptic efficacies. (D) Depressed modulation for excitatory synaptic efficacies.
$$\tau_{PY}\frac{du_i^{PY}(t)}{dt} = -\left(u_i^{PY}(t) - u_{rest}^{PY}\right) + \sum_{j=1}^{N_{OP}} w_{ij}^{PY,PY}(t)\,S_j^{PY}(t) - \sum_{j=1}^{N_{OP}} w_{ij}^{PY,LB}(t)\,S_j^{LB}(t) - w_i^{PY,SB}(t)\,S_i^{SB}(t) + \sum_{j=1}^{N_{IP}} L_{ij}^{OP,IP}\,S_j^{PN}(t), \qquad (2.2)$$
$$\tau_{LB}\frac{du_i^{LB}(t)}{dt} = -\left(u_i^{LB}(t) - u_{rest}^{LB}\right) + w_i^{LB,PY}\,S_i^{PY}(t), \qquad (2.3)$$
$$\tau_{SB}\frac{du_i^{SB}(t)}{dt} = -\left(u_i^{SB}(t) - u_{rest}^{SB}\right) + w_i^{SB,PY}\,S_i^{PY}(t), \qquad (2.4)$$
and their action potentials are generated according to

$$\mathrm{Prob}\left[S_i^Y(t) = 1\right] = f_Y\left[u_i^Y(t)\right], \qquad f_Y[u] = \frac{1}{1 + e^{-\eta_Y(u - \theta_Y)}} \quad (Y = PN, PY, LB, SB), \qquad (2.5)$$
where $u_i^Y(t)$ is the membrane potential of the ith Y (Y = PN, PY, LB, SB) neuron at time t, whose time constant is $\tau_Y$. $u_{rest}^Y$ is the resting potential of a Y neuron. $I_i^{PN}(t)$ is an external input to the ith PN with an intensity ε (a positive constant). $N_{OP}$ and $N_{IP}$ are the numbers of neuron units of the OP and IP networks, respectively. $w_{ij}^{X,Y}$ is a synaptic strength from the jth Y neuron to the ith X (X = PY, LB, SB) neuron. $w_i^{X,Y}$ is a synaptic strength from neuron Y to neuron X of unit i. $L_{ij}^{OP,IP}$ is a connection strength of the divergent and convergent feedforward projections from the jth PN to the ith PYC and is defined by

$$L_{ij}^{OP,IP} = \begin{cases} 1.0 & \text{between } F_n\text{-sensitive PNs and PYCs: } n = 1\text{–}5 \\ 0.0 & \text{otherwise.} \end{cases} \qquad (2.6)$$
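Equation 2.6 amounts to a block-structured 0/1 feedforward matrix. A minimal sketch follows; the contiguous ordering of the feature groups and the group size of 10 are my assumptions for illustration, not specified by the equation itself:

```python
import numpy as np

N_IP = N_OP = 100          # network sizes used in the paper
N_FEATURES = 5             # feature stimuli F1..F5
GROUP = 10                 # assumed: 10 Fn-sensitive PNs/PYCs per feature, contiguous

# L[i, j] = 1.0 if PN j and PYC i are sensitive to the same feature Fn, else 0.0
L = np.zeros((N_OP, N_IP))
for n in range(N_FEATURES):
    idx = slice(n * GROUP, (n + 1) * GROUP)
    L[idx, idx] = 1.0

# feedforward drive to the PYCs given PN spikes S_PN (last term of eq. 2.2)
S_PN = np.zeros(N_IP)
S_PN[10:20] = 1.0          # the F2-sensitive PN group is firing
ff_input = L @ S_PN        # only F2-sensitive PYCs receive input
```

With this layout, only the rows of the F2 block receive a nonzero feedforward sum, which is what makes the corresponding cell assembly selectively activatable.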
Equation 2.5 defines the probability of neuronal firing; that is, the probability that $S_i^Y(t) = 1$ is given by $f_Y$; otherwise, $S_i^Y(t) = 0$. $\eta_Y$ and $\theta_Y$ are, respectively, the steepness and the threshold of the sigmoid function $f_Y$ for a Y neuron. When the ith neuron fires, $S_i^Y(t)$ takes on a value of 1 for 1 msec, which is followed by 0 for another 1 msec. This is a simplified representation of an action potential and a refractory period and gives a maximal firing rate (500 Hz). After firing, the membrane potential is reset to the resting potential. Unless otherwise stated, $\tau_{PN} = \tau_{PY} = \tau_{LB} = 50$ msec, $\tau_{SB} = 10$ msec, $u_{rest}^{PN} = u_{rest}^{SB} = -0.6$, $u_{rest}^{PY} = u_{rest}^{LB} = -0.5$, $N_{IP} = N_{OP} = 100$, $\eta_{PN} = \eta_{PY} = 10.2$, $\eta_{LB} = \eta_{SB} = 21.0$, $\theta_{PN} = \theta_{PY} = 0.1$, and $\theta_{LB} = \theta_{SB} = 0.3$. As will be explained in section 2.3, $w_{ij}^{PY,PY}(t)$, $w_{ij}^{PY,LB}(t)$, and $w_i^{PY,SB}(t)$ are modified depending on the level of NE application; $w_i^{LB,PY}$ and $w_i^{SB,PY}$ are not modulated and are set to a constant value (20.0). Each cell assembly consists of 10 neuron units. When the IP network is stimulated with feature $F_n$ (n = 1, 2, 3, 4, or 5), the group of IP neurons corresponding to sensory feature $F_n$ (see the ellipse indicated by $F_n$ in Figure 1A) receives an input of intensity ε ($I_i^{PN}(t) = ε$), and the other IP neurons receive no input ($I_i^{PN}(t) = 0.0$). These parameter
values were chosen for suitable network performance, by which I could clearly demonstrate the essential neuronal behaviors of the network. In section 3.3, I show in detail how sensitive the dynamic behavior of the network is to these parameters. It is well known that excitatory and inhibitory neurons frequently form reciprocal synaptic connections (Martin, 2002). Zilberter (2000) has reported that 75% (n = 80) of pairs of a PYC and a fast-spiking interneuron form reciprocal synaptic connections. The fast-spiking interneurons might be SBCs or chandelier cells (Krimer & Goldman-Rakic, 2001). Such reciprocal connections may mediate self-inhibition of PYCs through feedback inputs from their accompanying SBCs as the activities of the PYCs increase. When an input current was applied to the soma of SBCs, the SBCs generated spikes at a higher firing rate, while PYCs and LBCs fired at a lower rate (Krimer & Goldman-Rakic, 2001). The larger ($\tau_{PY} = \tau_{LB} = 50$ msec) and smaller ($\tau_{SB} = 10$ msec) membrane time constants are functional representations of such slow- and fast-spiking cortical neurons, respectively.

2.3 NE Neuromodulatory System. As schematically shown in Figure 1C, the efficacies of both excitatory (the solid line) and inhibitory (the dotted line) synapses are enhanced as a function of the dose level, or concentration, of norepinephrine ([NE]). The enhanced excitatory ($w_{ij}^{PY,PY}(t)$) and inhibitory ($w_{ij}^{PY,LB}(t)$, $w_i^{PY,SB}(t)$) synaptic modulations are described by the following equations:

$$\frac{dw_{ij}^{PY,PY}(t)}{dt} = \alpha_{PY}\left([NE]_0 - [NE]\right)[NE] - \beta_{PY}\left(w_{ij}^{PY,PY}(t) - w_0^{PY,PY}\right), \qquad (2.7)$$

$$\frac{dw_{ij}^{PY,LB}(t)}{dt} = \alpha_{LB}[NE] - \beta_{LB}\left(w_{ij}^{PY,LB}(t) - w_0^{PY,LB}\right), \qquad (2.8)$$

$$\frac{dw_i^{PY,SB}(t)}{dt} = \alpha_{SB}[NE] - \beta_{SB}\left(w_i^{PY,SB}(t) - w_0^{PY,SB}\right). \qquad (2.9)$$
The depressed excitatory synaptic modulation between PYCs is schematically depicted in Figure 1D and described by

$$\frac{dw_{ij}^{PY,PY}(t)}{dt} = -\alpha_{PY}[NE] - \beta_{PY}\left(w_{ij}^{PY,PY}(t) - w_0^{PY,PY}\right). \qquad (2.10)$$
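Setting the left-hand sides of equations 2.7 to 2.10 to zero gives the steady-state synaptic efficacies as functions of [NE]. A short sketch using the parameter values quoted in this section reproduces the inverted-U and monotonic profiles of Figures 1C and 1D (the function names are mine):

```python
import numpy as np

# parameter values from section 2.3
alpha_PY, alpha_LB, beta = 3.0, 0.8, 1.0
w0_PY, w0_LB, NE0 = 7.0, 0.1, 2.0

def w_ee(ne):
    """Eq. 2.7 at steady state: enhanced-excitatory, inverted-U in [NE]."""
    return w0_PY + (alpha_PY / beta) * (NE0 - ne) * ne

def w_ei(ne):
    """Eq. 2.8 at steady state: enhanced-inhibitory, monotonic increase."""
    return w0_LB + (alpha_LB / beta) * ne

def w_de(ne):
    """Eq. 2.10 at steady state: depressed-excitatory, monotonic decrease."""
    return w0_PY - (alpha_PY / beta) * ne

ne = np.linspace(0.0, 2.0, 81)
peak_ne = ne[np.argmax(w_ee(ne))]   # the inverted-U peaks at [NE] = [NE]0 / 2 = 1.0
```

The peak of the enhanced-excitatory efficacy at [NE] = 1.0 is the algebraic source of the intermediate-dose optimum discussed in section 3.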
The inverted-U (the solid line in Figure 1C) and the monotonic-decrease (see Figure 1D) shapes for the excitatory synaptic modulation between PYCs are based on observed results (Waterhouse, Devilbiss, et al., 1998; Waterhouse, Moises, et al., 1998; Devilbiss & Waterhouse, 2000). The
“monotonic-increase” shape for the inhibitory synaptic modulation (the dotted line in Figure 1C) is a simple hypothetical representation based on observed results (Waterhouse et al., 1981, 1982; Mouradian et al., 1991). In this study, we focus especially on the postsynaptic (PYC-to-PYC, LBC-to-PYC, and SBC-to-PYC) actions of NE on the activities of PYCs, which play major roles in cognitive information processing in the cortex. For simplicity, we did not modulate the other excitatory synapses, PYC-to-LBC and PYC-to-SBC. Unless otherwise stated, $w_0^{PY,PY} = 7.0$, $w_0^{PY,LB} = 0.1$, $w_0^{PY,SB} = 30.0$, $\alpha_{PY} = 3.0$, $\alpha_{LB} = 0.8$, $\alpha_{SB} = 60.0$, $\beta_{PY} = \beta_{LB} = \beta_{SB} = 1.0$, and $[NE]_0 = 2.0$.

3 Results

3.1 Neural Bases of NE-Induced Neuromodulation. As shown by the raster plots of action potentials in Figure 2A, the PYCs have ongoing (background) activity when no external stimulus and no dose of NE is applied. An initial network state $\{S_i^Y(t=0) : Y = PY, LB, SB\}$ was chosen randomly, where the probability of firing an action potential ($S_i^Y(0) = 1$) was set to 0.1. The random and brief emergence of the five (F1–F5) dynamic cell assemblies, or population activation of PYCs, characterizes the ongoing neuronal activity. The temporal formation of each dynamic cell assembly arises from mutual excitation between the PYCs within cell assemblies. The brief nature of the dynamic cell assemblies arises largely from the self-inhibition mediated through their accompanying SBCs. Due to this self-inhibitory mechanism, the more the PYCs fire action potentials, the more their activities tend to be suppressed. When the IP network is stimulated with a sensory feature (F2), whose duration is indicated by a horizontal bar in Figure 2A, the PYCs of the cell assembly corresponding to the stimulus are activated and emit a long burst of action potentials. After the input is switched off, the state of the OP network returns to the ongoing state.
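The simulation loop implied by equations 2.2 and 2.5 can be sketched as follows. This is a simplified, hypothetical re-implementation for illustration only: it keeps recurrent excitation within one cell assembly plus a constant external drive, omits the LBC/SBC inhibitory terms, and uses a toy recurrent weight; the PYC parameter values are those quoted in section 2.2:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(u, eta=10.2, theta=0.1):
    """Sigmoid firing probability of eq. 2.5 (PYC parameters)."""
    return 1.0 / (1.0 + np.exp(-eta * (u - theta)))

def step(u, S, refrac, w, I=0.0, tau=50.0, u_rest=-0.5, dt=1.0):
    """One 1-ms Euler update: membrane dynamics, stochastic firing,
    and a 1-ms spike followed by a 1-ms refractory period (max 500 Hz)."""
    u = u + dt * (-(u - u_rest) + w @ S + I) / tau
    fire = (rng.random(u.size) < f(u)) & (refrac == 0)
    S = fire.astype(float)
    refrac = np.where(fire, 1, np.maximum(refrac - 1, 0))
    u = np.where(fire, u_rest, u)          # reset to rest after firing
    return u, S, refrac

# drive 10 mutually excitatory PYCs (one "cell assembly") for 200 ms
n = 10
u = np.full(n, -0.5)
S = np.zeros(n)
refrac = np.zeros(n, dtype=int)
w = np.full((n, n), 0.1)                   # toy recurrent excitation
spikes = np.zeros(n)
for _ in range(200):
    u, S, refrac = step(u, S, refrac, w, I=1.0)
    spikes += S
```

Because a spike occupies 1 ms and is followed by 1 ms of silence, no neuron in this sketch can exceed the 500 Hz ceiling stated in the text.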
Note that the other dynamic cell assemblies (F1, F3, F4, and F5) tend to emerge frequently during the stimulation period. This indicates that the lateral inhibition across dynamic cell assemblies, which is mediated through LBC-to-PYC inhibitory connections, is not so strong under the original condition, that is, at [NE] = 0.0. Figures 2B and 2C show how the dynamic behavior of the network is modulated by the enhanced-excitatory (see the solid line in Figure 1C) and enhanced-inhibitory (see the dotted line in Figure 1C) system, the E-E/E-I system. The duration of each brief burst under the ongoing state decreases as the dose level of NE ([NE]) increases (Figure 2A → 2B → 2C), which is due largely to the enhanced self-inhibition of PYCs through SBC-to-PYC feedback connections. Note that the activation of the dynamic cell assemblies tends to become temporally separated as [NE] increases; that is, the assemblies are not likely to overlap in the time course. This is due largely to the enhanced lateral inhibition through LBC-to-PYC connections. Such
Figure 2: Dependence of the dynamic behavior of the OP network on dose levels of NE ([NE]). Raster plots of PYC action potentials of cell assemblies that are sensitive to features F1–5 are shown. (A) NE is not dosed, or [NE] = 0.0. A horizontal bar indicates the stimulus (F2) presentation period. (B–C) NE-induced neuromodulation operated under the E-E/E-I system. (D–E) NE-induced neuromodulation operated under the D-E/E-I system.
temporal segregation of dynamic cell assemblies is essential for processing the applied feature stimulus (F2): as feature-detection neurons of an early sensory cortex, the PYCs must respond selectively to a specific feature stimulus, while the other PYCs are not allowed to respond, or must emit fewer action potentials. Note that although the ongoing PYC activity decreases as [NE] increases, the synchronous PYC activity within cell assemblies is well preserved (see Figure 2C). The term synchronous activity implies that the PYCs within cell assemblies generate action potentials almost at the same time. Figures 2D and 2E show how the dynamic behavior of the network is modulated by the depressed-excitatory (see Figure 1D) and enhanced-inhibitory (see the dotted line of Figure 1C) system, the D-E/E-I system. Both the ongoing and the stimulus-induced activities tend to decrease as [NE] increases (Figure 2A → 2D → 2E), and synchronicity among action potentials within cell assemblies progressively disappears. Such desynchronization in PYC activity is due largely to the depression of excitatory synaptic connections between pyramidal cells. We evaluated the cognitive performance of the network in terms of the evoked-to-background PYC activity ratio, or [stimulus-induced firing rate of PYCs]/[ongoing firing rate of PYCs]. We applied the same feature (F2) stimulus with various stimulus intensities: ε = 0.3 (strong; see Figure 3A), ε = 0.05 (weak; see Figure 3B), and ε = 0.02 (too weak; see Figure 3C). In Figures 3A to 3C, the ongoing (the circles) and stimulus-induced (the triangles) PYC activities are shown for the E-E/E-I (left) and D-E/E-I (center) neuromodulatory systems. The evoked-to-background activity ratio, which we call a signal-to-noise (S/N) ratio in a practical sense, is indicated by circles and triangles (right) for the E-E/E-I and D-E/E-I systems, respectively.
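The evoked-to-background ratio used here can be computed directly from a spike raster. A small sketch with a synthetic raster follows; the rates and the stimulus window are illustrative stand-ins, not the model's actual output:

```python
import numpy as np

def snr(spikes, stim_window, dt=0.001):
    """Evoked-to-background ("S/N") ratio from a binary spike raster.

    spikes      : (n_neurons, n_bins) array of 0/1 spikes in 1-ms bins
    stim_window : (start_bin, end_bin) of the stimulus presentation
    """
    s0, s1 = stim_window
    mask = np.zeros(spikes.shape[1], dtype=bool)
    mask[s0:s1] = True
    evoked = spikes[:, mask].mean() / dt        # stimulus-induced rate (Hz)
    background = spikes[:, ~mask].mean() / dt   # ongoing rate (Hz)
    return evoked / background

# toy raster: ~20 Hz background, ~80 Hz during the stimulus (bins 2000-4000)
rng = np.random.default_rng(1)
raster = (rng.random((10, 10000)) < 0.02).astype(float)
raster[:, 2000:4000] = (rng.random((10, 2000)) < 0.08).astype(float)
ratio = snr(raster, (2000, 4000))
```

With these toy rates the ratio comes out near 4, that is, an evoked rate about four times the background rate.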
Note that the background neuronal activity itself contains significant information as internal representations in the dynamics of the network. Hence, we cannot straightforwardly refer to the background activity as noise, and we should therefore use the term evoked-to-background activity ratio rather than S/N ratio in a strict sense. The reason we used the term S/N instead of evoked-to-background is to relate the simulation results to experimental (neurophysiological) observations (Waterhouse & Woodward, 1980; Kasamatsu & Heggelund, 1982; McCormick, 1989; Gu, 2002; Leventhal et al., 2003), in which the evoked-to-background activity ratio was preferentially called the S/N ratio. Nevertheless, this use of S/N ratio is unusual in engineering, where noise is assumed not to carry any significant information (or signal). We understand that the term signal-to-noise ratio in experimental neuroscience might be used in a more practical sense than in engineering. In this study, we employed these terms (signal, noise, and S/N ratio) in their neurophysiological use, not their engineering use. Under the E-E/E-I neuromodulatory system, the ongoing PYC activity is gradually enhanced and then depressed as [NE] increases (see the circles at the left of Figures 3A to 3C). The increase in PYC activity at lower [NE]
Figure 3: Neuronal and cognitive behavior of PYCs. The model is presented with a feature stimulus of (A) strong, (B) weak, and (C) too weak intensity. In each figure, the left shows the ongoing firing rate (the circles) and stimulus-induced firing rate (the triangles) of a PYC operated under the E-E/E-I system. The center is for the D-E/E-I system. The right shows S/N ratios for the E-E/E-I (the circles) and the D-E/E-I (the triangles) systems. Regions marked by I, II, and III indicate where three distinct types of S/N enhancement take place.
indicates that excitatory synaptic enhancement prevails over inhibitory synaptic enhancement, while the decrease in PYC activity at higher [NE] indicates that inhibitory synaptic enhancement prevails over excitatory synaptic enhancement. Under the D-E/E-I neuromodulatory system, the ongoing activity is monotonically depressed as [NE] increases (see the circles at the center of Figures 3A to 3C). This indicates that inhibitory synaptic enhancement prevails at all levels of [NE]. For stronger stimuli (see the right of Figure 3A), the S/N ratio is enhanced at an intermediate level of [NE] ([NE] = ∼1.0) under the E-E/E-I system (circles) and at a higher level of [NE] ([NE] = ∼1.75) under the D-E/E-I system (triangles). In both systems, stimulus-induced PYC activity is progressively depressed at [NE] > ∼1.0 (see the triangles at the left and center of Figure 3A), where S/N enhancement takes place provided that noise (background PYC activity) is reduced more than signal (evoked PYC activity). This result implies that noise reduction is as effective as signal enhancement for improving the S/N ratio and is consistent with experimentally observed results (Moises et al., 1979; Waterhouse & Woodward, 1980; Sessler et al., 1988). For weaker stimuli (see the right of Figure 3B), the S/N ratio is enhanced at lower levels of [NE] under the E-E/E-I system (the circles) and is not likely to be enhanced under the D-E/E-I system (the triangles). Figure 3C (right) shows little S/N enhancement under either system for stimuli that are too weak. The stronger the excitatory connections between PYCs, the more easily the information about an applied stimulus is picked up against noise. The higher S/N ratio at [NE] = 1.0 than at [NE] = 2.0 under the E-E/E-I system (see the circles at the right of Figure 3A) arises from the inverted-U profile (see the solid line in Figure 1C).
That is, the strength of the excitatory connections reaches a maximal value at [NE] = 1.0 and progressively decreases as [NE] further increases. The proposed inverted-U profile is based on experimental results (Waterhouse, Devilbiss, et al., 1998; Waterhouse, Moises, et al., 1998; Devilbiss & Waterhouse, 2000), in which glutamate-evoked discharges of cortical neurons were progressively facilitated to a maximum as [NE] increased and then progressively depressed as [NE] further increased. The increase in discharges at lower NE concentration arises from the activation of α-1 adrenergic receptors, and the nonspecific activation of β or α-2 adrenergic receptors is responsible for the decrease in discharges at higher NE concentration (Devilbiss & Waterhouse, 2000). An interesting finding is that there are three possible schemes for S/N enhancement: (1) signal enhancement greater than noise increase (see region I of Figure 3), (2) signal enhancement together with noise reduction (region II), and (3) noise reduction greater than signal decrease (region III). To investigate whether these schemes have any functional significance in NE-induced S/N enhancement, we varied the stimulus intensity and the dose level of NE ([NE]) and observed S/N changes from that of the control state ([NE] = 0.0). The E-E/E-I system (see Figure 4A) preferentially
enhances the S/N ratio for weaker stimuli with lower doses of NE (see the arrow), whereas the D-E/E-I system (see Figure 4B) preferentially enhances the S/N ratio for stronger stimuli with higher doses of NE (see the arrow). These results, together with those of Figure 3, may provide an important conclusion: scheme 2, signal enhancement together with noise reduction, is quite effective for detecting weaker stimuli under the E-E/E-I system, where lower doses of NE greatly improve the S/N ratio. When a stronger stimulus is applied, scheme 3, noise reduction greater than signal decrease, operates to detect the stimulus under the D-E/E-I system, where higher doses of NE greatly improve the S/N ratio. It is an interesting finding that the network can effectively enhance the S/N ratio for weaker stimuli provided that the network employs the E-E/E-I neuromodulatory system and NE is dosed into the network at lower levels (see the arrow in Figure 4A). To investigate its neuronal mechanisms, we analyzed the dynamics of the ongoing and stimulus-induced PYC activities when the network was dosed with a low level of NE, [NE] = 0.5. Figure 4C shows the membrane potential of a PYC under the ongoing state, on which its action potentials are overlaid. Random brief bursts characterize the ongoing PYC activity. Figure 4D shows a power spectrum (left) and a cross-correlation function (right) of action potentials between PYCs belonging to the same cell assembly (F2) under the ongoing state. The peaks (see the arrows) at ∼60 Hz (left) and ∼0.0 s (right) indicate that the ongoing random brief bursts are synchronized γ-oscillations among PYCs within cell assemblies. Figure 4E shows a power spectrum (left) and a cross-correlation function (right) for the same PYCs when stimulated by F2 with a weak intensity (ε = 0.05).
The significant increases in these peak values over those of Figure 4D indicate that the S/N enhancement (see the arrow of Figure 4A) has been established through tightly synchronized γ-oscillations among PYCs within the cell assembly corresponding to the stimulus. If the network employs the D-E/E-I neuromodulatory system, such synchronized brief bursts do not emerge under the ongoing state, as shown in Figures 4F and 4G. There is no significant enhancement in power and
Figure 4: (A–B) Dependence of S/N enhancement on stimulus intensity and the dose level of NE, [NE]. (A) Neuromodulation operated under the E-E/E-I system. (B) Neuromodulation operated under the D-E/E-I system. An arrow indicates a peak in S/N enhancement. (C–H) Spectral and cross-correlation analyses of the activity of a PYC for the E-E/E-I (C–E) and D-E/E-I (F–H) neuromodulatory systems. (C, F) The membrane potential of a PYC under the ongoing state where [NE] = 0.5, on which its action potentials are overlaid. (D, G) The power spectrum (left) and cross-correlation function (right) of action potentials under the ongoing state. (E, H) The power spectrum (left) and cross-correlation function (right) of action potentials under the stimulus-induced state. ε = 0.05.
cross-correlation even when the stimulus is presented (see Figure 4H). Note that under the D-E/E-I system, this condition (ε = 0.05, [NE] = 0.5) cannot enhance the S/N ratio (see Figure 4B), whereas the E-E/E-I system can enhance it (see the arrow in Figure 4A). This result may indicate that synchronized γ-oscillations within cell assemblies under the ongoing state are essential for the enhancement of the S/N ratio for weaker (or subthreshold) stimuli. As demonstrated above, synchronized brief γ-bursts characterize the ongoing PYC activity controlled under the E-E/E-I neuromodulatory system. To investigate the neuronal relevance of the γ-bursts in S/N enhancement, we measured power spectra with various stimulus intensities and dose levels of NE. Under the ongoing state (see Figure 5A), the γ-oscillations gradually increase, reach a maximum, and then decrease as [NE] increases further (from top to bottom). The emergence of γ-bursts is limited to a certain range of [NE], ∼0.25 < [NE] < ∼0.75. Figure 5B shows how the γ-power is enhanced when the network is stimulated with a weak (ε = 0.05) stimulus. The γ-power is greatly increased at ∼0.25 < [NE] < ∼0.75. Note that the maximal peak at [NE] = ∼0.5 corresponds to the maximal signal enhancement (see the triangles at the left of Figure 3B). When a strong (ε = 0.3) stimulus is presented to the network, there is no significant increase in power at a specific frequency (see Figure 5C). Instead, a wide range of frequencies (∼3–70 Hz) is augmented. This indicates that S/N enhancement for stronger stimuli is established through increasing all the oscillatory (frequency) components that appear under the ongoing state. It might be that the ongoing γ-oscillatory state is a ready neuronal state preparing for a subsequent sensory input, especially for weaker or subthreshold stimuli, in which information about the learned sensory features (F1–F5) is frequently accessed as γ-bursts.
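The spectral measure used above (power concentrated in the γ band) can be sketched with a plain FFT. The rhythmic and Poisson trains below are toy stand-ins for the model's ongoing activity, with assumed rates and an assumed 60-Hz modulation:

```python
import numpy as np

def band_power(spike_train, fs=1000.0, band=(40.0, 70.0)):
    """Summed power of a binary spike train within a frequency band (γ by default)."""
    x = spike_train - spike_train.mean()
    psd = np.abs(np.fft.rfft(x)) ** 2 / len(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    sel = (freqs >= band[0]) & (freqs <= band[1])
    return psd[sel].sum()

# toy trains: bursts locked to a 60-Hz rhythm vs. a Poisson train of equal mean rate
rng = np.random.default_rng(2)
t = np.arange(10000) / 1000.0                      # 10 s at 1-ms resolution
p_rhythmic = 0.05 * (1 + np.sin(2 * np.pi * 60 * t))
rhythmic = (rng.random(t.size) < p_rhythmic).astype(float)
poisson = (rng.random(t.size) < 0.05).astype(float)
```

Despite equal mean rates, the rhythmically modulated train carries substantially more power in the 40–70 Hz band, which is the signature the spectra in Figure 5 are read for.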
It is well known that cortical neurons can respond with direct excitation or depression to iontophoretically applied norepinephrine (Johnson, Roberts, & Straughan, 1969). NE can influence neuronal behavior through direct actions on potassium channels, hyperpolarizing the cells (Sarvey, 1988). These influences roughly correspond to a reparameterization of equation 2.5. To answer the question of how such direct action interacts with synaptic modulation in controlling the cortical S/N ratio, we made an additional simulation. We varied the threshold value (θPY of equation 2.5) and the NE concentration ([NE]), recorded the ongoing and stimulus-induced activities of PYCs, and calculated the S/N ratio for the E-E/E-I (see Figure 6A) and D-E/E-I (see Figure 6B) systems. For simplicity, we assumed that θPY is independent of NE concentration. Note that the parameter θPY manipulates the firing probability of PYCs (see equation 2.5). The S/N ratio is maximal at around (θPY, [NE]) = (∼0.05, ∼0.5) for the E-E/E-I system (see the shaded region of Figure 6A). For the D-E/E-I system (see Figure 6B), the S/N ratio can be enhanced with smaller doses (or even without any dose) of NE (see the filled arrow) provided that a higher value
Figure 5: Spectral analysis of the PYC activity operated under the E-E/E-I neuromodulatory system. (A) Power spectra under the ongoing state. (B) Power spectra under the stimulus-induced state, where a weak stimulus (ε = 0.05) is presented. (C) Power spectra under the stimulus-induced state, where a strong stimulus (ε = 0.3) is presented. [NE] is increased from 0.0 (top) to 1.0 (bottom).
Figure 6: Dependence of the S/N ratio on the threshold value of the sigmoid function (θPY of equation 2.5) and the dose level of NE ([NE]) operated (A) under the E-E/E-I system, where a weak stimulus (ε = 0.05) is applied, and (B) under the D-E/E-I system, where a strong stimulus (ε = 0.3) is applied.
is taken for θPY. When θPY is low, the S/N ratio can be enhanced with greater doses of NE (see the open arrow). These simulation results indicate that a combinatorial modulation of firing probability and synaptic efficacies may operate in controlling the cortical S/N ratio.

3.2 Possible Interaction Between E-E/E-I and D-E/E-I Neuromodulatory Systems. The main issue of this study was to investigate how the two distinct types of neuromodulatory systems, the E-E/E-I (enhanced-excitatory/enhanced-inhibitory) system and the D-E/E-I (depressed-excitatory/enhanced-inhibitory) system, work in processing neuronal information in the cortex. We showed in Figures 4A and 4B that a major difference between the two systems is in their operating range. That is, the E-E/E-I system (see Figure 4A) preferentially enhances the S/N ratio for weaker stimuli with lower doses of NE, whereas the D-E/E-I system (see Figure 4B) preferentially enhances the S/N ratio for stronger stimuli with higher doses of NE. Two distinct types of NE-mediated cortical responses, enhanced and depressed glutamate-evoked responses, have been observed within the same cortical area, for example, the rat barrel cortex (Devilbiss & Waterhouse, 2000). This led us to hypothesize that the two neuromodulatory modes (E-E/E-I and D-E/E-I) may interact. A simple hypothetical neural network model into which the two modes are integrated is shown in Figure 7A. The PYCs of the E-E/E-I system (the solid circles) receive enhanced-excitatory synaptic input (the solid lines), and those of the D-E/E-I system (the dotted circles) receive depressed-excitatory synaptic input (the dotted lines). We investigate how the S/N ratio changes as a function of stimulus intensity and NE concentration ([NE]). An enhanced neuron, that is, a PYC whose activity is enhanced by NE application, shows
Figure 7: (A) A hypothetical neural network model into which the two neuromodulatory modes (E-E/E-I, D-E/E-I) are integrated. The PYCs of the E-E/E-I system (the solid circles) receive enhanced synaptic input (the solid lines), and those of the D-E/E-I system (the dotted circles) receive depressed synaptic input (the dotted lines) from other PYCs. (B–C) Dependence of S/N enhancement on stimulus intensity and the dose level of NE ([NE]) operated under the integrated neuromodulatory system. (B) S/N change of an enhanced neuron, or a PYC whose activity is to be enhanced by NE application. (C) S/N change of a depressed neuron, or a PYC whose activity is to be depressed (suppressed) by NE application. An arrow indicates a peak in S/N enhancement.
optimal S/N enhancement for weaker stimuli and lower [NE] (see the arrow of Figure 7B). This result is similar to that of Figure 4A (see the arrow). An interesting finding might be that a depressed neuron, or a PYC whose activity is to be depressed by NE application, shows bimodal S/N enhancement (see the arrows of Figure 7C). That is, the S/N ratio can be enhanced not only for stronger stimuli (see the open arrow) as in the original D-E/E-I system (see the arrow in Figure 4B) but also for weaker stimuli (see the filled arrow). This recruitment of the depressed neurons for the detection of weaker stimuli, which is not obvious in the original D-E/E-I system (see
Figure 4B), might be advantageous for the brain to cope with a variety of sensory stimuli. It might be interesting to note that the profile of the S/N enhancement of the depressed neuron for weaker stimuli (see the filled arrow in Figure 7C) is quite similar to that of the enhanced neuron (see the arrow in Figure 7B). To answer why the depressed neuron behaves like the enhanced neuron, we calculated cross-correlation functions of action potentials between a pair of an enhanced and a depressed neuron under the ongoing state (see Figure 8). The level of synchronization between the two neurons is progressively enhanced as [NE] increases (A → B → C → D), reaches a maximum at [NE] = 1.0 (E), and is then depressed (F → G → H → I). The synchronized activity implies that these neurons behave like members of the same cell assembly, and thus the depressed neuron can respond in the same way as the enhanced neuron. The weaker synchronization at higher [NE] ([NE] > ∼1.75; H, I) implies that the depressed neuron is segregated from the cell assembly.

This result raises an important question: why is the level of synchrony between the enhanced and depressed neurons increased even though the efficacies of excitatory connections from the enhanced to depressed neurons are weakened as [NE] increases? Synchronization between enhanced neurons plays an important role in synchronizing the depressed neuron with the enhanced neurons. That is, the depressed neuron can effectively be activated if it receives coincidental inputs from the enhanced neurons. Tight synchronization among enhanced neurons, which can be established at suitable NE concentrations, ∼0.25 < [NE] < 1.5, could provide such coincidental firings between the enhanced neurons within cell assemblies (not shown). 
The coincidental firings effectively activate the depressed neurons so that the depressed neurons can be tightly synchronized with the enhanced neurons, regardless of the NE-mediated depression of the excitatory synaptic connections from the enhanced to depressed neurons.

3.3 Dependence of Ongoing γ-Oscillatory Behavior on Neuron and Network Parameters. In this section, we show how sensitive the dynamic behavior of the PYCs is to the parameters of the modeled neurons and network, on the basis of which we determined the suitable parameter values presented in sections 2.2 and 2.3. Because, as demonstrated in Figures 4 and 5, the ongoing γ-bursts controlled under the E-E/E-I system played an essential role in enhancing the S/N ratio for weaker stimuli, we focus here on how the ongoing γ-oscillatory behavior is influenced by these parameters.

Figure 9 presents the sensitivity of the dynamic behavior of PYCs to their membrane time constant. Under the ongoing state governed by [NE] = 0.5, γ-bursts take place in a range of τPY = ∼5.0–50.0 ms, where the peak frequency tends to increase as τPY decreases (e.g., see the top right of Figure 9A). Such a frequency shift presumably arises from an increase in mutual excitation among PYCs within cell assemblies, which is
Figure 8: Cross-correlation functions of action potentials between a pair of an enhanced and a depressed neuron (PYC) under the ongoing state, where the E-E/E-I and D-E/E-I neuromodulatory systems interact (see Figure 7A). [NE] is increased from 0.0 (A) to 2.0 (I).
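The synchrony measure reported in Figure 8 can be sketched as a normalized cross-correlation of two binned spike trains. The bin width, spike probabilities, and shared-drive construction below are illustrative assumptions rather than the paper's simulation; the point is only that common input produces a central peak near zero lag.

```python
import numpy as np

def cross_correlation(x, y, max_lag):
    """Normalized cross-correlation of two binned spike trains
    for lags from -max_lag to +max_lag (in bins)."""
    x = x - x.mean()
    y = y - y.mean()
    n = len(x)
    denom = n * x.std() * y.std()
    lags = np.arange(-max_lag, max_lag + 1)
    cc = np.empty(len(lags))
    for i, lag in enumerate(lags):
        if lag >= 0:
            cc[i] = np.dot(x[:n - lag], y[lag:]) / denom
        else:
            cc[i] = np.dot(x[-lag:], y[:n + lag]) / denom
    return lags, cc

# Illustrative spike trains: two neurons sharing a common drive
# (partial synchrony), each with additional independent spikes.
rng = np.random.default_rng(0)
drive = rng.random(2000) < 0.05
x = (drive | (rng.random(2000) < 0.02)).astype(float)
y = (drive | (rng.random(2000) < 0.02)).astype(float)

lags, cc = cross_correlation(x, y, max_lag=50)
peak_lag = lags[np.argmax(cc)]   # shared drive yields a peak near lag 0
```

Increasing or decreasing the shared-drive probability plays the role of the [NE]-dependent synchrony in Figure 8: more common drive, higher central peak.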
mediated through the coincidental firing of the PYCs at such short values of τPY. For τSB (see Figure 9B) and τLB (see Figure 9C), γ-bursts take place in a range of ∼1 to 100 ms. Figures 9D, 9E, and 9F present how the membrane potentials of PYCs change as these membrane time constants increase under the ongoing γ-oscillatory state, where AMP denotes the average membrane potential of PYCs. Note that it is advantageous for the PYCs to oscillate at a subthreshold level for action potential generation, because such a subthreshold oscillation in membrane potentials might enable the PYCs to respond rapidly to sensory stimulation. These parameter values (τPY = 50 ms, τSB = 10 ms, and τLB = 50 ms) were suitable for generating such subthreshold γ-oscillations under the ongoing state.

Figure 10 presents the sensitivity of the dynamic behavior of PYCs to the original synaptic strength between PYCs (w0PY,PY of equation 2.7; see Figure 10A), the strength from SBC to PYC (w0PY,SB of equation 2.9; see Figure 10B), and the strength from LBC to PYC (w0PY,LB of equation 2.8; see Figure 10C). Under the ongoing state governed by [NE] = 0.5, γ-bursts take place in a range of w0PY,PY = ∼7.0–9.0 (see Figure 10A). For w0PY,SB (see Figure 10B) and w0PY,LB (see Figure 10C), γ-bursts take place in a range of ∼10.0 to 30.0 and ∼0.01 to 1.0, respectively. Figures 10D, 10E, and 10F present how the membrane potentials of PYCs change as these synaptic strengths increase. The existing parameter values, w0PY,PY = 7.0, w0PY,SB = 30.0, and w0PY,LB = 0.1, were suitable for generating subthreshold γ-oscillations in membrane potentials of PYCs under the ongoing state.

Figure 11 presents the sensitivity of the dynamic behavior of PYCs to the threshold (see Figure 11A) and steepness (see Figure 11B) of the sigmoid function (see equation 2.5), and to the number of cell assemblies (see Figure 11C). Under the ongoing state governed by [NE] = 0.5, γ-bursts take place in a range of θPY = ∼0–0.1 (see Figure 11A) and ηPY = ∼8.0–10.5 (see Figure 11B). Figures 11D and 11E present how the membrane potentials of PYCs change as these parameter values increase. The present values, θPY = 0.1 and ηPY = 10.2, were suitable for generating subthreshold γ-oscillations in membrane potentials of PYCs under the ongoing state. 
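The power spectra used in Figures 9 through 11 to detect ongoing γ-bursts can be sketched with a discrete Fourier transform of a binned spike train. The synthetic 40 Hz rate-modulated spike train below is an illustrative assumption, not output of the model.

```python
import numpy as np

def spike_power_spectrum(spikes, dt):
    """Power spectrum of a binned spike train (bin width dt, in seconds)."""
    spikes = spikes - spikes.mean()            # remove the DC component
    power = np.abs(np.fft.rfft(spikes)) ** 2
    freqs = np.fft.rfftfreq(len(spikes), d=dt)
    return freqs, power

# Illustrative ongoing activity: spiking locked to a 40 Hz (gamma) rhythm.
dt = 0.001                                     # 1 ms bins
t = np.arange(0, 5.0, dt)                      # 5 s of activity
rate = 20.0 * (1.0 + np.cos(2 * np.pi * 40 * t))   # 40 Hz rate modulation
rng = np.random.default_rng(1)
spikes = (rng.random(len(t)) < rate * dt).astype(float)

freqs, power = spike_power_spectrum(spikes, dt)
band = (freqs > 5) & (freqs < 100)
peak_freq = freqs[band][np.argmax(power[band])]    # expect a peak near 40 Hz
```

A clear band-limited peak of this kind is what the right-hand panels of Figures 9 to 11 report; its disappearance marks parameter ranges where the ongoing γ-bursts are lost.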
We compare this result with a neural network model (Keeler, Pichler, & Ross, 1989) that investigated the effects of signal-to-noise suppression (Waterhouse & Woodward, 1980; Foote, Bloom, & Aston-Jones, 1983). The model consists of McCulloch-Pitts neurons. These neurons receive weighted inputs plus noise, and their outputs are determined by a gain function such as a step or sigmoid function. The researchers have suggested that by
Figure 9: Sensitivity of the dynamic behavior of a PYC to membrane time constants under the ongoing state, where the E-E/E-I system operates and [NE] = 0.5. (A) Sensitivity to the membrane time constant of PYCs (τPY ). Left: The membrane potential of the PYC on which its action potentials are overlaid. Right: The power spectrum of the action potentials of the PYC. (B) Sensitivity to the membrane time constant of SBCs (τSB ). (C) Sensitivity to the membrane time constant of LBCs (τLB ). (D–F) Average membrane potentials for (A–C).
Figure 10: Sensitivity of the dynamic behavior of a PYC to the original synaptic strengths under the ongoing state, where the E-E/E-I system operates and [NE] = 0.5. (A) Sensitivity to the original synaptic strength between PYCs (w0PY,PY ). Left: The membrane potential of the PYC on which its action potentials are overlaid. Right: The power spectrum of the action potentials of the PYC. (B) Sensitivity to the original synaptic strength from SBC to PYC (w0PY,SB ). (C) Sensitivity to the original synaptic strength from LBC to PYC (w0PY,LB ). (D–F) Average membrane potentials for A–C.
Figure 11: Sensitivity of the dynamic behavior of a PYC to the neuron and network parameters under the ongoing state, where the E-E/E-I system operates and [NE] = 0.5. (A) Sensitivity to the threshold value (θPY ) of the sigmoid function (see equation 2.5). Left: The membrane potential of the PYC on which its action potentials are overlaid. Right: The power spectrum of the action potentials of the PYC. (B) Sensitivity to the steepness (ηPY ) of the sigmoid function. (C) Sensitivity to the number of cell assemblies (n). (D–F) Average membrane potentials for A–C.
increasing the concentration of NE, the noise might be effectively decreased, and thus the signal-to-noise ratio can be enhanced. The level of noise is analogous to the temperature of the Boltzmann machine type of neural network, and noise reduction can be achieved by decreasing the temperature (Sejnowski, 1981; Hinton & Sejnowski, 1986). Since a decrease in temperature corresponds to an increase in steepness of the sigmoid function in the model (see equation 2.5), the simulation result shown in Figure 11B may support their results. That is, the background activity (or noise) is reduced as the steepness increases (e.g., see the first through third traces from the top of Figure 11B), and thus the signal-to-noise ratio could be enhanced. Note that an excessive increase of ηPY should be avoided; otherwise, the ongoing γ-bursts tend to disappear (e.g., see the third trace from the top of Figure 11B).

To investigate whether the ongoing γ-oscillations are sensitive to network size, we increased the number of neurons that participate in processing sensory information. New dynamic cell assemblies were created to express additional information about new sensory features. As shown in Figure 11C, similar ongoing oscillatory behaviors are available, in which five (top; the original model), seven (middle), and ten (bottom) dynamic cell assemblies were, respectively, employed for encoding five (n = 5), seven (n = 7), and ten (n = 10) sensory features. Note that the number of neurons constituting each cell assembly is the same: 10 neurons. Although the overall ongoing γ-oscillatory behavior is less influenced by the number of cell assemblies, the membrane potentials of PYCs tend to decrease as the number of cell assemblies increases (see Figure 11F). Such a decrease in membrane potentials implies that the ongoing state moves away from the threshold for action potential generation, and therefore the response property of the PYCs would deteriorate. 
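The steepness-temperature analogy can be illustrated with a single sigmoid unit: sharpening the gain (higher steepness, i.e., lower "temperature") suppresses the response to background noise more than the response to signal plus noise, raising the evoked-to-background ratio. The sigmoid mirrors the functional form of equation 2.5, but the numerical values of eta, theta, the noise level, and the signal are illustrative assumptions.

```python
import numpy as np

def sigmoid(u, eta, theta):
    """Sigmoid gain with steepness eta and threshold theta
    (the same functional form as equation 2.5)."""
    return 1.0 / (1.0 + np.exp(-eta * (u - theta)))

rng = np.random.default_rng(2)
noise = 0.05 * rng.standard_normal(10000)   # background (noise-only) input
signal = 0.3                                # stimulus-driven input; illustrative

def evoked_to_background(eta, theta=0.1):
    background = sigmoid(noise, eta, theta).mean()
    evoked = sigmoid(signal + noise, eta, theta).mean()
    return evoked / background

# Sharpening the gain suppresses the noise-only response more than the
# signal response, so the evoked-to-background (S/N) ratio grows.
low = evoked_to_background(eta=2.0)
high = evoked_to_background(eta=10.0)
```

As the surrounding text notes, this gain sharpening cannot be pushed arbitrarily far in the full network, where excessive ηPY abolishes the ongoing γ-bursts.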
Concerning the rates of change of weights under NE modulation (αPY, αSB, and αLB of equations 2.7 to 2.10), we have obtained a tendency in network behavior (not shown) similar to that in Figure 10: the membrane potentials of PYCs under the ongoing γ-oscillatory state decrease as αPY increases, increase as αSB increases, and decrease as αLB increases. The parameter values αPY = 3.0, αSB = 60.0, and αLB = 0.8 were suitable for generating subthreshold γ-oscillations in membrane potentials of PYCs under the ongoing state. Note that the parameter values βPY = βSB = βLB = 1.0 (see equations 2.7 to 2.10) determine the transient property of synaptic modulation. In the simulations, we applied NE at a constant concentration ([NE] = constant), dosed at time = 1000 ms and maintained throughout the simulations. To collect data from cell recordings under stationary states of the weight modulation, we discarded the initial data (time = 0–5000 ms); given the small values of βPY, βSB, and βLB, this period was long enough to exclude transient neuronal behavior from the collected data.

We next investigate how neuronal behavior is altered if the shapes of the modulation rates (see Figures 1C and 1D) are changed. We changed the
Figure 12: Dependence of ongoing γ -oscillatory behavior on the shape of modulation rate for the excitatory connections between PYCs operated under the E-E/E-I system. (A) Shapes of modulation rate. (B) Power spectra of action potentials of a PYC. (C) Average membrane potentials of PYCs.
shape of the inverted-U type of modulation as indicated by 1, 2, and 3 in Figure 12A and recorded the ongoing activity of a PYC, whose power spectra are shown in Figure 12B. Ongoing γ-oscillations are available for shapes 1 and 2 but not for shape 3. Figure 12C presents the average membrane potentials of PYCs under the two ongoing γ-oscillatory states. The average membrane potential is lowered if the modulation is accelerated (shape 1) relative to the original condition (shape 2). Figure 13B presents how neuronal behavior is altered if the monotonic type of modulation (see Figure 13A) for inhibitory connections from SBC to PYC is accelerated (row 1) or decelerated (row 3) relative to the original modulation rate (row 2). If the modulation rate is accelerated (row 1),
ongoing γ-oscillations disappear (see Figure 13B). We have observed ongoing γ-oscillations for rows 2 and 3. The deceleration of the modulation rate (row 3) results in decreased membrane potentials (see Figure 13C). The original condition (row 2) provides a suitable ongoing network state, with ongoing γ-bursts in PYC activity. Figure 13E presents how neuronal behavior is altered if the monotonic type of modulation (see Figure 13D) for the inhibitory connections from LBC to PYC is accelerated (graph 1) or decelerated (graph 3) relative to the original one (graph 2). In this case, the shape of the modulation rate has less influence on the γ-oscillatory behavior (see Figure 13E). Although an increase in membrane potentials (see Figure 13F) caused by deceleration of the modulation rate appears to be advantageous (i.e., moving the state of the network toward the threshold), it should be restricted. That is, a further increase in membrane potentials from the original condition (graph 2) results in an increase in the ongoing firing rate of PYCs and thus in a decrease in the S/N ratio (not shown).

4 Discussion

A variety of S/N enhancements have been reported. For example, NE enhances stimulus-induced neuronal activity while keeping ongoing activity almost unchanged (Waterhouse & Woodward, 1980), or NE suppresses ongoing neuronal activity more than stimulus-induced neuronal activity (Moises et al., 1979; Waterhouse & Woodward, 1980; Sessler et al., 1988). This study has shown that a balance between excitatory and inhibitory synaptic modulation controlled under NE neuromodulatory systems is crucial for such a variety of ways of S/N enhancement. Depending on external circumstances, the brain may adopt the most appropriate NE neuromodulatory system and scheme among possible candidates—E-E/E-I or D-E/E-I, and signal enhancement more than noise increase, signal enhancement and noise reduction, or noise reduction more than signal decrease—for improving the cognitive performance of the cortex. 
Waterhouse and colleagues (Waterhouse et al., 1988; Waterhouse, Azizi, Burne, & Woodward, 1990; Mouradian et al., 1991) have demonstrated that NE can facilitate cortical neuronal responses to both subthreshold and
Figure 13: (A–C) Dependence of the ongoing γ -oscillatory behavior on the shape of modulation rate for the inhibitory connections from SBC to PYC operated under the E-E/E-I system. (A) Shapes of modulation rate. (B) Power spectra of action potentials of a PYC. (C) Average membrane potentials of PYCs. (D–F) Dependence of ongoing γ -oscillatory behavior on the shape of modulation rate for the inhibitory connections from LBC to PYC. The other conditions are the same as those of A–C.
suprathreshold excitatory stimuli. I have shown here that weaker (or subthreshold) stimuli can be recognized if the E-E/E-I neuromodulatory system operates with a lower dose of NE (see the arrow in Figure 4A). For suprathreshold (or stronger) stimuli, the D-E/E-I system may operate with a higher dose of NE (see the arrow in Figure 4B).

Concerning ongoing neuronal activity, it is well known that cortical neurons are not silent but active, emitting action potentials spontaneously even without external stimulation. This self-generative activity is called ongoing, spontaneous, or background neuronal activity. A recent experiment (Engel et al., 2001) has demonstrated that self-generated ongoing neuronal activity has a distinct spatiotemporal patterning, which leads to the temporal coordination of input-triggered responses and to their binding into synchronous dynamic cell assemblies. Arieli and colleagues (Arieli et al., 1995, 1996) have demonstrated that preceding ongoing neuronal activity is reflected in the activity of cell assemblies responding to subsequent sensory stimulation and suggested that ongoing neuronal activity might play an essential role in cortical information processing.

We have expressed such an ongoing neuronal state here by random transitions between dynamic cell assemblies. Although the exact dynamic structure of the ongoing state in the cortex is still largely unknown, Tsodyks and colleagues (Tsodyks, Kenet, Grinvald, & Arieli, 1999) have suggested that in an ongoing state, in which no sensory stimulus is present, a cortical network wanders through various states expressed by the synchronous firings of different cell assemblies. When a stimulus is presented, it quickly pushes the network from the current state into a preferred state, where the specific cell assembly representing information about the applied stimulus is selectively activated. The random transitions between dynamic cell assemblies may correspond to such a wandering state. 
When stimulated with a sensory feature, the network is driven from the wandering state to a preferred state in which the dynamic cell assembly corresponding to the stimulus is selectively activated. We have shown that the neural network can effectively respond to weaker stimuli if brief γ-bursts are involved in ongoing neuronal activity. Fast oscillations such as γ-oscillations (40–70 Hz) are believed to play important roles in cognitive processing in the brain. Pulvermüller and colleagues (Pulvermüller et al., 1999) have suggested that cortical oscillations at higher frequencies reflect the presence of learned associative representations. The dynamic cell assemblies could express such learned associative representations as well, because these assemblies can be created through self-organized (Hebbian) learning processes (Hoshino, Kashimori, & Kambara, 1998; Hoshino, Inoue, Kashimori, & Kambara, 2001; Hoshino, Zheng, & Kuroiwa, 2002). We suggest that such learned representations may be frequently accessed under the ongoing state through the transient formation of each dynamic cell assembly, which is coordinated by synchronous γ-bursts among pyramidal cells.
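The Hebbian mechanism invoked here can be sketched in its simplest correlational form, in which repeatedly co-activated units develop strong mutual weights and thereby carve out a cell assembly. The learning rate, activity pattern, and update rule below are a generic illustration, not the learning rule of the cited models.

```python
import numpy as np

# Units 0-4 repeatedly co-activate for one sensory feature; a plain
# correlational Hebbian update strengthens their mutual weights.
n_units = 10
pattern = np.zeros(n_units)
pattern[:5] = 1.0                  # the co-active subset

w = np.zeros((n_units, n_units))
epsilon = 0.1                      # illustrative learning rate
for _ in range(20):                # repeated presentations of the feature
    w += epsilon * np.outer(pattern, pattern)
np.fill_diagonal(w, 0.0)           # no self-connections

within = w[:5, :5][~np.eye(5, dtype=bool)].mean()   # inside the assembly
between = w[:5, 5:].mean()                           # assembly to outsiders
```

The strong within-assembly weights are what allow the assembly to form transiently, and synchronously, whenever the ongoing state visits it.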
We expressed information about sensory features by separate dynamic cell assemblies, where the neurons of each cell assembly are sensitive to a single feature; that is, the neurons are unimodal. However, neurons of the cortex are known to encode sensory information based on the so-called distributed coding scheme, in which neurons tend to participate in processing more than one sensory feature; that is, the neurons are multimodal. This implies that increasing the inhibition of other pyramidal cells would not help enhance the S/N ratio. For example, consider the neuronal architecture presented in Figure 14A. Suppose that the neurons within and across the three cell assemblies (A, B, C) are, respectively, connected through excitatory and inhibitory synapses. The neurons within subset 1 are sensitive to three different features (A-B-C); those within 2, 3, and 4 are sensitive to two different features (A-B, A-C, B-C); and those within 5, 6, and 7 are sensitive to one specific feature (A, B, C). This neuronal circuitry might provide a simple example of the distributed coding of sensory information. When feature A is applied as a sensory input, the neurons of subsets 1, 2, 3, and 5 are simultaneously activated, but the neurons of subset 5 then receive suppressive inputs through lateral inhibitory connections from 1-to-5, 2-to-5, and 3-to-5, and thus the S/N ratio would deteriorate. To overcome this problem, the specific neuronal circuitry shown in Figure 14B may work. The solid and dotted lines, respectively, denote excitatory and inhibitory connections between the subsets (1–7) of neurons. Note that neurons within subsets are connected through excitatory synapses. In
Figure 14: (A) A neuronal architecture based on the distributed coding scheme. Neurons within and across the three cell assemblies (A, B, and C) are, respectively, connected through excitatory and inhibitory synapses. The circled numbers 1–7 represent subsets of neurons. (B) A proposed neuronal circuitry for the distributed coding scheme. The solid and dotted lines, respectively, denote excitatory and inhibitory connections between subsets of neurons. Within subsets, neurons are connected through excitatory synapses.
this neuronal circuitry, the neurons of subset 5 are not suppressed but excited by subsets 1, 2, and 3 when stimulated with feature A, because the connections from 1-to-5, 2-to-5, and 3-to-5 are excitatory rather than inhibitory. Although the other (irrelevant) neurons (4, 6, and 7) tend to be activated via the multimodal neurons (1, 2, and 3), such excitation is counterbalanced by the lateral inhibition from 5-to-6, 5-to-4, 5-to-7, 2-to-7, and 3-to-6. Therefore, the deterioration of the S/N ratio would be prevented or lessened. We suggest that the proposed NE-induced enhancement in the S/N ratio might be applicable to more realistic cortical maps in which sensory information is expressed based on the distributed coding scheme, once their synaptic structure is properly organized.

It has been suggested that information propagation through long-range excitatory connections between neurons integrates widely distributed information over the visual field (Kobayashi et al., 2000) or mediates the internal representation of olfactory information in the piriform cortex (Hasselmo, Linster, Patil, Ma, & Cekic, 1997). NE application depresses these excitatory connections and therefore prevents the spread of excitatory activity (or information propagation), by which the influence of internal representation relative to afferent input is decreased. This implies that NE release makes the cortex change its attentional focus from internal representation to external representation, or outside stimuli, and thus enables the cortex to acquire new information. Our simulation results may support this notion. That is, the internal representation, which is expressed in the neural network presented here as ongoing synchronous neuronal activity within cell assemblies, disappears at higher NE concentrations (e.g., see Figure 2E). 
Even in such a desynchronized (ongoing) condition, synchronization in neuronal activity can easily be established when an external stimulus is presented (e.g., see the period of F2 stimulation in Figure 2E). In such a situation, a Hebbian learning process would reinforce (create) information about known (novel) features through the reorganization of the cortical map.

It seems that the rate-based S/N (or evoked-to-background) ratio is not enough for assessing information transmission in this neural network, because although the neurons responding to an applied feature stimulus carry a major part of the information about the stimulus, the other neurons irrelevant to the stimulus also play an important role in information transmission. To investigate this issue thoroughly, it is necessary to develop another measure. We propose sensory contrast, a measure defined as the evoked-to-suppressed activity ratio. This ratio of evoked neuronal activity (e.g., see the F2-sensitive neurons of Figure 2A) to suppressed neuronal activity (see the F1-, F3-, F4-, and F5-sensitive neurons) may transmit information about the applied feature (F2) as sensory contrast. The more strongly the neurons irrelevant to the applied feature are suppressed, the more the efficiency of information transmission could be enhanced in the network, increasing its sensory contrast. Since the suppressive property of these
irrelevant neurons is also sensitive to NE concentration (see Figures 2A to 2E), the efficiency of information transmission might be greatly influenced by NE. A detailed investigation of this issue will be our future work.
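The distinction between the rate-based S/N ratio and the proposed sensory contrast can be made concrete with hypothetical firing rates; all numbers below are illustrative assumptions, not values from the simulations.

```python
# Hypothetical firing rates (spikes/s); all values are illustrative only.
background_rate = 10.0      # ongoing activity before stimulation
evoked_rate = 40.0          # neurons sensitive to the applied feature (F2)
suppressed_rate = 4.0       # irrelevant neurons, pushed below background
                            # by lateral inhibition during the stimulus

snr = evoked_rate / background_rate               # rate-based S/N ratio
sensory_contrast = evoked_rate / suppressed_rate  # evoked-to-suppressed ratio

# Stronger suppression of the irrelevant neurons raises the sensory
# contrast even when the rate-based S/N ratio is unchanged.
```

Because NE controls how deeply the irrelevant neurons are suppressed, the two measures can diverge under NE modulation, which is the motivation for tracking sensory contrast separately.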
5 Conclusion

We proposed two distinct types of NE neuromodulatory systems: (1) an enhanced-excitatory and enhanced-inhibitory (E-E/E-I) system and (2) a depressed-excitatory and enhanced-inhibitory (D-E/E-I) system. These two systems modified ongoing background cortical activity and greatly influenced subsequent cognitive neuronal processing. We found three possible schemes for the S/N enhancement operating under the E-E/E-I system: (1) signal enhancement more than noise increase, (2) signal enhancement and noise reduction, and (3) noise reduction more than signal decrease. We found only scheme 3 for the D-E/E-I system. We suggest that depending on external circumstances, the brain may adopt the most appropriate NE neuromodulatory system and S/N enhancement scheme, with a suitable amount of NE release into the cortex, for improving its cognitive performance.

The E-E/E-I system effectively enhanced the S/N ratio for weaker stimuli with lower doses of NE, whereas the D-E/E-I system enhanced the S/N ratio for stronger stimuli with higher doses of NE. The neural network effectively responded to weaker stimuli if brief γ-bursts were involved in ongoing neuronal activity controlled under the E-E/E-I neuromodulatory system. We suggest that ongoing γ-oscillatory neuronal states might be essential for the brain to recognize weaker (subthreshold) sensory stimuli.

We proposed a hypothetical neural network model in which the E-E/E-I and the D-E/E-I systems coexist in the same cortical area and interact with each other. We found that depressed neurons, whose activity is to be depressed by NE application, have a bimodal property. That is, the S/N ratio can be enhanced not only for stronger stimuli, as in their original mode, but also for weaker stimuli, for which coincidental neuronal firings among enhanced neurons, whose activity is to be enhanced by NE application, are essential. 
We suggest that the recruitment of the depressed neurons for the detection of weaker (subthreshold) stimuli might be advantageous for the brain to cope with a variety of sensory stimuli.
Acknowledgments I am grateful to T. Kambara and Y. Kashimori for valuable discussions. I am also grateful to the anonymous referees for giving me valuable comments and suggestions on the earlier draft.
References

Arieli, A., Shoham, D., Hildesheim, R., & Grinvald, A. (1995). Coherent spatiotemporal patterns of ongoing activity revealed by real-time optical imaging coupled with single-unit recording in the cat visual cortex. J. Neurophysiol., 73, 2072–2093.
Arieli, A., Sterkin, A., Grinvald, A., & Aertsen, A. (1996). Dynamics of ongoing activity: Explanation of the large variability in evoked cortical responses. Science, 273, 1868–1871.
Devilbiss, D. M., & Waterhouse, B. D. (2000). Norepinephrine exhibits two distinct profiles of action on sensory cortical neuron responses to excitatory synaptic stimuli. Synapse, 37, 273–282.
Engel, A. K., Fries, P., & Singer, W. (2001). Dynamic predictions: Oscillations and synchronization in top-down processing. Nature Rev. Neurosci., 2, 704–717.
Eysel, U. T. (1992). Lateral inhibitory interactions in areas 17 and 18 of the cat visual cortex. Prog. Brain Res., 90, 407–422.
Eysel, U. T., Shevelev, I. A., Lazareva, N. A., & Sharaev, G. A. (1998). Orientation tuning and receptive field structure in cat striate neurons during local blockade of intracortical inhibition. Neuroscience, 84, 25–36.
Fairen, A., DeFelipe, J., & Redidor, J. (1984). Nonpyramidal neurons: General account. In A. Peter & E. G. Jones (Eds.), Cerebral cortex: Cellular components of the cerebral cortex (pp. 201–245). New York: Plenum Press.
Foote, S. L., Bloom, F. E., & Aston-Jones, G. (1983). Nucleus locus ceruleus: New evidence of anatomical and physiological specificity. Physiol. Rev., 63, 844–914.
Funahashi, S., Bruce, C. J., & Goldman-Rakic, P. S. (1989). Mnemonic coding of visual space in the monkey's dorsolateral prefrontal cortex. J. Neurophysiol., 61, 331–349.
Gu, Q. (2002). Neuromodulatory transmitter systems in the cortex and their role in cortical plasticity. Neuroscience, 111, 815–835.
Gupta, A., Wang, Y., & Markram, H. (2000). Organization principles for a diversity of GABAergic interneurons and synapses in the neocortex. Science, 287, 273–278.
Hasselmo, M. E., Linster, C., Patil, M., Ma, D., & Cekic, M. (1997). Noradrenergic suppression of synaptic transmission may influence cortical signal-to-noise ratio. J. Neurophysiol., 77, 3326–3339.
Hinton, G. E., & Sejnowski, T. J. (1986). Learning and relearning in Boltzmann machines. In D. E. Rumelhart & J. L. McClelland (Eds.), Parallel distributed processing (Vol. 1, pp. 282–317). Cambridge, MA: MIT Press.
Hoshino, O., Inoue, S., Kashimori, Y., & Kambara, T. (2001). A hierarchical dynamical map as a basic frame for cortical mapping and its application to priming. Neural Comput., 13, 1781–1810.
Hoshino, O., Kashimori, Y., & Kambara, T. (1998). An olfactory recognition model based on spatio-temporal encoding of odor quality in olfactory bulb. Biol. Cybern., 79, 109–120.
Hoshino, O., Zheng, M. H., & Kuroiwa, K. (2002). Roles of dynamic linkage of stable attractors across cortical networks in recalling long-term memory. Biol. Cybern., 88, 163–176.
Johnson, E. S., Roberts, M. H., & Straughan, D. W. (1969). The responses of cortical neurones to monoamines under differing anaesthetic conditions. J. Physiol., 203, 261–280.
Kasamatsu, T., & Heggelund, P. (1982). Single cell responses in cat visual cortex to visual stimulation during iontophoresis of noradrenaline. Exp. Brain Res., 45, 317–327.
Keeler, J. D., Pichler, E. E., & Ross, J. (1989). Noise in neural networks: Thresholds, hysteresis, and neuromodulation of signal-to-noise. Proc. Natl. Acad. Sci. USA, 86, 1712–1716.
Kisvarday, Z. F., Kim, D. S., Eysel, U. T., & Bonhoeffer, T. (1994). Relationship between lateral inhibitory connections and the topography of the orientation map in cat visual cortex. Eur. J. Neurosci., 6, 1619–1632.
Kobayashi, M., Imamura, K., Sugai, T., Onoda, N., Yamamoto, M., Komai, S., & Watanabe, Y. (2000). Selective suppression of horizontal propagation in rat visual cortex by norepinephrine. Eur. J. Neurosci., 12, 264–272.
Krimer, L. S., & Goldman-Rakic, P. S. (2001). Prefrontal microcircuits: Membrane properties and excitatory input of local, medium, and wide arbor interneurons. J. Neurosci., 21, 3788–3796.
Leventhal, A. G., Wang, Y., Pu, M., Zhou, Y., & Ma, Y. (2003). GABA and its agonists improved visual cortical function in senescent monkeys. Science, 300, 812–815.
Martin, K. A. C. (2002). Microcircuits in visual cortex. Curr. Opin. Neurobiol., 12, 418–425.
McCormick, D. A. (1989). Cholinergic and noradrenergic modulation of thalamocortical processing. Trends Neurosci., 12, 215–221.
Moises, H. C., Woodward, D. J., Hoffer, B. J., & Freedman, R. (1979). Interaction of norepinephrine with Purkinje cell response to putative amino acid neurotransmitters applied by microiontophoresis. Exp. Neurol., 64, 493–515.
Mouradian, R., Sessler, F. M., & Waterhouse, B. D. (1991). Noradrenergic potentiation of excitatory transmitter action in cerebrocortical slices: Evidence for mediation by an α1 receptor-linked second messenger pathway. Brain Res., 546, 83–95.
Pulvermüller, F., Keil, A., & Elbert, T. (1999). High-frequency brain activity: Perception or active memory? Trends Cogn. Sci., 3, 250–253.
Rao, S. 
G., Williams, G. V., & Goldman-Rakic, P. S. (1999). Isodirectional tuning of adjacent interneurons and pyramidal cells during working memory: Evidence for microcolumnar organization in PFC. J. Neurophysiol., 81, 1903–1916. Rao, S. G., Williams, G. V., & Goldman-Rakic, P. S. (2000). Destruction and creation of spatial tuning by disinhibition: GABA(A) blockade of prefrontal cortical neurons engaged by working memory. J. Neurosci., 20, 485–494. Sarvey, J. M. (1988). Protein synthesis in long-term potentiation and norepinephrineinduced long-lasting potentiation in hippocampus. In P. W. Lanfield & S. Deadwyler (Eds.), Long-term potentiation: From biophysics to behavior (pp. 329–353). New York: Liss. Sejnowski, T. (1981). Skeleton filters in the brain. In G. Hinton & J. Anderson (Eds.), Parallel models of associative memory (pp. 189–212). Hillsdale, NJ: Erlbaum. Sessler, F. M., Cheng, J. T., & Waterhouse, B. D. (1988). Electrophysiological actions of norepinephrine in rat lateral hypothalamus I. Norepinephrine-induced modulation of LH neuronal responsiveness to afferent synaptic inputs and putative neurotransmitters. Brain Research, 446, 77–89. Sessler F. M., Lin, W., Firifides, M. L., Mouradian, R., Lin, R. C. S., & Waterhouse, B. D. (1995). Noradrenergic enhancement of GABA-induced input resistance changes
1774
O. Hoshino
in layer V regular spiking pyramidal neurons of rat somatosensory cortex. Brain Research, 675, 171–182. Sillito, A. M. (1984). Functional considerations of the operation of GABAergic inhibitory processes in the visual cortex. In A. Peter & E. G. Jones (Eds.), Cerebral cortex (pp. 91–117). New York: Plenum Press. Somogyi, P., Kisvarday, Z. F., Martin, K. A. C., & Whitteridge, D. (1983). Synaptic connections of morphologically identified and physiologically characterized large basket cells in the striate cortex of cat. Neuroscience, 10, 261–294. Tsodyks, M., Kenet, T., Grinvald, A., & Arieli, A. (1999). Linking spontaneous activity of single cortical neurons and the underlying functional architecture. Science, 286, 1943–1946. Usher, M., & Davelaar, E. J. (2002). Neuromodulation of decision and response selection. Neural Networks, 15, 635–645. Wang, Y., Gupta, A., Toledo-Rodriguez, M., Wu, C. Z., & Markram, H. (2002). Anatomical, physiological, molecular and circuit properties of nest basket cells in the developing somatosensory cortex. Cerebral Cortex, 12, 395–410. Waterhouse, B. D., Azizi, S. A., Burne, R. A., & Woodward, D. J. (1990). Modulation of rat cortical area 17 neuronal responses to moving visual stimuli during norepinephrine and serotonin microiontophoresis. Brain Research, 514, 276–292. Waterhouse, B. D., Devilbiss, D. M., Fleischer, D., Sessler, F. M., & Simpson, K. L. (1998). New perspective on the functional organization and postsynaptic influences of locus ceruleus efferent projection system. Adv. Pharmacol., 42, 749–754. Waterhouse, B. D., Moises, H. C., & Woodward, D. J. (1981). Alpha-receptor-mediated facilitation of somatosensory cortical neuronal responses to excitatory synaptic inputs and iontophoretically applied acetylcholine. Neuropharmacology, 20, 907– 920. Waterhouse, B. D, Moises, H. C., & Woodward, D. J. (1998). 
Phasic activation of the locus coeruleus enhances responses of primary sensory cortical neurons to peripheral receptive field simulation. Brain Research, 790, 33–44. Waterhouse, B. D., Moises, H. C., Yeh, H. H., & Woodward, D. J. (1982). Norepinephrine enhancement of inhibitory synaptic mechanisms in cerebellum and cerebral cortex: Mediation by beta adrenergic receptors. J. Pharmacol. Exp. Ther., 221, 495–506. Waterhouse, B. D., Mouradian, R., Sessler, F. M., & Lin, R. C. S. (2000). Differential modulatory effects of norepinephrine on synaptically driven responses on layer V barrel field cortical neurons. Brain Research, 868, 39–47. Waterhouse, B. D., Sessler, F. M., Cheng, J. T., Woodward, D. J., Azizi, S. A., & Moises, H. C. (1988). New evidence for a gating action of norepinephrine in central neuronal circuits of mammalian brain. Brain Research Bulletin, 21, 425–432. Waterhouse, B. D., & Woodward, D. J. (1980). Interaction of norepinephrine with cerebrocortical activity evoked by stimulation of somatosensory afferent pathways in the rat. Exp. Neurol., 67, 11–34. Wilson, F. A., O’Scalaidhe, S. P., & Goldman-Rakic, P. S. (1994). Functional synergism between putative γ -aminobutyrate-containing neurons and pyramidal neurons in prefrontal cortex. Proc. Natl. Acad. Sci. USA, 91, 4009–4013. Woodward, D. J., Moises, H. C., Waterhouse, B. D. Heh, H. H., & Cheun, J. E. (1991). Modulatory actions of norepinephrine on neural circuits. In S. Kito, T. Segawa,
Cognitive Enhancement Mediated Through Norepinephrine
1775
& R. W. Olsen (Eds.), Neuroreceptor mechanisms in brain (pp. 193–208). New York: Plenum Press. Woodward, D. J., Moises, H. C., Waterhouse, B. D. Hoffer, B. J., & Freedman, R. (1979). Modulatory actions of norepinephrine in the central nervous system. Federation Proc., 38, 2109–2116. Yost, W. A. (1994). Fundamentals of hearing San Diego, CA: Academic Press. Zilberter, Y. (2000). Dendritic release of glutamate suppresses synaptic inhibition of pyramidal neurons in rat neocortex. J. Physiology, 538, 489–496.
Received April 23, 2004; accepted December 9, 2004.
LETTER
Communicated by Nigel Goddard
Advancing the Boundaries of High-Connectivity Network Simulation with Distributed Computing
Abigail Morrison
[email protected] Computational Neurophysics, Institute of Biology III and Bernstein Center for Computational Neuroscience, Albert-Ludwigs-University, 79104 Freiburg, Germany
Carsten Mehring
[email protected] Department of Zoology, Institute of Biology I and Berstein Center for Computational Neuroscience, Albert-Ludwigs-University, 79104 Freiburg, Germany
Theo Geisel
[email protected] Department of Nonlinear Dynamics, Max-Planck-Institute for Dynamics and Self Organization, 37018 G¨ottingen, Germany
Ad Aertsen
[email protected] Neurobiology and Biophysics, Institute of Biology III and Bernstein Center for Computational Neuroscience, Albert-Ludwigs-University, 79104 Freiburg, Germany
Markus Diesmann
[email protected] Computational Neurophysics, Institute of Biology III and Bernstein Center for Computational Neuroscience, Albert-Ludwigs-University, 79104 Freiburg, Germany
The availability of efficient and reliable simulation tools is one of the mission-critical technologies in the fast-moving field of computational neuroscience. Research indicates that higher brain functions emerge from large and complex cortical networks and their interactions. The large number of elements (neurons) combined with the high connectivity (synapses) of the biological network and the specific type of interactions impose severe constraints on the explorable system size that previously have been hard to overcome. Here we present a collection of new techniques combined into a coherent simulation tool removing the fundamental obstacle in the computational study of biological neural networks: the enormous number of synaptic contacts per neuron. Distributing an individual simulation over multiple computers enables the investigation of networks orders of magnitude larger than previously possible. The software scales excellently on a wide range of tested hardware, so it can be used in an interactive and iterative fashion for the development of ideas, and results can be produced quickly even for very large networks. In contrast to earlier approaches, a wide class of neuron models and synaptic dynamics can be represented.
Neural Computation 17, 1776–1801 (2005)
© 2005 Massachusetts Institute of Technology
1 Introduction It has long been pointed out (Hebb, 1949; Braitenberg, 1978) that cortical processing is most likely carried out by large ensembles (assemblies) of nerve cells, whereby the membership of neurons in various assemblies is exhibited through the complex correlation structure of their spike trains (von der Malsburg, 1981, 1986; Abeles, 1982, 1991; Palm, 1990; Aertsen, Gerstein, Habib, & Palm, 1989; Gerstein, Bedenbaugh, & Aertsen, 1989; Singer, 1993, 1999). Theoretical studies (Shadlen & Newsome, 1998; Diesmann, Gewaltig, & Aertsen, 1999; Salinas & Sejnowski, 2000; Kuhn, Aertsen, & Rotter, 2003) have demonstrated that cortical neurons are indeed sensitive to the higher-order correlation structure of their input. With the experimental technology for multiple single-unit recordings becoming routinely available for animals involved in behavioral tasks, appropriate network models need to be constructed to interpret the results. Due to the nonlinear and stochastic nature of neural systems, simulations have become a research tool of major importance in the developing field of computational neuroscience (Dayan & Abbott, 2001; Koch, 1999; Koch & Segev, 1989). A number of simulation tools have been developed (Genesis: Bower & Beeman, 1997; Neuron: Hines & Carnevale, 1997; XPP: Ermentrout, 2002) and are in widespread use. They are general purpose in the sense that they maintain a layer of abstraction between the neuron model to be simulated and the machinery implementing the network interaction; that is, they are not neuron or network model specific. The primary focus of these tools is small networks of detailed neuron models. One exception is SpikeNET (Delorme & Thorpe, 2003), which is specialized for a specific class of large networks of spiking neurons. A major barrier in the simulation of mammalian cortical networks has been the large number of inputs (afferents) a single neuron receives.
Assuming a biologically realistic level of connectivity, each neuron should have of the order of 10^4 afferents. In order to ensure that the network is sufficiently sparse, a connection probability of 0.1 should be assumed, resulting in a minimal network size of 10^5 neurons, corresponding to roughly a cubic millimeter of cortex (Braitenberg & Schüz, 1998). Such a network contains of the order of 10^9 synapses and has proved to be beyond the memory capacity of the computers available to many researchers. Furthermore, even given a computer with sufficient memory resources, the amount of time required to construct and simulate such networks and
perform reasonable parameter scans is not conducive to rapid scientific progress. Faced with these problems, many researchers have opted to use scaled networks, in which the connection probability remains constant and the synaptic weights are scaled in some manner with the inverse of the total number of connections a neuron receives. This approach is not without its risks; although the mean or, alternatively, the standard deviation of the subthreshold activity can be held constant, this is not true for correlations and combinatorial measures. Furthermore, in such networks, the memory requirement increases quadratically with the number of neurons. Clearly, these disadvantages hold only until the minimal network size is attained, at which point synaptic scaling need no longer be applied and memory requirements increase only linearly with increasing network size. In this letter, we describe without reference to a particular implementation language how distributed computing (i.e., simultaneous execution on multiple processors, where the addressable memory of one processor is not visible to the others) can be used to acquire the memory resources to surpass the threshold of 10^5 neurons and reach a simulation speed suitable for practical work. To our knowledge, this is the first description of a general-purpose simulation scheme that allows the routine investigation of networks of spiking neurons with biologically realistic levels of connectivity. Our design consists of a highly efficient distributed algorithm, based on the serial (i.e., running on one processor) simulation kernel first described in Diesmann, Gewaltig, and Aertsen (1995). Despite the use of a distributed algorithm, a serial interface is presented to the researcher. In addition to the increases in network size and simulation speed enabled by our approach, a key feature of our design is its flexibility due to object orientation (the implementation language being C++; Stroustrup, 1997).
Recent advances in simulation technique have tended to focus on current-based integrate-and-fire point neurons (Mattia & Del Giudice, 2000; Lee & Farhat, 2001; Reutimann, Giugliano, & Fusi, 2003); using our technology, the researcher is not bound by any of these restrictions and can just as easily use conductance-based neurons (e.g., Chance, Abbott, & Reyes, 2002; Destexhe, Rudolph, & Pare, 2003; Kuhn, Aertsen, & Rotter, 2004), non-integrate-and-fire models (Hawkes, 1971), simple compartment-based models (Larkum, Zhu, & Sakmann, 2001; Kumar, Kremkow, Rotter, & Aertsen, 2004), or implement a neuron model of his or her own. Whereas complex compartmental models could theoretically be implemented, there is no support for generating them as in Neuron (Hines & Carnevale, 1997) or Genesis (Bower & Beeman, 1997), and so such models remain out of the reach of the current version. The main constraint is the restriction to the class of neuron models where the interaction between neurons is noninstantaneous (finite delays) and mediated by point events (spikes). Similarly, a large range of synaptic dynamics can be employed, and a network may be entirely heterogeneous in terms of the neuron and synapse models used.
In section 2, we describe the representation and construction of a network, including compression techniques. On this basis, we explain the distributed algorithm for solving the dynamics in section 3. After a brief discussion of the treatment of pseudorandom numbers in section 4, we provide benchmarks for our simulation technology in section 5 using relevant simulation examples and hardware, demonstrating its excellent scalability with respect to number of processors, network activity, and network size on computer clusters and parallel computers (for the purpose of this article, we reserve the term parallel computer for shared memory architectures). Section 6 summarizes the key concepts of our simulation scheme and discusses our approach in the light of future directions of neuroscience research and upcoming computer architectures. Source code detail is beyond the scope of this letter, as is simulation infrastructure such as writing data to files. In the following, the term machine refers to one processor, addressing memory that is assumed to be invisible to other machines. Thus, a computer with two processors and shared memory would be regarded as two machines. The term list is taken in its intuitive meaning of a sequential ordering, which need not necessarily be implemented as the data structure known as list (Aho, Hopcroft, & Ullman, 1983). The research on a distributed simulation kernel described in this article is a module in our long-term collaborative project to provide the technology for neural systems simulations (Diesmann & Gewaltig, 2002). The application of the technology described here has already enabled interesting insights into the nature of cortical dynamics (Mehring, Hehl, Kubo, Diesmann, & Aertsen, 2003; Aviel, Mehring, Abeles, & Horn, 2003; Tetzlaff, Morrison, Geisel, & Diesmann, 2004). Preliminary results have been presented in abstract form (Morrison et al., 2003). 2 Representation of Network Structure 2.1 A Generic Network.
Consider a generic network of spiking point model neurons. In order to represent this, it is helpful to consider the synapses as being separate entities from the neurons. If a synapse is triggered by a spike from its presynaptic neuron, it transmits this information in the form of an event with weight w and delay d to its postsynaptic neuron. Each neuron is assigned a unique index, and we say that the synapse that transmits a spike from neuron i to neuron j is an axonal synapse of i but a dendritic synapse of j. Thus, for a serial algorithm, a list of neurons, each possessing a list of its axonal synapses, whereby each synapse contains the index of its postsynaptic neuron, would suffice to represent the network structure completely. This is analogous to the adjacency lists commonly used to represent graphs (Gross & Yellen, 1999) and is illustrated in Figure 1. In order to allow maximum flexibility in the structure and heterogeneity of
Figure 1: Data structures for a serial simulation scheme. The network comprises N uniquely indexed neurons (left column), and each neuron is assigned a list of axonal synapses (right column). Each neuron contains its own state variables, as does each synapse (illustrated by close-up). In addition, each synapse contains the index of its postsynaptic neuron.
the networks that can be simulated, an object-oriented approach is highly advantageous, in which each individual neuron and synapse maintains its own parameters, as depicted in the close-up in Figure 1, and performs its own dynamics rather than being subject to a global algorithm. For a distributed algorithm, matters are somewhat more complicated. The most obvious requirement for distributing a simulation is that the neurons are distributed. The simplest possible load balancing is to assign an equal number of neurons to each machine. As there may be different neuron models with different associated computational costs in the network, this assignment is performed using a modulo operation, so that contiguous blocks of neurons (the most intuitive way of defining neuron populations) are dealt out among all the machines, resulting in a good estimate of the fairest load distribution. For simplicity, here and in the remainder of the article, we ignore this assignment and assume that neurons 1 to N/m, where m is the number of machines, are located on the first machine, neurons N/m + 1 to 2N/m on the second machine, and so on. Clearly, the synapses must also be distributed. We distribute the axonal synapses of each neuron as follows: on each machine, there are N lists of synapses, one for each neuron in the network. A synapse from neuron i to neuron j is stored in the ith list on the machine owning neuron j. Alternatively, from a biological point of view, we say the axon of a neuron is distributed, but its dendrite is local, as anticipated in Mattia and Del Giudice (2000). This seems less intuitive than distributing the dendrite and keeping the axon local, but confers a considerable advantage in communication costs, as discussed in section 3.
Figure 2: Data structures for a distributed simulation scheme. As an example, the network shown in Figure 1 is distributed (top) over two machines. Each machine (bottom panels) contains N/2 neurons (center column) and N lists of synapses (left column), one list for each neuron in the network. All synapses in the ith list are synapses from neuron i to neurons on the local machine. The dashed arrows indicate connections that cross machine boundaries. In addition to the data structures required for the serial algorithm (see Figure 1), each neuron is assigned a list of machines (right column). The machine list for neuron i specifies all the machines on which neuron i has targets (i.e., all machines where the ith synapse list is not empty).
Instead of a list of axonal synapses, each neuron is assigned a list of the machines on which it has targets; that is, the machine list for i contains exactly those machines on which the ith synapse list is not empty. This scheme is depicted in Figure 2. Further network elements not shown in Figure 2 include neuron input devices such as current generators and observation devices such as membrane potential recorders and spike detectors, which interact with any or all of the local neurons.
2.2 Network Construction. As the synapses are represented on the same machine as their postsynaptic neuron, the network must also be connected from the postsynaptic side: for each neuron, its afferent neurons must be specified and a synapse added to the local synapse list of each of those neurons. Once the inputs for each neuron have been established, one complete exchange between the machines (see section 3.4) suffices to propagate the connectivity information necessary to construct the machine list for each neuron. Constructing the network is therefore a fully parallelizable activity that results in a near-linear speed-up (see Figure 7). This is an important feature of our technology, as without parallelization, the wiring of a network can account for a large fraction of the total run time, especially as biological levels of connectivity are approached. Despite the highly parallel nature of the construction, the user needs no knowledge about the location of the neurons and can specify connections as if it were a serial application. On the most basic level, a command Connect(i, j) is ignored almost instantaneously on all machines except the one owning neuron j. For higher-level connection commands, such as one for connecting a random network, we make use of the fact that a machine can determine at the beginning of such a routine which neurons belong to it. The desired method can then be applied to all neurons within the appropriate bounds without having to check ownership of each neuron, thus achieving an even higher degree of code parallelization. 2.3 Network Compression. Simulations of large biological neural networks are by their very nature highly consumptive of memory.
Although the emphasis of this article is on the use of distributed computing to provide the necessary memory resources, it is clear that any compression of the network structures will be beneficial, as it increases the size of network that can be investigated on any given hardware and decreases the simulation time for any given network. We therefore explain briefly the general principles of how redundancy can be reduced without special knowledge of the network structure in order to reduce the memory demands of a simulation. We will consider the simplest synapse possible, a static synapse consisting of a constant weight w, a constant delay d, and the index of the postsynaptic neuron i. Ignoring for simplicity the memory overhead involved in creating the synapse object, a naive representation of a synapse would therefore require a number of bytes M_s = M_w + M_d + M_i, and a network of N neurons with K synapses each would require N K M_s bytes purely for the synapses. To give some idea of the scale of the problem, typical values for M_w, M_d, and M_i on a 32-bit system are 8, 4, and 4 bytes, respectively, and a network containing N = 10^5 neurons with a connectivity of K = 10^4 synapses each would require 16 GB. However, instead of storing the postsynaptic index for each synapse in a list, we keep the list sorted in order of increasing index and store just the differences between them. If this difference is less than 255, it can be stored in 1 byte. If the difference happens
to be too large, an appropriate entry is made in an overflow list. In fact, in most applications, this is a rare occurrence. In the network mentioned above, if connected using a uniform distribution, the average difference between neurons is 10. If networks of orders of magnitude larger than this were to be investigated, spatial structure would have to be taken into consideration to avoid obtaining a biologically unrealistic sparseness. It should be noted that the requirement of keeping the target lists sorted is fulfilled without extra costs if, given indices j_1, j_2 of neurons located on a particular machine with j_1 ≤ j_2, the commands establishing the connections are issued in the sequence ..., Connect(i_1, j_1), ..., Connect(i_2, j_2), ..., the order and location of the presynaptic neurons being irrelevant. This technique reduces the amount of memory for each synapse to M_s ≈ M_w + M_d + 1. Further compression can be achieved by considering the distributions of the synaptic delays and weights. In the best-case scenario, as far as memory is concerned, each neuron makes only one kind of axonal synapse: w_ij = w_i and d_ij = d_i. In this case, the parameters have to be stored only once per list rather than once per synapse, resulting in M_s ≈ 1 and reducing the memory requirements for the above example network to approximately 1 GB. In the next best case, each neuron makes axonal synapses with weights and delays that take on only a few possible values (see, e.g., Brunel, 2000). Here, several synaptic lists will be initialized, one for each combination encountered. This produces compression almost as good as in the ideal case. However, there is a non-negligible memory overhead associated with the construction of each list, so for a broad distribution of delays, it is more efficient to represent them in terms of their difference from a reference value. For many applications, this difference can be expressed in 1 byte, resulting in M_s ≈ M_w + 1 + 1.
Similar to the representation of the postsynaptic neurons, a value too far away from the reference value can be stored in an overflow list. This kind of compression is applicable only to discrete-valued variables such as delay. There is currently no way to compress a continuous distribution. These methods can be easily extended to more complicated synapse objects, but the onus is on the user to be aware of the redundancies in the network to be investigated and choose the appropriate kind of compression. As illustrated above, depending on the heterogeneity of the network, eliminating redundancy can reduce the memory requirements significantly, so that even large networks can be simulated on hardware available to modest budgets. Conversely, with access to large clusters or parallel machines, it is possible to investigate networks orders of magnitude larger than previously possible. 3 Simulation Dynamics 3.1 Time Driven or Event Driven? A Hybrid Approach to Simulation. The two classic approaches to simulation are time driven and event driven, also known as synchronous and asynchronous algorithms. We pursue a
hybrid strategy whereby the neurons are updated at every time step, but a synapse is updated only if its presynaptic neuron produces a spike. A purely event-driven algorithm such as that proposed by Mattia and Del Giudice (2000) is unsuitable for our purposes because it places too many restrictions on the classes of neurons that can be simulated. For example, any neuron class in which a spike induces a continuous postsynaptic current would be very difficult to implement if a neuron is updated only on the arrival of a spike event. Furthermore, the motivation for these algorithms is the fact that the state of a current-based integrate-and-fire (IAF) neuron can be directly interpolated between events, where events are assumed to be rare. This perceived computational advantage dwindles rapidly as the frequency of events increases: a neuron with 10^4 afferent connections firing at just 1 Hz receives events at a rate of 10,000 Hz and is therefore at least as expensive to simulate event driven as on a time grid with a step size of 0.1 ms. In Reutimann et al. (2003), a solution to the problems of rise times and event saturation is presented for IAF neurons at the cost of large look-up tables and restrictions on the type of background population activity. By updating the neurons in fixed time steps and applying exact integration techniques where possible (Rotter & Diesmann, 1999), we maintain a highly flexible simulation environment at no grave computational cost. As a consequence of this scheme, spike times are constrained to the time grid and are expressed as multiples of h, the time step or computational resolution. It should, however, be noted that a neuron can perform arbitrary dynamics within this time step, including using an internal resolution much finer than the one the spikes are constrained to. Conversely, a purely time-driven algorithm is equally unsuitable. Synapses are by far the most numerous elements in a network.
This is the case even when synaptic scaling is employed (i.e., using fewer but stronger synapses), but particularly so if biologically realistic connectivity is assumed. Clearly, updating 10^9 synapses every time step would have catastrophic consequences for simulation times. Fortunately, although the above consideration that events are rare is not valid for neurons, it is valid for individual synapses: in the situation described above, a synapse processes events at just 1 Hz. Assuming that the synaptic state can be calculated from its previous state, the time since its last update, and information available from the postsynaptic neuron, a synapse need be updated only when it transfers a spike event. In fact, a wide range of synaptic dynamics falls into this category, including synaptic depression (Thomson & Deuchars, 1994), synaptic redistribution (Markram & Tsodyks, 1996), and spike-time-dependent plasticity (Bi & Poo, 1998, 2001); for a review, see Abbott and Nelson (2000). By combining the flexibility of a time-driven algorithm with the speed of an event-driven algorithm, a highly functional and fast simulation environment is achieved.
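As an illustration of such lazy synapse updating (this is a sketch, not the actual NEST code; the depression model, class name, and constants are hypothetical), a synapse can advance its state analytically across the entire silent interval since its last presynaptic spike:

```python
# Illustrative sketch: a depressing synapse updated lazily, only when its
# presynaptic neuron spikes. Between spikes the synaptic resource recovers
# analytically, so the state can be advanced across any elapsed interval in
# a single step. All names and constants here are hypothetical.
import math

class LazyDepressingSynapse:
    def __init__(self, weight, tau_rec=800.0, u=0.5):
        self.weight = weight      # maximal efficacy
        self.tau_rec = tau_rec    # recovery time constant (ms)
        self.u = u                # fraction of resource used per spike
        self.resource = 1.0       # available fraction of synaptic resource
        self.last_update = 0.0    # time of last presynaptic spike (ms)

    def transmit(self, t):
        """Called only on a presynaptic spike at time t; advances the state
        across the gap since the last update and returns the effective weight."""
        dt = t - self.last_update
        # analytic recovery over the whole silent interval
        self.resource = 1.0 - (1.0 - self.resource) * math.exp(-dt / self.tau_rec)
        eff = self.weight * self.u * self.resource
        self.resource -= self.u * self.resource  # depression caused by this spike
        self.last_update = t
        return eff

syn = LazyDepressingSynapse(weight=1.0)
w1 = syn.transmit(10.0)   # first spike: full resource available
w2 = syn.transmit(11.0)   # immediate second spike: depressed efficacy
assert w1 > w2
```

The synapse is touched only when it transfers a spike, exactly the property exploited by the hybrid scheme.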
Distributed High-Connectivity Network Simulation
Figure 3: Update of neuron state. The flowchart (left) defines the sequence of operations required to propagate the state y of an individual neuron by one time step h. Operator Fh performs the subthreshold dynamics, and operator G modifies the state according to the incoming events. Event buffers (only one is shown, right) contain incoming events for the neuron. They are rotated at the end of each time step so that for any simulation time t, the current read position (indicated by the black read head symbol) always provides the input scheduled to arrive at simulation time t + h.
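The cycle in Figure 3 might be sketched as follows for a leaky integrate-and-fire neuron with jump inputs; the model constants are illustrative assumptions, the exact off-by-one convention for buffer indexing depends on when the rotation is performed, and refractory handling is omitted for brevity:

```python
# Illustrative sketch of the Figure 3 update cycle (hypothetical constants,
# refractory dynamics omitted).
import math

H = 0.1                       # time step h (ms)
TAU = 10.0                    # membrane time constant (ms)
P = math.exp(-H / TAU)        # exact propagator for the leak term
THETA = 1.0                   # spiking threshold

def update(V, ring, pos):
    V = P * V                 # F_h: subthreshold dynamics, ignoring new events
    V += ring[pos]            # G: apply summed weight w_{t+h} from the buffer
    ring[pos] = 0.0           # the read segment is erased ...
    pos = (pos + 1) % len(ring)   # ... and the buffer rotated one segment
    spiked = V >= THETA       # spiking criterion
    if spiked:
        V = 0.0               # spike generation dynamics: reset
    return V, spiked, pos

ring = [0.0] * 16             # event buffer: a primitive looped tape
pos, V = 0, 0.0
# an event of weight w due in d steps goes d segments upstream of the read
# head; weights arriving at the same segment are summed:
d, w = 3, 1.2
ring[(pos + d) % len(ring)] += w
spikes = []
for step in range(1, 6):
    V, spiked, pos = update(V, ring, pos)
    if spiked:
        spikes.append(step)
```

Because each ring-buffer segment holds a single summed weight, delivering an event costs one addition regardless of how many events arrive in the same step.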
3.2 Neuron Update. At each time step, each neuron is updated from its state at time t, yt = (y1 , y2 , . . . , yn )t , to its state at time t + h, where h is the temporal resolution of the simulation. This update is performed on the basis of yt and wt+h , the summed weight of all events arriving at time t + h. A flowchart of the update is shown in Figure 3, a concrete example is given in Diesmann, Gewaltig, Rotter, and Aertsen (2001), and the theory for grid-based simulation is developed in Rotter and Diesmann (1999). First, the subthreshold dynamics of the system is applied to the state vector; that is, the state of the neuron is calculated without taking new events into consideration. Next, wt+h is read out of the neuron’s event buffer, also shown in Figure 3. The event buffer can be thought of as a primitive looped tape device; at each time step, the current value can be read off and erased. At the end of a time step, all event buffers are rotated one segment so that the next value is available to be read in the next time step. Only one such buffer is depicted; in fact, the number of buffers is dependent on the dynamics of the neuron model. The provisional new state of the neuron is then updated on the basis of wt+h . Now the provisional state of the neuron reflects the values valid at time t + h with respect to the neuron’s subthreshold dynamics. If at this point the state of the neuron fulfills its spiking criteria (e.g., passing a threshold potential), the state is updated once again according to the appropriate spike generation dynamics (such as resetting the membrane
potential). The emitted spike is assigned to the time t + h, and the information that it has occurred must be transmitted to the neuron’s distributed axonal synapses. This calculation of the new state of the neuron is defined by the individual neuron model rather than by a global algorithm. In many cases, it is possible to avoid using computationally expensive differential equation solvers by applying a propagator matrix to the neuron state to integrate exactly (Rotter & Diesmann, 1999). Clearly, the use of event buffers, already present in the original serial version (Diesmann et al., 1995; Gewaltig, 2000), obviates the requirement for a centralized event queuing system; an event of weight w due to arrive at the neuron d time steps in the future can be added to the event buffer d steps upstream from the current reading position. Obviously, causality requires a nonzero transmission delay, that is, d ≥ 1. It is important to note the implicit assumption made here that the weights of incoming events can be summed, as each ring buffer segment contains just one value, which is incremented by the weights of successive incoming events. It does not, however, imply that an incoming event may only cause a discontinuous jump in a state variable of the neuron. The interpretation of the weights is left up to the individual neuron model. For example, one model may interpret the weights as the magnitude of a jump in the membrane potential and another as the maximum value of an alpha function (Jack, Noble, & Tsien, 1983; Bernard, Ge, Stockley, Willis, & Wheal, 1994) describing the change of conductance. Nor does it imply that all the events received by a neuron necessarily induce identical dynamics. Models with different time constants for excitatory and inhibitory input, for example, or with several compartments (Kumar et al., 2004) are implemented by giving the neuron access to several such event buffers. 3.3 Index Buffering. 
At first glance it seems as if communication between machines should take place after every time step in order to convey the information of which neurons spiked during that time step. Fortunately, this is not so. If the minimum synaptic delay is dmin · h, then a neuron spiking at time t cannot have an effect on any postsynaptic neuron at a time earlier than t + dmin · h. Therefore, if the spikes can be stored maintaining their temporal order, it is sufficient to communicate in intervals of dmin time steps. This also represents the maximum possible communication interval; any greater communication interval would result in events arriving with delays longer than those specified by the user. This communication scheme is completely independent of the temporal resolution; simulating on a finer time grid does not increase the frequency of communication. Maintaining the temporal ordering is easily done. In each time step, the indices of all spiking neurons can be stored in a buffer as illustrated in Figure 4, and at the end of the time step, a marker is inserted into the buffer to separate the spikes of successive time steps. Note that if the axonal synapses were local and the dendritic synapses distributed, the synaptic weight, delay,
Figure 4: Target machine-specific buffering of local events. In this example, neuron 6 located on machine a (see Figure 2) produces a spike. Its list of target machines contains the identifiers for machines a and b. The index 6 is appended to the index buffers for these machines.
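As a toy illustration of this machine-specific buffering with end-of-step markers (the neuron indices and target-machine lists below are hypothetical example data):

```python
# Toy sketch of section 3.3: spike indices are filtered into machine-specific
# buffers, and a marker closes each time step so temporal order survives the
# exchange every d_min steps. Connectivity data are hypothetical.
MARK = None                          # marker separating successive time steps
targets = {6: ["a", "b"], 2: ["a"]}  # neuron -> machines hosting its targets

buffers = {"a": [], "b": []}
d_min = 3                            # communicate every d_min time steps
schedule = {0: [6], 2: [2]}          # step -> neurons spiking in that step
for step in range(d_min):            # one communication interval
    for n in schedule.get(step, []):
        for machine in targets[n]:   # send only where the neuron has targets
            buffers[machine].append(n)
    for buf in buffers.values():     # marker closes this time step
        buf.append(MARK)

# the buffers are now ready to be exchanged with machines a and b
assert buffers["a"] == [6, MARK, MARK, 2, MARK]
assert buffers["b"] == [6, MARK, MARK, MARK]
```

Only source indices are communicated; the weight and delay information lives with the synapses on the receiving machine.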
and index of every target neuron would have to be communicated, at a cost of Mc = K· (Mw + Md + Mi ) per spike. Using a representation of the network structure where the postsynaptic neuron maintains the information about the weights and delays of incoming connections, as is the case in several simulators (e.g., Bower & Beeman, 1997), would reduce this cost to Mc = K · Mi . Assuming the same sizes of the synaptic parameters as in section 2.3, this amounts to a reduction factor of 4. However, in our representation, the entire synaptic structure, including the index of the postsynaptic neuron, is stored physically on the postsynaptic side but logically on the presynaptic side (see section 2.1). This means that it suffices to send merely the index of the source neuron to every machine on which it has a target, so the communication cost is no longer proportional to the connectivity of the neuron but to the number of machines: Mc = m · Mi . This is a reduction factor of K /m, which for biologically realistic levels of connectivity can be of the order of 104 and so adequately justifies the otherwise counterintuitive decision to distribute the axonal rather than dendritic synapses. A further reduction in communication bulk results from sending spike information only to where it is needed. In Figure 4, each neuron is shown to have a list of the machines on which it has target neurons. This information can be used to filter the indices into machine-specific buffers. At the end of the dmin interval, these buffers can be exchanged with the corresponding machines. By distributing axonal synapses, communicating in intervals of dmin , and sending spike information only to where it is needed, a communication scheme is achieved with both minimal bulk and frequency. 3.4 Buffer Exchange. The communication itself is performed using routines from the Message Passing Interface (MPI) library (Pacheco, 1997). The two basic communication types are blocking and nonblocking. In blocking
communication, a Send (b) on machine a requires a corresponding call of Receive (a ) on machine b, and both machines wait until these calls have been successfully completed. In nonblocking communication, the data to be sent from machine a to machine b are written into a buffer that can be picked up by b when it is ready, and a continues with its next instruction without waiting. The latter paradigm is potentially more efficient; we use the former as it is more robust: the MPI library definition does not fully specify how data that have not yet been retrieved are to be buffered. Furthermore, as each of the machines needs to receive event buffers from every other machine before it can continue and the amount of time spent in communication is small compared to the amount of time required to update the neurons (see section 5), using nonblocking communication could result in only a minimal increase in performance. Given m machines, there are consequently m(m − 1)/2 individual bidirectional exchanges that need to be ordered carefully in order to prevent deadlock (for example, if a tries to send to b, while b tries to send to c and c tries to send to a ). This is equivalent to an edge coloring problem, where each machine is a vertex of a fully connected graph, and each edge represents the exchange of buffers between the two machines it connects. Each vertex may have only one edge of each color, and edges that are the same color correspond to exchanges that can be carried out in parallel without causing deadlock. The Complete Pairwise EXchange (CPEX) algorithm (Tam & Wang, 2000; Gross & Yellen, 1999) is a simple constructive process that produces sets of edges to enable the graph to be colored with the minimum of colors or, equivalently, the order of exchanges for maximally efficient communication with no deadlock. In Figure 5A, the ordering of exchanges is illustrated for a network with five machines. 
It should be noted that the algorithm is more efficient if an even number of machines is involved (see Figure 5B), resulting in m − 1 communication steps (colors) rather than m steps for odd m, where in every step, one machine is idle. 3.5 Event Delivery. The received buffers are then sequentially processed by reading off the indices one by one and activating the corresponding synapses, as illustrated in Figure 6. If the index of neuron i is read off before the first marker has been read, this means that neuron i spiked dmin time steps ago. The synapses of i are activated, and for each postsynaptic neuron j, an event of weight wij with synaptic delay dij · h is produced. This weight is then written to the appropriate event buffer of neuron j, dij − dmin time steps on from the current reading position, thus maintaining the correct temporal ordering of events. Indices between the first and second markers correspond to neurons that spiked in the second time step following the last buffer exchange; accordingly, the resulting events have their synaptic delays decremented by dmin − 1. This process continues until all indices from the buffer have been read off, at which point the next buffer can be processed in
Figure 5: Illustration of the complete pairwise exchange (CPEX) algorithm as an edge coloring problem. Machines correspond to the nodes of the graph (filled circles) and communication routes to the edges (lines). (A) The five communication steps required for a computer cluster with m = 5 machines are represented by the sequence of graphs. In each step (from left to right), two pairs of machines (connecting edges highlighted by thick lines) exchange their messages. In the edge coloring terminology, this corresponds to the application of a different color. Using this odd number of machines, the progress of the algorithm can be visualized by the clockwise rotation of the highlighted parallel edges. (B) The m − 1 communication steps required for m = 6 machines. Same display as in A.
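A schedule with the properties described above can be generated with the classic round-robin "circle" method (fix one node, rotate the rest); this sketch is not the published CPEX pseudocode, but for even m it likewise yields m − 1 deadlock-free rounds in which every pair of machines meets exactly once:

```python
# Round-robin "circle" method: each round is a perfect matching of the
# machines, so all exchanges within a round can run in parallel without
# deadlock. Sketch only; see Tam & Wang (2000) for the CPEX algorithm.
def pairwise_exchange_schedule(m):
    assert m % 2 == 0, "pad odd m with a dummy (idle) machine"
    nodes = list(range(m))
    rounds = []
    for _ in range(m - 1):
        # pair the ends of the list inward: (n0,n5), (n1,n4), (n2,n3) for m=6
        pairs = [(nodes[i], nodes[m - 1 - i]) for i in range(m // 2)]
        rounds.append(pairs)
        # rotate all nodes except the fixed node 0
        nodes = [nodes[0]] + [nodes[-1]] + nodes[1:-1]
    return rounds

sched = pairwise_exchange_schedule(6)
assert len(sched) == 5                       # m - 1 communication steps
for rnd in sched:
    used = sorted(n for pair in rnd for n in pair)
    assert used == list(range(6))            # perfect matching: nobody idle
# every unordered pair of machines occurs exactly once across all rounds
all_pairs = {frozenset(p) for rnd in sched for p in rnd}
assert len(all_pairs) == 6 * 5 // 2
```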
Figure 6: Delivery of received events. In this example, dmin = 3, and so as the index 6 is read out of the section of the buffer received from machine a (left) before the first marker (hatched block), neuron 6 spiked three time steps ago. This information is passed to the synapse list of neuron 6 on this machine (center). This list contains a synapse with the postsynaptic neuron 7. It produces an event of weight w and delay d, which is placed in the event buffer (cf. Figure 3) of neuron 7 (right) not d, but d − 3 segments on from the current reading position, to take account of the communication lag of three time steps.
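The lag correction in Figure 6 might be sketched as follows (toy connectivity and weights; the marker position encodes how long ago each spike occurred):

```python
# Toy sketch of event delivery (section 3.5) with d_min = 3. Indices before
# the first marker belong to spikes emitted d_min steps ago, so their
# synaptic delay is corrected by the full communication lag; each subsequent
# marker reduces the correction by one. Connectivity data are hypothetical.
MARK = None
d_min = 3
synapses = {6: [(7, 0.8, 5)]}        # source -> [(target, weight, delay_steps)]
ring = {7: [0.0] * 16}               # event buffers of local target neurons
read_pos = 0                         # current read head of the target buffers

def deliver(received):
    lag = d_min
    for entry in received:
        if entry is MARK:
            lag -= 1                 # spikes from later steps need less correction
            continue
        for target, w, d in synapses.get(entry, []):
            slot = (read_pos + d - lag) % len(ring[target])
            ring[target][slot] += w  # placed d - lag segments from the read head

deliver([6, MARK, MARK, MARK])       # buffer received from machine a
assert ring[7][(read_pos + 5 - 3) % 16] == 0.8
```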
the same way. Once all buffers have been processed, the next dmin · h period in the life of the network is simulated, and the cycle begins again. 4 Random Numbers Many networks require a massive amount of random numbers for their simulation and construction. In addition to the usual pitfalls of pseudorandom number generation, this presents extra problems in a distributed environment. The ideal solution should be able to produce sequences of random numbers for each machine such that each sequence is itself uncorrelated, and each sequence is uncorrelated to any other sequence. Furthermore, it should be possible to perform a simulation on a different hardware or with a different number of machines, and obtain identical results. A central random number server is not an appropriate solution for this application, as this would result in a bottleneck due to the sheer volume of numbers required. A partial solution is provided by the use of random number generators (RNGs), which produce independent trajectories for different seeds (Knuth, 1997). Such RNGs are available from the GNU Scientific Library (Galassi, Gough, & Jungman, 2001), which, moreover, ensures platform independence. The solution is completed by the introduction of pseudoprocesses. The number of pseudoprocesses Npp is specified at run time, when the number of machines used m is also known. They are assigned to the machines such that pseudoprocess p is on machine p mod m, and each machine has an equal number of pseudoprocesses, thereby constraining m to be a factor of Npp . Each pseudoprocess is assigned one RNG with a unique seed. Each neuron is assigned to a pseudoprocess such that neuron n is assigned to pseudoprocess n mod Npp , and all the random numbers required for that neuron are drawn from the corresponding RNG. As the algorithm to assign the neurons to the pseudoprocesses depends solely on Npp , identical simulation results will be obtained for any m fulfilling the constraint. 
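A minimal sketch of this assignment scheme shows why the results are independent of m (the seeding and the number of draws per neuron are illustrative assumptions, and Python's generator stands in for the GSL generators):

```python
# Sketch of section 4: neuron n -> pseudoprocess n mod N_pp, pseudoprocess
# p -> machine p mod m. Because the neuron-to-RNG mapping depends only on
# N_pp, the numbers drawn for each neuron are identical for every machine
# count m that divides N_pp. Seeding and draw counts are illustrative.
import random

def simulate(n_neurons, n_pp, m, draws=3):
    """Random numbers drawn per neuron when distributed over m machines."""
    assert n_pp % m == 0, "m must be a factor of N_pp"
    rngs = [random.Random(p) for p in range(n_pp)]  # one seeded RNG per pseudoprocess
    out = {}
    for machine in range(m):
        for n in range(n_neurons):
            p = n % n_pp                 # neuron -> pseudoprocess
            if p % m != machine:         # pseudoprocess -> machine
                continue                 # neuron is hosted elsewhere
            out[n] = [rngs[p].random() for _ in range(draws)]
    return out

# identical simulation input for any admissible number of machines
assert simulate(20, 8, 1) == simulate(20, 8, 2) == simulate(20, 8, 4)
```

Each pseudoprocess lives on exactly one machine, so no RNG stream is ever shared or split across machines.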
In this way, we ensure a fast, safe production of random numbers, independent of both the platform and the number of machines used. 5 Performance We tested the software on four architectures, chosen to reflect the kind of hardware currently available:
- Elderly PC cluster (8 × 2 processors, 100 MBit Ethernet, Intel Pentium 0.8 GHz, 256 kB cache)
- Recent PC cluster (20 × 2 processors, Dolphin/Scali network, Intel Xeon 2.8 GHz, 512 kB cache)
- Compaq GS160 (16 processors, Alpha 0.7 GHz, 8 MB cache)
- Compaq GS1280 (8 processors, Alpha 1.15 GHz, 1.75 MB associative cache)

Figure 7: Scalability of wiring with respect to number of processors: elderly PC cluster, solid line; recent PC cluster, dashed line; GS160, dash-dotted line; GS1280, dotted line. See the text for architecture and simulation details. (A) Wiring time for the 10^4 network against number of processors, log-log representation. The gray line indicates the slope for a linear speed-up. (B) As in A but for the 10^5 network.
The following simulations were performed on the four different architectures for several different numbers of processors:
- 10^4 low-rate network: 10,000 neurons with 1,000 random afferent connections each and an average spike rate of around 2.5 Hz (dynamics as described in Brunel, 2000), simulated for 10 biological seconds.
- 10^4 high-rate network: as above, but with an unrealistically high average spike rate of around 250 Hz.
- 10^5 low-rate network: 100,000 neurons with 10,000 random afferent connections each and an average spike rate of around 2.5 Hz (dynamics as described in Brunel, 2000), simulated for 1 biological second.
These simulations were chosen to demonstrate the scalability of the software with respect to the number of processors for networks of significantly different sizes and activities. In Figure 7 the wiring times for the two different network sizes are plotted against the number of machines used. The double logarithmic representation reveals the exponent of the dependence. In both cases, the wiring time scales linearly with the number of machines. For the smaller network (see Figure 7A), a saturation for large numbers of machines seems to be visible; however, the times measured are close to the resolution of measurement
Figure 8: Scalability of simulation time with respect to number of processors: elderly PC cluster, solid line; recent PC cluster, dashed line; GS160, dash-dotted line; GS1280, dotted line. See the text for architecture and simulation details. (A) Simulation time for the 10^5 network against number of processors, log-log representation. The gray line indicates the slope for a linear speed-up. (B) Corresponding speed-up factor against number of processors. The diagonal (broad gray line) corresponds to linear speed-up.
(1 second). For the larger network (see Figure 7B), no saturation is observed. In fact, no saturation is visible even when using 40 machines of the modern PC cluster (not shown). In contrast to the other architectures, the elderly PC cluster slows when increasing from 8 to 10 machines. This is due to the fact that the two processors on each board share a memory bus. For 8 or fewer machines, the application can be distributed such that only one processor on each board is running it. Above this point, both processors are in use on at least one board, which leads to a significant reduction in efficiency. If the application is distributed such that both processors are in use on all contributing boards, then a supralinear behavior is seen for all numbers of machines, but at the cost of larger absolute run times. In Figure 8A the simulation time of the 10^5 network is plotted against the number of machines used. All tested architectures show a steeper slope than that expected for a linear speed-up (gray line), commonly considered to be the maximum speed-up possible for a distributed application. This surprising result is due to the fact that the amount of fast cache memory available increases linearly with the number of processors. For our nondeterministic algorithm, the exploitation of this memory more than compensates for the memory and communication overheads that accrue as a result of distributing the simulation. In particular, in the range of machines tested, the communication overheads are negligible. Even when simulating the 10^4 high-rate network on the elderly PC cluster (i.e., the maximum amount
Figure 9: Comparison of architectures. Total time to run the three different types of simulation on four different architectures using eight processors. See the text for architecture and simulation details. In each block, the black bar refers to the elderly PC cluster, the dark gray bar to the recent PC cluster, the light gray bar to the GS160, and the white bar to the GS1280. The left block shows the run times of the 10^4 low-rate network, the middle block the run times of the 10^4 high-rate network, and the right block the run times of the 10^5 network. For the 10^5 network, the lower part of each bar shows the construction time of the network and the upper part the simulation time. For the 10^4 networks, the construction time is negligible.
of communication with respect to the number of neurons, with the slowest communication hardware), the time spent communicating amounted to less than 0.5% of the total run time. The corresponding speed-up curves (see Wilkinson & Allen, 2004), that is, how many times faster the application runs with m processors than with 1 processor as a function of m, are plotted in Figure 8B. In the case of the PC clusters, the application is too large to be addressed by one 32-bit processor; therefore, the curves have been normalized appropriately. In this representation, the supralinear scaling is indicated by the fact that all the curves lie above the gray diagonal, which corresponds to a linear speed-up, except for the elderly PC cluster for large m, as explained above. Similar results (not shown) were obtained for the 10^4 networks with low and high spike rates, with the supralinear behavior more pronounced at high rates. At high rates, the efficiency of writing to the event buffers becomes an increasingly crucial factor, so the exploitation of the cache resource plays a much more important role than for low-rate simulations. In the case of the recent PC cluster, we have been able to test up to 40 processors, and even at this high number, no saturation of the speed-up was observed, resulting in total run times for the 10^5 network of less than 2 minutes. In both panels, the reduction in efficiency of the elderly PC cluster caused by two processes competing for memory access is clearly
Figure 10: Scalability with respect to activity and network size. (A) Simulation time (diamonds) on eight processors of the GS1280 for 1 biological second of a 10^4 network plotted against average spike rate λ. The gray line is a linear fit to the data. (B) Simulation time (diamonds) on eight processors of the GS1280 for 1 biological second of low-rate networks (3.9 Hz) against number of neurons, log-log representation. The number of synapses per neuron increases linearly with the number of neurons (i.e., constant connection probability) until a biologically realistic connectivity is reached (at 13 × 10^4 neurons), after which the number of synapses per neuron remains constant. The gray lines are linear fits to the two regimes, with slopes of 1.88 and 1.05, respectively. The lower dashed gray line indicates the expected run-time increase assuming a linear dependence of the run time on the number of neurons; the upper dashed gray line indicates the expected run-time increase assuming a quadratic dependence on the number of neurons.
visible. Again, supralinear behavior is observed for the entire series if the application is redistributed as described above, but at the cost of higher absolute run times. A comparison of the run times for the different types of simulation is given in Figure 9. The total run time for all combinations of simulations and architectures is depicted for the case that eight processors are used. In all cases, the simulation finishes in less than 20 minutes. The recent PC cluster with its high-speed, low-latency network, and rapid clock speed has a clear advantage over both the elderly PC cluster and parallel computer and is at this number of processors comparable to the modern parallel computer. To demonstrate the scalability of the software with respect to network activity and size, we varied one parameter while holding the others constant. Figure 10A shows that the software scales linearly with the average spike rate λ. An excellent scaling behavior is also seen with respect to the number of neurons in the network with constant rate and connection probability (see Figure 10B). It lies between the linear scaling due to the increase in the number of neurons and the quadratic scaling due to the increase in the
Figure 11: Scalability with respect to problem size. The ratio of serial simulation time and parallel simulation time (vertical) is shown as a function of network size (horizontal, bottom) and number of machines (horizontal, top) for the GS1280. The network size is increased from 110,500 to 884,000, keeping the number of neurons per machine and the number of synapses per neuron constant. The gray line represents a linear fit exhibiting a slope of −0.0057.
number of synapses. For increases in network size above 10^5 neurons, a near-linear increase is observed; having reached biological levels of complexity, the total number of synapses in the network increases only linearly. We have discussed how simulation time scales with the number of processors for a fixed network size (see Figure 8) and how the simulation time scales with the network size for a fixed number of machines (see Figure 10B). Another useful performance measure is the scaled speed-up (Wilkinson & Allen, 2004), where the network size is increased linearly with the number of machines. The motivation is that with a larger number of machines available, it should be possible to address a proportionally larger problem in the same time as the original problem on one machine. The scaled speed-up characterizes to what extent this assumption holds. Figure 11 shows that for the number of processors available on the parallel computer tested, the scaled speed-up remains close to 1, indicating excellent scalability of the software and the problem. The small, systematic linear decline of the scaled speed-up presumably originates from the increased absolute amount of memory access and communication load. 6 Discussion We described a scheme for the efficient distributed simulation of large heterogeneous spiking neural networks. In contrast to earlier approaches,
the simulation technology enables the investigation of recurrent networks with levels of connectivity and sparseness characteristic of the mammalian brain. A network of 100,000 neurons permits a biologically realistic number of synapses per neuron (of the order of 10^4) while simultaneously adhering to a biologically realistic constraint on the connection probability between two neurons (of the order of 0.1 in the local volume). In this sense, such a network constitutes a threshold size for realistic simulations of cortical networks. The technology described in this article easily overcomes this threshold on available hardware, requiring wall clock times suitable for routine use. For larger neural systems, the required computer memory and wall clock time scale merely linearly (as opposed to quadratically) with the number of neurons. The neural network to be simulated can be distributed over many computers. Execution time and the computer memory required on an individual machine scale excellently with the number of machines used. As a consequence of the considerations above, larger networks can be simulated, or a reduction in execution time achieved, simply by adding a proportionate number of computers to the system. In addition to the distributed simulation of the dynamics, an important feature of our technology is the parallel generation of the network structure. A serial construction of the network would severely limit the speed-up, as the time required to construct the network can constitute a considerable fraction of the total execution time (see Figure 9). A similar argument holds for the generation of random numbers, which can also easily become the component limiting the speed-up. Consequently, the generation of random numbers is parallelized, and care is taken that simulation results are independent of the number of machines used to carry out the simulation.
The efficiency of the simulation scheme results from exploiting the fact that the interaction of network elements is noninstantaneous and mediated by point events (spikes). The frequency and bulk of the communication between machines are independent of the computation step size (precision of the simulation). The simulation scheme profits from the fact that the fast cache memory increases proportionally with the number of machines, reducing the ratio between the locally required working memory and the locally available cache (Wilkinson & Allen, 2004). Surprisingly, the increase in simulation speed gained more than compensates the overhead due to the communication in a distributed environment. Overall, a supralinear speedup is observed, justifying the use of large clusters of computers. A detailed quantitative investigation of cache effects is outside the scope of this study. However, it is evident from the results presented that when deciding on hardware for a simulation project using our scheme, not only the clock speed of the processors but also the amount of cache memory should be considered. We pointed out in section 3.1 that neither of the textbook simulation schemes, discrete time and event-driven algorithms, is optimal for the
simulation of biological neural networks. Instead, only a carefully adjusted hybrid of both leads to a satisfactory run-time behavior. A similar observation is made with respect to the class design of the software components (cf. the comment on this observation in Gamma, Helm, Johnson, & Vlissides, 1994). Only a well-balanced mixture of objects from the problem domain (neurobiology) and from the machine-oriented domain of parallel algorithms leads to a design that appropriately compromises between the heterogeneity of biological structure, usability of the software, and the efficiency of the simulation on today’s computer hardware. Both observations argue for a pragmatic and undogmatic usage of software design principles and the selection of an implementation language supporting multiple paradigms (Stroustrup, 1994). Ideally, the interface with which the researcher, attempting to implement a new neuron model, is confronted would be expressed in terms of neuroscience concepts and objects, while the objects of the software layer below are optimized for cache exploitation and efficient communication. We have made some first steps in this direction, but further research is required to work out how this approach can consistently be applied to the different components of the simulation scheme. Future work on this topic falls broadly into two categories: optimization and functionality. With respect to the former, the clearly observable cache effects suggest that much could be gained by optimizing the data structures accordingly. We are currently testing various alternative representations of the network structure and update schemes in order to enhance the cache usage, particularly for low numbers of machines. Furthermore, the load balancing currently carried out by the simulation kernel is minimal and relies on the static uniform distribution of network elements and their types over the available machines. 
Thus, efficient use of the hardware resources requires homogeneous computer clusters. While in the context of high-performance computing this does not represent a major constraint, the limitation becomes relevant when more structured networks are investigated, with large differences in the communication load between and within subnetworks. The next step in addressing the problem of load balancing would be to provide user-level control over the mapping of network elements to machines. Finally, the results presented in this article demonstrate that it is possible and efficient to execute our code on computers with multiple processors. However, the use of a communication protocol developed under the constraints of distributed computing (Pacheco, 1997) does not fully exploit the existence of a working memory addressable by all processors. In the framework of the NEST initiative (www.nest-initiative.org), we are developing the simulation technology for computers with multiple processors. The current trend in computer hardware toward clusters of multiprocessor machines, where each machine has a small number of processors with support for multithreading (Butenhof, 1997), makes a hybrid simulation kernel, using multithreading locally and message passing between computers, increasingly interesting.
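The step-size independence of the communication noted at the beginning of this discussion can be sketched in a toy loop (this is illustrative pseudostructure, not the kernel's actual code; all names are invented): because a spike emitted on one machine cannot take effect on a remote neuron earlier than the minimal synaptic delay d_min, spike buffers need to be exchanged only once per d_min interval, regardless of the integration step dt.

```python
# Toy illustration (not the actual kernel code; all names are invented):
# spikes cannot act on a remote machine earlier than the minimal synaptic
# delay d_min, so processes exchange spike buffers only once per d_min
# interval, independently of the integration step dt.

def run(t_sim, dt, d_min):
    steps_per_exchange = int(round(d_min / dt))
    n_steps = int(round(t_sim / dt))
    exchanges = 0
    for step in range(n_steps):
        # ... advance local neuron states by dt (time-driven part) ...
        if (step + 1) % steps_per_exchange == 0:
            # ... collective exchange of buffered spikes (event part) ...
            exchanges += 1
    return exchanges

# Halving dt doubles the number of computation steps but leaves the number
# of communication phases unchanged:
print(run(100.0, dt=0.1, d_min=1.0))   # 100 exchanges
print(run(100.0, dt=0.05, d_min=1.0))  # still 100 exchanges
```

Refining the integration step therefore increases only local computation, not the frequency or bulk of inter-machine communication.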
1798
A. Morrison, C. Mehring, T. Geisel, A. Aertsen, and M. Diesmann
One factor limiting the usability of the software described here is that the protocol for a particular simulation experiment must be specified as a C++ program. In a parallel line of work, we are developing a simulation language interpreter enabling the interactive specification and manipulation of neural systems simulations (Diesmann & Gewaltig, 2002). It remains to be investigated how the distributed simulation kernel can be combined with this interpreter. Other current work (Morrison, Hake, Straube, Plesser, & Diesmann, 2005) focuses on extending the functional range of the technology through the incorporation of precise (off-grid) spike times (see Hansel, Mato, Meunier, & Neltner, 1998; Rotter & Diesmann, 1999; Shelley & Tao, 2001, for discussion) and spike-time-dependent plasticity (Morrison, Aertsen, & Diesmann, 2004) into the simulation scheme. In future projects, structural plasticity and the interaction of spiking neural networks with modulatory chemical subsystems will also be addressed.

Acknowledgments

We acknowledge constructive discussions with Denny Fliegner, Stefan Rotter, Masayoshi Kubo, and the members of the NEST collaboration (in particular Marc-Oliver Gewaltig and Hans Ekkehard Plesser). This work was partially funded by the Volkswagen Foundation, GIF, BIF, DIP F1.2, DAAD 313-PPP-N4-lk, and BMBF Grant 01GQ0420 to BCCN Freiburg. All simulations were carried out using the parallel computing facilities of the Max-Planck-Institute for Fluid Dynamics (now MPI for Dynamics and Self-Organization) in Göttingen and of the Agricultural University of Norway in Ås. Part of the work was carried out when A. M. and M. D. were based at the Max-Planck-Institute for Fluid Dynamics. The initial distributed version of the simulation software was developed by Mehring in the context of Mehring et al. (2003).

References

Abbott, L. F., & Nelson, S. B. (2000). Synaptic plasticity: Taming the beast. Nat. Neurosci., 3(Suppl.), 1178–1183.
Abeles, M. (1982). Local cortical circuits: An electrophysiological study. Berlin: Springer-Verlag.
Abeles, M. (1991). Corticonics: Neural circuits of the cerebral cortex. Cambridge: Cambridge University Press.
Aertsen, A., Gerstein, G., Habib, M., & Palm, G. (1989). Dynamics of neuronal firing correlation: Modulation of "effective connectivity." J. Neurophysiol., 61(5), 900–917.
Aho, A. V., Hopcroft, J. E., & Ullman, J. D. (1983). Data structures and algorithms. Reading, MA: Addison-Wesley.
Aviel, Y., Mehring, C., Abeles, M., & Horn, D. (2003). On embedding synfire chains in a balanced network. Neural Comput., 15(6), 1321–1340.
Bernard, C., Ge, Y. C., Stockley, E., Willis, J. B., & Wheal, H. V. (1994). Synaptic integration of NMDA and non-NMDA receptors in large neuronal network models solved by means of differential equations. Biol. Cybern., 70, 267–273.
Bi, G.-q., & Poo, M.-m. (1998). Activity-induced synaptic modifications in hippocampal culture: Dependence on spike timing, synaptic strength and cell type. J. Neurosci., 18, 10464–10472.
Bi, G., & Poo, M. (2001). Synaptic modification by correlated activity: Hebb's postulate revisited. Annu. Rev. Neurosci., 24, 139–166.
Bower, J. M., & Beeman, D. (1997). The book of GENESIS: Exploring realistic neural models with the GEneral NEural SImulation System (2nd ed.). New York: TELOS, Springer-Verlag.
Braitenberg, V. (1978). Cell assemblies in the cerebral cortex. In R. Heim & G. Palm (Eds.), Theoretical approaches to complex systems. Berlin: Springer.
Braitenberg, V., & Schüz, A. (1998). Cortex: Statistics and geometry of neuronal connectivity (2nd ed.). Berlin: Springer-Verlag.
Brunel, N. (2000). Dynamics of sparsely connected networks of excitatory and inhibitory spiking neurons. J. Comput. Neurosci., 8(3), 183–208.
Butenhof, D. R. (1997). Programming with POSIX threads. Reading, MA: Addison-Wesley.
Chance, F. S., Abbott, L. F., & Reyes, A. D. (2002). Gain modulation from background synaptic input. Neuron, 35, 773–782.
Dayan, P., & Abbott, L. F. (2001). Theoretical neuroscience. Cambridge, MA: MIT Press.
Delorme, A., & Thorpe, S. (2003). SpikeNET: An event-driven simulation package for modeling large networks of spiking neurons. Network: Comput. Neural Systems, 14, 613–627.
Destexhe, A., Rudolph, M., & Paré, D. (2003). The high-conductance state of neocortical neurons in vivo. Nat. Rev. Neurosci., 4, 739–751.
Diesmann, M., & Gewaltig, M.-O. (2002). NEST: An environment for neural systems simulations. In T. Plesser & V. Macho (Eds.), Forschung und wissenschaftliches Rechnen, Beiträge zum Heinz-Billing-Preis 2001 (pp. 43–70). Göttingen: Ges. für Wiss. Datenverarbeitung.
Diesmann, M., Gewaltig, M.-O., & Aertsen, A. (1995). SYNOD: An environment for neural systems simulations: Language interface and tutorial (Tech. Rep. GC-AA-/953). Tel Aviv: Weizmann Institute of Science.
Diesmann, M., Gewaltig, M.-O., & Aertsen, A. (1999). Stable propagation of synchronous spiking in cortical neural networks. Nature, 402, 529–533.
Diesmann, M., Gewaltig, M.-O., Rotter, S., & Aertsen, A. (2001). State space analysis of synchronous spiking in cortical neural networks. Neurocomputing, 38–40, 565–571.
Ermentrout, B. (2002). Simulating, analyzing, and animating dynamical systems: A guide to XPPAUT for researchers and students (software, environments, tools). Philadelphia: Society for Industrial and Applied Mathematics.
Galassi, M., Gough, B., & Jungman, G. (2001). GNU Scientific Library: Reference manual. Bristol, UK: Network Theory Ltd.
Gamma, E., Helm, R., Johnson, R., & Vlissides, J. (1994). Design patterns: Elements of reusable object-oriented software. Reading, MA: Addison-Wesley.
Gerstein, G. L., Bedenbaugh, P., & Aertsen, A. (1989). Neuronal assemblies. IEEE Trans. Biomed. Eng., 36, 4–14.
Gewaltig, M.-O. (2000). Evolution of synchronous spike volleys in cortical networks: Network simulations and continuous probabilistic models. Aachen, Germany: Shaker.
Gross, J., & Yellen, J. (1999). Graph theory and its applications. Boca Raton, FL: CRC Press.
Hansel, D., Mato, G., Meunier, C., & Neltner, L. (1998). On numerical simulations of integrate-and-fire neural networks. Neural Comput., 10(2), 467–483.
Hawkes, A. G. (1971). Point spectra of some mutually exciting point processes. Journal of the Royal Statistical Society (London) B, 33, 438–443.
Hebb, D. O. (1949). Organization of behavior: A neurophysiological theory. New York: Wiley.
Hines, M., & Carnevale, N. T. (1997). The NEURON simulation environment. Neural Comput., 9, 1179–1209.
Jack, J. J. B., Noble, D., & Tsien, R. W. (1983). Electric current flow in excitable cells. New York: Oxford University Press.
Knuth, D. E. (1997). The art of computer programming: Seminumerical algorithms (3rd ed., Vol. 2). Reading, MA: Addison-Wesley.
Koch, C. (1999). Biophysics of computation: Information processing in single neurons. New York: Oxford University Press.
Koch, C., & Segev, I. (1989). Methods in neuronal modeling. Cambridge, MA: MIT Press.
Kuhn, A., Aertsen, A., & Rotter, S. (2003). Higher-order statistics of input ensembles and the response of simple model neurons. Neural Comput., 15(1), 67–101.
Kuhn, A., Aertsen, A., & Rotter, S. (2004). Neuronal integration of synaptic input in the fluctuation-driven regime. J. Neurosci., 24, 2345–2356.
Kumar, A., Kremkow, J., Rotter, S., & Aertsen, A. (2004). Synaptic integration in a 3-compartment model of layer 5 pyramidal neurons. FENS Abstr., 2, A014.27.
Larkum, M., Zhu, J., & Sakmann, B. (2001). Dendritic mechanisms underlying the coupling of the dendritic with the axonal action potential initiation zone of adult rat layer 5 pyramidal neurons. J. Physiol., 533 (pt. 2), 447–466.
Lee, G., & Farhat, N. H. (2001). The double queue method: A numerical method for integrate-and-fire neuron networks. Neural Networks, 14, 921–932.
Markram, H., & Tsodyks, M. (1996). Redistribution of synaptic efficacy between neocortical pyramidal neurons. Nature, 382(6594), 807–810.
Mattia, M., & Del Giudice, P. (2000). Efficient event-driven simulation of large networks of spiking neurons and dynamical synapses. Neural Comput., 12(10), 2305–2329.
Mehring, C., Hehl, U., Kubo, M., Diesmann, M., & Aertsen, A. (2003). Activity dynamics and propagation of synchronous spiking in locally connected random networks. Biol. Cybern., 88(5), 395–408.
Morrison, A., Aertsen, A., & Diesmann, M. (2004). Stability of plastic recurrent networks. Paper presented at the Monte Verita Workshop on Spike-Timing Dependent Plasticity, Monte Verita, Ascona, Switzerland.
Morrison, A., Hake, J., Straube, S., Plesser, H. E., & Diesmann, M. (2005). Precise spike timing with exact subthreshold integration in discrete time network simulations. In Proceedings of the 30th Göttingen Neurobiology Conference, Neuroforum Supplement 1, p. 205B.
Morrison, A., Mehring, C., Diesmann, M., Aertsen, A., & Geisel, T. (2003). Distributed simulation of large biological neural networks. In Proceedings of the 29th Göttingen Neurobiology Conference (p. 590).
Pacheco, P. S. (1997). Parallel programming with MPI. San Francisco: Morgan Kaufmann.
Palm, G. (1990). Cell assemblies as a guideline for brain research. Conc. Neurosci., 1, 133–148.
Reutimann, J., Giugliano, M., & Fusi, S. (2003). Event-driven simulation of spiking neurons with stochastic dynamics. Neural Comput., 15, 811–830.
Rotter, S., & Diesmann, M. (1999). Exact digital simulation of time-invariant linear systems with applications to neuronal modeling. Biol. Cybern., 81(5/6), 381–402.
Salinas, E., & Sejnowski, T. J. (2000). Impact of correlated synaptic input on output firing rate and variability in simple neuronal models. J. Neurosci., 20(16), 6193–6209.
Shadlen, M. N., & Newsome, W. T. (1998). The variable discharge of cortical neurons: Implications for connectivity, computation, and information coding. J. Neurosci., 18(10), 3870–3896.
Shelley, M. J., & Tao, L. (2001). Efficient and accurate time-stepping schemes for integrate-and-fire neuronal networks. J. Comput. Neurosci., 11(2), 111–119.
Singer, W. (1993). Synchronization of cortical activity and its putative role in information processing and learning. Annu. Rev. Physiol., 55, 349–374.
Singer, W. (1999). Time as coding space. Curr. Opin. Neurobiol., 9(2), 189–194.
Stroustrup, B. (1994). The design and evolution of C++. Reading, MA: Addison-Wesley.
Stroustrup, B. (1997). The C++ programming language (3rd ed.). Reading, MA: Addison-Wesley.
Tam, A., & Wang, C. (2000). Efficient scheduling of complete exchange on clusters. In G. Chaudhry & E. Sha (Eds.), 13th International Conference on Parallel and Distributed Computing Systems (PDCS 2000). Cary, NC: International Society for Computers and Their Applications.
Tetzlaff, T., Morrison, A., Geisel, T., & Diesmann, M. (2004). Consequences of realistic network size on the stability of embedded synfire chains. Neurocomputing, 58–60, 117–121.
Thomson, A. M., & Deuchars, J. (1994). Temporal and spatial properties of local circuits in neocortex. TINS, 17, 119–126.
von der Malsburg, C. (1981). The correlation theory of brain function (Internal Rep. 81-2). Göttingen: Max-Planck-Institute for Biophysical Chemistry.
von der Malsburg, C. (1986). Am I thinking assemblies? In G. Palm & A. Aertsen (Eds.), Brain theory (pp. 161–176). Berlin: Springer-Verlag.
Wilkinson, B., & Allen, M. (2004). Parallel programming: Techniques and applications using networked workstations and parallel computers (2nd ed.). Upper Saddle River, NJ: Prentice Hall.
Received July 20, 2004; accepted January 31, 2005.
LETTER
Communicated by Thomas Wennekers
Dynamical Analysis of Continuous Higher-Order Hopfield Networks for Combinatorial Optimization

Miguel Atencia
[email protected] Departamento de Matem´atica Aplicada, ETSI Telecomunicaci´on, Universidad de M´alaga, 29071 M´alaga, Spain
Gonzalo Joya
[email protected] Francisco Sandoval
[email protected] Departamento de Tecnolog´ıa Electr´onica, ETSI Telecomunicaci´on, Universidad de M´alaga, 29071 M´alaga, Spain
In this letter, the ability of higher-order Hopfield networks to solve combinatorial optimization problems is assessed by means of a rigorous analysis of their properties. The stability of the continuous network is almost completely clarified: (1) hyperbolic interior equilibria, which are unfeasible, are unstable; (2) the state cannot escape from the unitary hypercube; and (3) a Lyapunov function exists. Numerical methods used to implement the continuous equation on a computer should be designed with the aim of preserving these favorable properties. The case of nonhyperbolic fixed points, which occur when the Hessian of the target function is the null matrix, requires further study. We prove that these nonhyperbolic interior fixed points are unstable in networks with three neurons and order two. The conjecture that interior equilibria are unstable in the general case is left open.

1 Introduction

The main objective of this letter is the establishment of a solid background for the application of higher-order Hopfield neural networks to the solution of combinatorial optimization problems (COPs). With this aim, several contributions to the theoretical foundation of Hopfield neural networks help to clarify the dynamical behavior of these networks. Despite the importance of combinatorial optimization and intense research in the field (Pardalos & Resende, 2002), classical techniques for the solution of COPs have faced important limitations, mainly due to poor performance when dealing with high-dimensional problems. Hence, the application of the neural network paradigm to the traveling salesman problem (TSP) (Tank & Hopfield, 1985)
Neural Computation 17, 1802–1819 (2005)
© 2005 Massachusetts Institute of Technology
Dynamics of Continuous Hopfield Networks for Optimization
1803
resulted in the establishment of a solid competitor to conventional optimization techniques. However, the practical results achieved after almost two decades of research (see Smith, 1999, for a recent review) have not lived up to the initial enthusiasm aroused by Hopfield networks. Indeed, several authors have severely criticized the irreproducibility of Tank and Hopfield's experimental results (Wilson & Pawley, 1988). Several contributions have attempted to enhance the solution to the TSP by means of an optimal choice of parameters (e.g., Matsuda, 2002). Regrettably, these enhancements are dependent on the particular problem, so the results cannot be extrapolated to a more general setting. Further, since the TSP is solved by first-order networks, attention has seldom been paid to higher-order networks, which can be applied to a wider class of important problems. As a consequence, the general theoretical background for Hopfield networks is still incomplete. This letter seeks to contribute to such a theory so that some insight can be gained into the advantages and drawbacks of generic Hopfield networks.

The fact that results found in the literature are not reproducible is often due to confusion among the different stages of the design of a practical optimization method with higher-order Hopfield networks. We propose that research on Hopfield networks, in the context of combinatorial optimization, should follow this four-step program:

1. Analysis of the continuous network, defined by a differential equation
2. Implementation on a digital computer or as a hardware device
3. Selection of design parameters, related to the particular implementation
4. Heuristics to improve the performance on a particular problem

Plenty of contributions (e.g., Matsuda, 1998; Bharitkar, Tsuchiya, & Takefuji, 1999) deal with the third and fourth items of this list, while results establishing a solid background in full generality are scarce.
Hence, this article focuses on a rigorous study of the continuous higher-order network. Although the second item is beyond the scope of this contribution, its importance should be emphasized, since the implementation influences both convergence and solution quality. Indeed, any hardware device is subject to noise, whereas implementation on a digital computer results in discretization and round-off errors. Yet the study of the continuous model is worthwhile in order to determine which favorable properties should be preserved in the implementation. This theoretical foundation should pave the way for the analysis of noise and discretization, which is left for future research. Interestingly, the notion of defining an ordinary differential equation (ODE) whose long-term dynamics "solves" a problem has already appeared in diverse fields, such as numerical analysis (Stuart & Humphries, 1996) and singular root finding (Riaza & Zufiria, 2002). With
1804
M. Atencia, G. Joya, and F. Sandoval
regard to optimization, a trend has emerged based upon the so-called "ODE approach" to nonlinear programming (Schropp, 1995, 2000). In this sense, our analysis of Hopfield networks could constitute the seed of a promising ODE approach to combinatorial optimization.

The analysis of the continuous network requires clearly stating the definition of the dynamics, as well as the corresponding Lyapunov function, in order to avoid confusing concepts (Joya, Atencia, & Sandoval, 2002). Noticeably, there are different formulations of Hopfield networks, among which we have chosen the Abe formulation (Abe, 1989), since its Lyapunov function has a structure identical to the multilinear target function of usual COPs. In order to prove the main properties of this system, we first distinguish between nonhyperbolic and hyperbolic fixed points. The former are those where the Jacobian of the dynamical equation is singular; their analysis is more difficult, since linearization techniques do not provide information about the nonlinear system at a nonhyperbolic point. The main result of the article is that, when the analysis is restricted to hyperbolic fixed points, stable equilibria are found only among the vertices. This property had already been proved in the particular case of first-order networks with piecewise-linear transfer functions (Abe, 1993), but the procedure used there is not easily generalizable to higher-order networks. The dynamical behavior of the network is summarized by the existence of a Lyapunov function, which implies stability; the fact that the state remains bounded within the unitary hypercube; and the convergence toward vertices. The overall conclusion of these properties is that continuous higher-order Hopfield networks are optimization methods that converge to a feasible solution; thus, they must be regarded as an adequate method for the solution of COPs. Hence, any criticism of their ability to achieve valid results should be redirected to the discretization process.
As for nonhyperbolic fixed points, we claim that they are also unstable, although we are able to prove this fact only for the simplest higher-order network, with three neurons and second order. The conjecture that interior nonhyperbolic equilibria are unstable is left open in the general case.

The letter is structured as follows. Section 2 recalls the main concepts of higher-order continuous Hopfield networks and clarifies the differences between the Hopfield formulation and the Abe formulation; only the latter is used in subsequent sections. In section 3, the techniques of dynamical systems are used to analyze Hopfield networks in the Abe formulation, with no restriction on the network order. The main result is that hyperbolic interior equilibria are unstable. The analysis of nonhyperbolic interior equilibria is undertaken in section 4. The nonhyperbolic case is much more difficult, and the instability of these interior equilibria is proved for a second-order network with three neurons. We claim that this instability also holds in general, and this conjecture is left open. Finally, conclusions and possible directions for future research are provided in section 5.
2 The Abe Model of Continuous Hopfield Networks

In this section, we review the definitions of Hopfield networks that are continuous in both state and time and establish the equations of the model studied in the rest of the article. There are at least two formulations of continuous Hopfield networks, with different dynamical equations and different Lyapunov functions, and there is often some confusion between them. The original formulation by Hopfield (1984) is given by the following ODE:

\[
\frac{du_i}{dt} = -u_i + \mathrm{net}_i; \qquad s_i(t) = \tanh\frac{u_i(t)}{\beta}; \qquad i = 1, \dots, n, \tag{2.1}
\]
where $s_i$ is the output or state of neuron $i$, $u_i$ can be regarded as a sort of internal potential, $\beta > 0$ is a parameter that regulates the slope of the hyperbolic tangent function, $\mathrm{net}_i$ is the multilinear input to neuron $i$, and $n$ is the number of neurons. The multilinear term will be formally defined below. Note that the results in this article are easily generalizable to any function other than the hyperbolic tangent as long as it is a saturated sigmoid (Feng & Brown, 1998), that is, a bounded and strictly increasing function. A different formulation of the Hopfield network, which is better suited to combinatorial optimization, was proposed by Abe (1989):

\[
\frac{du_i}{dt} = \mathrm{net}_i; \qquad s_i(t) = \tanh\frac{u_i(t)}{\beta}; \qquad i = 1, \dots, n. \tag{2.2}
\]
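The difference between the two formulations can be made concrete with a small Euler-integrated sketch (the weights, $\beta$, step size, and initial state below are illustrative choices, and $\mathrm{net}_i$ is taken as the simple linear term defined formally in the next equation): without the $-u_i$ leak term, the Abe potentials grow without bound and the states saturate at a hypercube vertex, whereas the Hopfield states settle strictly inside the hypercube for finite $\beta$.

```python
import numpy as np

# Sketch (illustrative parameters): Euler integration of
#   Hopfield (eq. 2.1): du_i/dt = -u_i + net_i,  s_i = tanh(u_i / beta)
#   Abe      (eq. 2.2): du_i/dt =        net_i,  s_i = tanh(u_i / beta)
# with a linear net_i = sum_j w_ij s_j - b_i for concreteness.
beta, dt = 0.5, 0.01
W = np.array([[0.0, 1.0],
              [1.0, 0.0]])   # symmetric, zero diagonal
b = np.zeros(2)

def simulate(leak, u0, steps=5000):
    u = u0.copy()
    for _ in range(steps):
        s = np.tanh(u / beta)
        u += dt * (W @ s - b - leak * u)  # leak = 1: eq. 2.1; leak = 0: eq. 2.2
    return np.tanh(u / beta)

u0 = np.array([0.3, -0.1])
s_hopfield = simulate(leak=1.0, u0=u0)
s_abe = simulate(leak=0.0, u0=u0)
# The Abe states saturate at a vertex of the hypercube, while the Hopfield
# states converge to an interior equilibrium for this finite beta.
print(s_abe, s_hopfield)
```

This saturation at vertices is precisely what makes the Abe formulation attractive for combinatorial optimization, as developed below.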
In the original formulations by both Hopfield and Abe, the term $\mathrm{net}_i$ was simply the linear (strictly speaking, affine) combination of the states of other neurons,

\[
\mathrm{net}_i = \sum_{j=1}^{n} w_{ij}\, s_j - b_i, \tag{2.3}
\]
where $w_{ij}$ is the weight of the connection from neuron $j$ to neuron $i$ and $b_i$ is the bias of neuron $i$. It is also usual to express equation 2.3 in matrix form: $\mathrm{net} = W s - b$. In order to show the stability of the network, Hopfield assumed symmetry of the weight matrix $W$, that is, $w_{ij} = w_{ji}$, and the absence of self-weights ($w_{ii} = 0$). In view of the limitations of first-order networks, higher-order models were soon proposed (Sejnowski, 1986), in which connections start not only from individual neurons but also from sets of two or more neurons. It had already been argued that further analysis was needed to determine whether the increased performance compensates for the computational cost of
higher-order connections, and this question remains unsolved. This pioneering work focused on Boltzmann machines: networks with binary neurons and stochastic activation. The same rationale leads to the higher-order generalization of continuous Hopfield models (Samad & Harper, 1990; Joya, Atencia, & Sandoval, 1997), whose dynamics is defined by equation 2.1 or 2.2, but now the term $\mathrm{net}_i$ is multilinear: it contains a weighted sum of products of neuron states. In order to formalize this term, a word on notation is needed. For any finite set $A$, let $C_A^q$ represent the set of all combinations of $q$ elements chosen from $A$. Note that the word combination is used with the specialized meaning it has in combinatorics: a choice of $q$ different elements from $A$, not considering their order (Bronshtein, Semendyayev, Musiol, & Muehlig, 2004). Let $\mathbb{N}_n$ be the set of the first $n$ natural numbers. Then the multilinear term $\mathrm{net}_i$ of an $r$th-order network can be expressed as

\[
\mathrm{net}_i = \sum_{q=1}^{r} \;\; \sum_{(i_1, i_2, \dots, i_q) \in C^{q}_{\mathbb{N}_n - \{i\}}} w_{i i_1 i_2 \dots i_q}\, s_{i_1} s_{i_2} \cdots s_{i_q} - b_i, \tag{2.4}
\]
where $w_{i i_1 i_2 \dots i_q}$ is the weight of the $q$th-order connection from neurons $i_1, \dots, i_q$ to neuron $i$. Note that the second summation comprises $\binom{n-1}{q}$ summands and that the definition is meaningful only if $r < n$. When $r = 1$, the original linear input, as defined in equation 2.3, is recovered. In the sequel, we assume a condition that could be termed higher-order symmetry, although the term is misleading because it is no longer related to the symmetry of any matrix: for every two weights $w_{i i_1 i_2 \dots i_q}$ and $w_{j j_1 j_2 \dots j_q}$ of different neurons, if the sets $\{i, i_1, i_2, \dots, i_q\}$ and $\{j, j_1, j_2, \dots, j_q\}$ coincide, then $w_{i i_1 i_2 \dots i_q} = w_{j j_1 j_2 \dots j_q}$. The other condition assumed in first-order networks, namely the absence of self-connections, stems from equation 2.4, since the indexes of neurons that connect to neuron $i$ are chosen from $\mathbb{N}_n - \{i\}$. The justification for these assumptions will become evident below, because of the methodology of application of the Hopfield model to combinatorial optimization problems. The stability of the Hopfield network is proved by defining a suitable Lyapunov function (Joya et al., 1997, 2002). In the case of the Hopfield formulation, given by equations 2.1 and 2.4, the Lyapunov function is defined as

\[
V(s) = -\sum_{q=1}^{r} \;\; \sum_{(i_1, \dots, i_{q+1}) \in C^{q+1}_{\mathbb{N}_n}} w_{i_1 i_2 \dots i_{q+1}}\, s_{i_1} s_{i_2} \cdots s_{i_{q+1}} + \sum_{i=1}^{n} b_i s_i + \beta \sum_{i} \int_0^{s_i} \operatorname{arctanh}(x)\, dx, \tag{2.5}
\]
whereas the Lyapunov function of the Abe formulation, defined by equations 2.2 and 2.4, is identical except for the absence of the integral term:

\[
V(s) = -\sum_{q=1}^{r} \;\; \sum_{(i_1, \dots, i_{q+1}) \in C^{q+1}_{\mathbb{N}_n}} w_{i_1 i_2 \dots i_{q+1}}\, s_{i_1} s_{i_2} \cdots s_{i_{q+1}} + \sum_{i=1}^{n} b_i s_i. \tag{2.6}
\]

In order to prove that equations 2.5 and 2.6 define suitable Lyapunov functions, it will be helpful to observe that for each $q$, the $\binom{n}{q+1}$ combinations can be divided into those that contain a particular index $i$, which are $\binom{n-1}{q}$, and those that do not depend on $s_i$, which are $\binom{n-1}{q+1}$:
\[
\sum_{(i_1, \dots, i_{q+1}) \in C^{q+1}_{\mathbb{N}_n}} w_{i_1 \dots i_{q+1}}\, s_{i_1} \cdots s_{i_{q+1}}
= \sum_{(i_1, \dots, i_q) \in C^{q}_{\mathbb{N}_n - \{i\}}} w_{i i_1 \dots i_q}\, s_i\, s_{i_1} \cdots s_{i_q}
+ \sum_{(i_1, \dots, i_{q+1}) \in C^{q+1}_{\mathbb{N}_n - \{i\}}} w_{i_1 \dots i_{q+1}}\, s_{i_1} \cdots s_{i_{q+1}}. \tag{2.7}
\]
Note that this operation is possible due to the assumptions of higher-order symmetry and the absence of self-weights. Next, we perform the calculations for the Lyapunov function of the Abe formulation, given in equation 2.6; the procedure for the Hopfield formulation is similar. The partial derivatives are

\[
\frac{\partial V}{\partial s_i} = -\sum_{q=1}^{r} \;\; \sum_{(i_1, \dots, i_q) \in C^{q}_{\mathbb{N}_n - \{i\}}} w_{i i_1 \dots i_q}\, s_{i_1} \cdots s_{i_q} + b_i = -\mathrm{net}_i = -\frac{du_i}{dt}, \tag{2.8}
\]
where the last equality results from the dynamical definition of the Abe formulation, equation 2.2. Then the negativity of the time derivative follows:

\[
\frac{dV}{dt} = \sum_{i=1}^{n} \frac{\partial V}{\partial s_i}\frac{ds_i}{dt} = \sum_{i=1}^{n} \frac{\partial V}{\partial s_i}\frac{ds_i}{du_i}\frac{du_i}{dt} = -\sum_{i=1}^{n} \left(\frac{du_i}{dt}\right)^{2} \tanh'(u_i) < 0, \tag{2.9}
\]

since the derivative of the hyperbolic tangent is positive everywhere.
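The derivation above can be checked numerically. The following sketch (with arbitrary random weights, not taken from the letter) builds the multilinear term of eq. 2.4 and the Lyapunov function of eq. 2.6 for a second-order network with three neurons, integrates eq. 2.2 with an Euler step, and verifies that $V$ never increases along the trajectory:

```python
import itertools
import numpy as np

# Numerical check of eqs. 2.2, 2.4, and 2.6 for a second-order (r = 2) network
# with n = 3 neurons.  Weights are arbitrary random numbers (illustrative),
# stored as one coefficient per index set, which encodes higher-order symmetry.
n, r, beta, dt = 3, 2, 1.0, 0.001
rng = np.random.default_rng(0)
w = {c: rng.standard_normal()
     for k in range(2, r + 2)                 # products of 2 and 3 states in V
     for c in itertools.combinations(range(n), k)}
b = rng.standard_normal(n)

def net(s):                                   # eq. 2.4 (no self-weights)
    out = -b.copy()
    for c, wc in w.items():
        for i in c:
            out[i] += wc * np.prod([s[j] for j in c if j != i])
    return out

def V(s):                                     # eq. 2.6
    return -sum(wc * np.prod([s[j] for j in c]) for c, wc in w.items()) + b @ s

u = rng.uniform(-1, 1, n)
values = []
for _ in range(2000):
    s = np.tanh(u / beta)
    values.append(V(s))
    u += dt * net(s)                          # Euler step on eq. 2.2
diffs = np.diff(values)
print(diffs.max())  # <= 0 up to rounding: V is nonincreasing
```

Storing one coefficient per index set is exactly the content of the higher-order symmetry assumption: all weights over the same set of neurons share a single value.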
Typical COPs consist in finding a minimum of a multilinear function, where the solution is constrained to a finite set, usually the set of vertices of the hypercube, so that the problem can be formulated as

\[
\begin{aligned}
\text{Minimize} \quad & E(s) = -\sum_{q=1}^{r} \;\; \sum_{(i_1, \dots, i_{q+1}) \in C^{q+1}_{\mathbb{N}_n}} w_{i_1 \dots i_{q+1}}\, s_{i_1} \cdots s_{i_{q+1}} + \sum_{i=1}^{n} b_i s_i \\
\text{subject to} \quad & |s_i| = 1, \qquad i = 1, \dots, n.
\end{aligned} \tag{2.10}
\]
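To make the formulation of eq. 2.10 concrete, the following sketch states a tiny instance (coefficients chosen arbitrarily for illustration) and finds its constrained minimizer by brute-force enumeration of the vertices; the network is intended to reach such a minimizer through its continuous dynamics instead of by enumeration:

```python
import itertools

# Tiny instance of eq. 2.10 (coefficients chosen arbitrarily for illustration):
# E(s) = -1.5 s1 s2 s3 + s1 s2 + 0.5 s1,  with each s_i in {-1, +1}.
w = {(0, 1, 2): 1.5, (0, 1): -1.0}   # one coefficient per index combination
b = [0.5, 0.0, 0.0]

def E(s):
    prod = lambda c: s[c[0]] * s[c[1]] * (s[c[2]] if len(c) == 3 else 1)
    return -sum(wc * prod(c) for c, wc in w.items()) \
           + sum(bi * si for bi, si in zip(b, s))

# Brute-force enumeration of the 2**3 vertices locates the constrained minimum.
best = min(itertools.product([-1, 1], repeat=3), key=E)
print(best, E(best))  # (-1, 1, -1) with E = -3.0
```

Exhaustive enumeration is of course feasible only for toy sizes; the point of the Hopfield approach is to search the same vertex landscape by gradient-like descent of $V = E$.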
The application of the Hopfield model to combinatorial optimization (Tank & Hopfield, 1985) proceeds by matching the Lyapunov function of the network, $V$, to the target function $E$. This procedure justifies the conditions on the weights assumed above: higher-order symmetry and the absence of self-weights. With respect to the former, two weights with the same subindexes coincide because they represent the same coefficient of the target function, that is, the coefficient of a product that appears only once. The latter stems from the constraint $|s_i| = 1$, which implies $s_i^k = 1$ if $k$ is even and $s_i^k = s_i$ if $k$ is odd, so that no variable is raised to a power greater than 1 in the target function. A glance at equation 2.5 reveals a severe drawback of Hopfield's formulation: its Lyapunov function can match the target function only if the integral term disappears, so $\beta$ must be forced toward zero. Although some strategies have been studied with this aim (Vidyasagar, 1995; Joya et al., 2002), their convergence is slow, which penalizes the network performance. Comparison of equations 2.5 and 2.6 shows that the integral term is not present in the Abe model, so its Lyapunov function can exactly match a multilinear target. This fact explains why many practical applications make use of the Abe network, although this is not always explicitly stated. Besides, theoretical studies of the Abe dynamics are scarce compared to the abundance of recent contributions on the Hopfield formulation (see, e.g., Chen & Amari, 2001; Juang, 1999). We focus on the Abe formulation of Hopfield networks, given by equation 2.2. In this field, several applications to optimization have been reported; in particular, the research by Takefuji and colleagues (Takefuji & Lee, 1991; Bharitkar et al., 1999) is worth mentioning. However, the proposed solutions are problem dependent, and implementation issues are solved by trial and error.
Most theoretical studies are restricted to first-order networks (Abe, 1993) or approach the discretization process without a previous rigorous analysis of the continuous model (Matsuda, 1998). Thus, we feel that these systems deserve a deep theoretical investigation that permits their application to general optimization problems.
3 Stability

In this section, we prove one of the main contributions of the article: under the assumption of hyperbolicity, the Abe formulation possesses no stable fixed points apart from the vertices. The same result restricted to first-order networks, already known, is obtained as a corollary. The techniques used, linearization and local eigenvalue analysis, are standard tools in the study of ODEs, but they are not directly applicable to equation 2.2. Indeed, this equation is, strictly speaking, a differential algebraic equation (Ciarlet & Lions, 2002) rather than an ODE, since the two sets of variables $s_i$ and $u_i$ are related by the nondifferential equation $s_i = \tanh(u_i / \beta)$. Hence, we must first reformulate equation 2.2 as an ODE in the single set of variables $s_i$:

\[
\frac{ds_i}{dt} = \frac{ds_i}{du_i}\frac{du_i}{dt} = \frac{1}{\beta}\left(1 - s_i^2\right)\mathrm{net}_i = f_i(s). \tag{3.1}
\]
In order to preserve the equivalence to equation 2.2, the initial conditions $|s_i(t = 0)| < 1$ are required. One may wonder whether the direct substitution of the $s_i$ variables in equations 2.2 and 2.4 is more natural, leading to the following ODE:

\[
\frac{du_i}{dt} = \sum_{q=1}^{r} \;\; \sum_{(i_1, \dots, i_q) \in C^{q}_{\mathbb{N}_n - \{i\}}} w_{i i_1 i_2 \dots i_q} \tanh\frac{u_{i_1}}{\beta} \tanh\frac{u_{i_2}}{\beta} \cdots \tanh\frac{u_{i_q}}{\beta} - b_i. \tag{3.2}
\]
However, adopting equation 3.2 as the dynamical definition of the network has a significant drawback: the feasible solutions to the problem are no longer fixed points of the dynamics. In fact, the vertices $|s_i| = 1$ are mapped, in the $u_i$ variables, to points at infinity. Although equation 3.2 can be used to study interior fixed points, the main aim of this section, the use of two different dynamical definitions for the various fixed points is not particularly appealing, whereas equation 3.1 unifies the definition, the only cost being attention to the initial values. On the other hand, note that the fact that the states remain bounded within the hypercube is not evident in equation 3.1, whereas it was in the differential algebraic formulation, equation 2.2, since the range of the hyperbolic tangent function is $(-1, 1)$. Although irrelevant to the study of the continuous network, it is worth mentioning that this invariance of the hypercube could be destroyed when the network is implemented on a digital computer, due to round-off or discretization errors. Hence, a careful numerical implementation of equation 3.1 should preserve this property.
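The warning about discretization can be made concrete with a small experiment (a first-order network with two neurons; weights, $\beta$, and step sizes are illustrative): a single coarse explicit-Euler step applied to eq. 3.1 can already leave the hypercube, whereas sufficiently small steps preserve the invariance.

```python
import numpy as np

# Sketch: the invariance |s_i| < 1 of eq. 3.1 holds in continuous time, but a
# coarse Euler step can destroy it (first-order net, illustrative parameters).
beta, w12 = 0.1, 2.0

def step(s, dt):
    net = np.array([w12 * s[1], w12 * s[0]])  # symmetric, no self-weights
    return s + dt * (1 - s**2) * net / beta   # explicit Euler on eq. 3.1

s0 = np.array([0.9, 0.9])
s_coarse = step(s0, dt=0.5)       # one coarse step overshoots the vertex
s_fine = s0.copy()
for _ in range(100):
    s_fine = step(s_fine, dt=0.001)  # fine steps stay inside the hypercube
print(np.abs(s_coarse).max(), np.abs(s_fine).max())
```

Once $|s_i| > 1$, the factor $(1 - s_i^2)$ changes sign and the discrete dynamics no longer resembles the continuous network, which is why the implementation must guard this invariant.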
1810
M. Atencia, G. Joya, and F. Sandoval
The existence of a Lyapunov function implies that the long-term behavior of the system is simple (Hirsch & Smale, 1974): for any initial state, the system asymptotically approaches a point that belongs to the set of fixed points. In particular, there are no periodic solutions. The fixed points of this system can be classified into three classes:
- The vertices of the unit hypercube: s ∈ {−1, 1}^n.
- The interior fixed points, which lead to net = 0.
- The points such that |si| = 1 ∀ i ∈ I and neti = 0 ∀ i ∉ I, for some subset I ⊆ {1, . . . , n}. These points lie on sides of the hypercube.
Since combinatorial optimization is intended, valid results should be obtained at the vertices. If equilibria within the hypercube exist, they should be unstable in order to guarantee that trajectories approach a vertex. This is the main result that we prove: interior equilibria are unstable. In order to prove this, we must first calculate the Jacobian of the function f defined in equation 3.1. Since there are no self-weights, the condition ∂neti/∂si = 0 holds, so the Jacobian results in

J(s) = [∂fi/∂sj]_{ij};   ∂fi/∂si = −(2/β) si neti;   ∂fi/∂sj = (1/β)(1 − si²) ∂neti/∂sj,  j ≠ i.   (3.3)
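Equation 3.3 can be sanity-checked against a finite-difference Jacobian. The sketch below is an assumption for illustration (a first-order network, where ∂neti/∂sj = wij; the weights, biases, and evaluation point are invented):

```python
import numpy as np

beta = 1.0
W = np.array([[0.0, 0.8, -0.3],
              [0.8, 0.0, 0.5],
              [-0.3, 0.5, 0.0]])   # symmetric, zero diagonal (no self-weights)
b = np.array([0.1, -0.2, 0.05])

def f(s):
    # Equation 3.1: f_i(s) = (1/beta)(1 - s_i^2) net_i
    net = W @ s - b
    return (1.0 - s**2) * net / beta

def jacobian_analytic(s):
    net = W @ s - b
    J = (1.0 - s**2)[:, None] * W / beta         # off-diagonal terms of eq. 3.3
    np.fill_diagonal(J, -2.0 * s * net / beta)   # diagonal terms of eq. 3.3
    return J

def jacobian_numeric(s, h=1e-6):
    # Central differences, column by column
    n = len(s)
    J = np.zeros((n, n))
    for j in range(n):
        e = np.zeros(n); e[j] = h
        J[:, j] = (f(s + e) - f(s - e)) / (2 * h)
    return J

s = np.array([0.2, -0.4, 0.6])
print(np.allclose(jacobian_analytic(s), jacobian_numeric(s), atol=1e-6))  # True
```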
Next we prove an auxiliary technical lemma:

Lemma 1. The Jacobian J of f at an interior fixed point is a matrix with zero diagonal. Further, it is similar to a symmetrical matrix; that is, there exist a nonsingular matrix A and a symmetrical matrix B such that B = A⁻¹ J A.

Proof. Let ξ be an interior fixed point. Then, since ξ is interior, |ξi| < 1 ∀ i holds. Consequently, since ξ is a fixed point (f(ξ) = 0), neti = 0 ∀ i results from the definition of f in equation 3.1. Substituting the condition neti = 0 in equation 3.3 leads to J(ξ)ii = ∂fi/∂si|s=ξ = 0; hence, the Jacobian is a matrix with zero diagonal. In order to prove the second assertion of the lemma, define the two matrices D(s), H(s):

D(s) = (1/β) diag(1 − si²);   H(s) = [∂neti/∂sj]_{ij}.   (3.4)
Thus, from equation 3.3, since the diagonal vanishes, the Jacobian at ξ can be expressed as the product J(ξ) = D(ξ) H(ξ). Since all the diagonal elements of D(ξ) are strictly positive as long as ξ is an interior point, the matrix D(ξ)^{1/2} exists, as well as its inverse.

Dynamics of Continuous Hopfield Networks for Optimization

1811

So the following similarity transformation (Demmel, 2000) is possible:

J(ξ) ∼ D^{−1/2} J(ξ) D^{1/2} = D^{−1/2} D H D^{1/2} = D^{1/2} H D^{1/2} = A.   (3.5)
Note that the reference to the vector ξ is omitted for brevity. As we observe in the last equality, the final matrix A is symmetrical if H is symmetrical. Hence, the proof is concluded by proving that H is symmetrical, which stems from the fact that the Lyapunov function V defined in equation 2.6 fulfills ∂V/∂si = −neti, so that

h_ij = ∂neti/∂sj = −∂²V/(∂si ∂sj),   (3.6)
and the symmetry of H stems from the interchangeability of the order of partial differentiation.

Remarkably, as observed in equation 3.6, the matrix H(s) defined in the proof of lemma 1 is the Hessian of the target function V(s) at a point s, with changed sign. This matrix will have a significant role in the elucidation of the dynamics of the network. Now the main theorem of this section, regarding the instability of interior fixed points, is stated and proved:

Theorem 1. Let V be the target function of an optimization problem, and let equation 2.2 define the dynamics of the corresponding neural network. Given an interior fixed point ξ, if the Hessian of V(s) at ξ is not the zero matrix, then ξ is unstable.

Proof. From the previous lemma, the Jacobian evaluated at ξ is a matrix with zero diagonal. Recall from linear algebra (Golub & van Loan, 1996) that for any matrix, the sum of its eigenvalues coincides with its trace. Hence, the sum of the eigenvalues of the Jacobian is zero, and either there is at least one positive eigenvalue or all eigenvalues are zero. In the former case, the point ξ is unstable, and the proof is finished. But the latter case leads to a contradiction. It is well known (Horn & Johnson, 1985) that every symmetrical matrix is diagonalizable and all its eigenvalues are real. Since J(ξ) is similar to a symmetrical matrix, it is also similar to a real diagonal matrix whose diagonal elements are its eigenvalues. If all eigenvalues of J(ξ) are null, that diagonal matrix is the zero matrix, and, by similarity, J(ξ) is also zero. In turn, since J(ξ) can be expressed as the product J(ξ) = D(ξ) H(ξ) and D(ξ) is nonsingular at interior points, H(ξ) also vanishes. Finally, observe in the proof of lemma 1 that the Hessian of V at ξ is the matrix −H(ξ), which, by hypothesis, is nonzero, and a contradiction results.
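Lemma 1 and the trace argument of theorem 1 can be illustrated with random data standing in for an interior fixed point; the matrices below are assumptions, not derived from any particular network.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5

# H plays the role of H(xi), the Hessian of V with changed sign:
# symmetric, with zero diagonal because there are no self-weights.
H = rng.standard_normal((n, n))
H = (H + H.T) / 2.0
np.fill_diagonal(H, 0.0)

# D plays the role of D(xi): diagonal with strictly positive entries.
D = np.diag(rng.uniform(0.1, 1.0, size=n))

J = D @ H   # Jacobian at the interior fixed point, J(xi) = D(xi) H(xi)

# Equation 3.5: A = D^{1/2} H D^{1/2} is symmetric and similar to J
Dh = np.sqrt(D)
A = Dh @ H @ Dh
print(np.allclose(A, A.T))           # True: A is symmetrical

eig_J = np.sort(np.linalg.eigvals(J).real)
eig_A = np.sort(np.linalg.eigvalsh(A))
print(np.allclose(eig_J, eig_A))     # True: J shares the real spectrum of A
print(abs(np.trace(J)) < 1e-12)      # zero diagonal => eigenvalues sum to zero
print(eig_J[-1] > 0)                 # hence at least one positive eigenvalue
```

Since H is nonzero and the eigenvalues are real with zero sum, the largest one is positive, which is exactly the instability mechanism used in the proof.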
The condition on the Hessian in theorem 1 is not easy to verify for a particular higher-order Hopfield network, since this would imply finding the fixed point by solving a nonlinear system of equations. Rather, the theorem is intended to serve as a confirmation of the observed fact that, in general, stable interior equilibria are not found in practical applications. Besides, numerical simulations of several networks suggest the conjecture that even the nonnull Hessian condition is not necessary for instability. Observe that the condition H(ξ) = 0 implies that the Jacobian of f vanishes at ξ; hence, ξ is referred to as a nonhyperbolic fixed point in the context of dynamical systems. It is known that the analysis of nonhyperbolic equilibria is a rather difficult task since, in contrast to the hyperbolic case, linearization techniques do not provide information on stability. Thus, more advanced stability techniques are used in section 4 in an attempt to refine the theorem and omit the condition H(ξ) ≠ 0. When the range of studied neural networks is restricted to first-order networks, the fact that interior fixed points are unstable has already been proved by an involved procedure (Abe, 1993). Remarkably, the technique Abe used is not extendable to higher-order networks. By contrast, in our analysis, this fact results as a simple corollary of theorem 1:

Corollary 1. Consider a first-order network whose Lyapunov function is nonconstant. Then fixed points within the hypercube are unstable.

Proof. If the network is first order, the target function is quadratic; hence, its Hessian is constant, and it coincides with the weight matrix W. Now suppose ξ is a stable interior fixed point. Then, by theorem 1, the Hessian of V(s) at ξ is the zero matrix, but since the Hessian is constant, it is identically zero. Hence, W is the zero matrix and the network is unconnected.
Furthermore, a null Hessian implies a constant gradient, and since ξ is a fixed point, ∇V(ξ) = 0; hence, the gradient of V is identically zero. Thus, V is constant, and the optimization problem is trivial.

Next we analyze the fixed points on sides of the hypercube, given by |si| = 1 ∀ i ∈ I and neti = 0 ∀ i ∉ I, for some subset I ⊆ {1, . . . , n}. These points can be regarded as interior points within the side they lie on, which is a hypercube of reduced dimensionality:

Lemma 2. Consider a fixed point ξ that fulfills |ξi| = 1 ∀ i ∈ I ⊆ {1, . . . , n}. If the Hessian of V(s) at ξ is nonsingular, then ξ is unstable.

Proof. The fixed point ξ lies on a side S of the hypercube, defined by S = {x | xi = ξi ∀ i ∈ I}. We claim that a contradiction results from assuming that ξ is stable. Indeed, should this occur, ξ would be an interior stable fixed point of the network restricted to the hypercube S of reduced dimensionality. Apply theorem 1 restricted to the hypercube S to conclude that a principal submatrix of H(ξ) is the zero matrix, concretely h_ij = 0 ∀ i, j ∉ I. Also, from
equation 3.3, h_ij = 0 ∀ i ∈ I, ∀ j ∉ I, since the rows i ∈ I correspond to |ξi| = 1. Thus, H(ξ) has at least one zero column, and hence it is singular, which contradicts the hypothesis, and the instability of ξ results.

The remark following theorem 1 is also applicable here: the condition of nonsingularity of the Hessian is not easily provable, but it seems to hold in reported applications. Besides, the known result (Abe, 1993) for first-order networks is here a simple corollary:

Corollary 2. Consider a first-order network with a stable fixed point ξ that lies on a side S of the hypercube. Then V(s) has the same value at any s ∈ S.

Proof. Repeat the proof of corollary 1, restricted to the hypercube of reduced dimensionality S.

From the point of view of optimization, the previous corollary leads to a practical procedure: if the network state approaches a side S where the points s ∈ S fulfill |si| = 1 ∀ i ∈ I, then the variables si with i ∉ I can be set to any value without affecting the solution. In particular, they can be rounded so as to attain a vertex, which is a feasible solution, instead of the approached interior (hence unfeasible) fixed point. Simply put, the results in this section state that continuous Hopfield networks in the Abe formulation are well suited to combinatorial optimization problems, since stable equilibria occur only at vertices. The corollaries on first-order networks are similar to the results in Abe (1993), but Abe employs the simpler piecewise-linear functions, and his techniques cannot be extended to higher-order networks. Therefore, to the best of our knowledge, the statement of these results in the general, higher-order setting is novel. Also, these results critically depend on the absence of self-weights, which implies ∂neti/∂si = 0; hence, the diagonal of the Jacobian vanishes.
Some experimental results (Abe & Gee, 1995; Abe, 1996) suggest that the network performs better when self-weights are allowed, but the algorithm then has to deal with possible stable interior equilibria. The techniques used are applied only in the case of first-order networks and have no general applicability due to their problem-dependent and empirical nature. Interestingly, some applications that do not fall into the frame of combinatorial optimization, but into continuous optimization (Atencia, Joya, & Sandoval, 2003a), require the existence of self-weights, since interior solutions are then intended. Finally, for completeness, we characterize stable vertices:

Lemma 3. A vertex ξ is asymptotically stable if ξi neti|s=ξ > 0 ∀ i.

Proof. The Jacobian evaluated at a vertex is directly obtained by substituting the vertex condition |ξi| = 1 ∀ i in equation 3.3, resulting in the diagonal matrix diag(−(2/β) ξi neti|s=ξ), whose eigenvalues are its diagonal elements.
Thus, all the eigenvalues are negative if ξi neti|s=ξ > 0 ∀ i. Hence, the vertex is asymptotically stable.

To summarize, since nonvertex fixed points are unstable and a Lyapunov function exists, the system converges to a stable vertex. Therefore, the network performs as intended, and there exist no interior solutions, which would be unfeasible.

4 Nonhyperbolic Interior Equilibria

In this section, we aim at acquiring further knowledge of the behavior of the network around a nonhyperbolic interior fixed point. As we have proved in theorem 1, this case requires the Hessian of the target function to be null at the fixed point. We have also observed that this can occur only in higher-order networks. The lack of experience in the application of the Abe formulation to higher-order problems is the reason for the absence of results on this subject. We conjecture that in higher-order neural networks, every nonhyperbolic fixed point is unstable. However, there is no evident path for the proof of this conjecture in maximum generality, so we restrict our study to the simplest nontrivial higher-order network, with three neurons and second order. In doing so, we are inspired by fruitful contributions that analyze the stability of particularly simple networks (e.g., Tiňo, Horne, & Giles, 2001). Even the simplified network with three neurons exhibits complex dynamics, since the Jacobian has a zero eigenvalue with multiplicity three. In fact, the analysis of these triple-zero degeneracies is an active line of research (e.g., Algaba, Merino, Freire, Gamero, & Rodriguez-Luis, 2003) among the dynamical systems community. First, we prove that such an interior nonhyperbolic fixed point may indeed occur, and we establish sufficient conditions for its existence.
Then the instability of the interior equilibrium is shown by means of an ad hoc simplification, since it is well known that classical linearization techniques fail to reveal the behavior of a dynamical system near a nonhyperbolic fixed point. Interestingly, the result requires the higher-order symmetry of the weights, as we have been assuming throughout the article. Recall that the ODE of a generic neural network, given in equation 3.1, is dsi/dt = (1/β)(1 − si²) neti. Then, in the most general form, the dynamical equations of the three-neuron network are:

ds1/dt = (1/β)(1 − s1²)(w12 s2 + w13 s3 + w123 s2 s3 − b1)
ds2/dt = (1/β)(1 − s2²)(w12 s1 + w23 s3 + w123 s1 s3 − b2)
ds3/dt = (1/β)(1 − s3²)(w13 s1 + w23 s2 + w123 s1 s2 − b3).   (4.1)
Assume an interior fixed point ξ, characterized by neti = 0 ∀ i, exists. Assume further that the Hessian of the corresponding target function vanishes at ξ. Recall from equation 3.3 that, at an interior fixed point, the Jacobian of the function f = ds/dt is the product of a positive diagonal matrix and the matrix H, the Hessian of V with changed sign. Since ξ is within the hypercube (|ξi| < 1), the vanishing of the Hessian requires ∂neti/∂sj = 0 ∀ i ≠ j, which results in

ξ1 = −w23/w123,   ξ2 = −w13/w123,   ξ3 = −w12/w123.   (4.2)
Since this algebraic calculation is valid as long as w123 ≠ 0, we conclude that this nonhyperbolic point exists and is unique. Substituting equation 4.2 into the condition neti = 0 provides a characterization of the network biases:

b1 = −w12 w13/w123,   b2 = −w12 w23/w123,   b3 = −w13 w23/w123.   (4.3)
These calculations are summarized in the following lemma:

Lemma 4. Consider a neural network with three neurons, and assume its order is strictly two, that is, w123 ≠ 0. Assume also that |wij| < |w123| ∀ i, j and that the conditions of equation 4.3 hold. Then an interior fixed point ξ exists, as defined in equation 4.2. Further, the Hessian of the target function vanishes at ξ, and this is the only nonhyperbolic interior fixed point.

Proof. From equation 4.2, the condition |wij| < |w123| implies |ξi| < 1, so ξ is within the unit hypercube. Direct substitution of equations 4.3 and 4.2 shows both that ξ is a fixed point and that ∂neti/∂sj = 0 holds, so the Hessian of the target function is null. Since the simultaneous equations neti = 0 and ∂neti/∂sj = 0 have no other solution, ξ is the only nonhyperbolic interior fixed point.

Next, we prove the instability of the interior fixed point whose existence is established in lemma 4. For simplicity, we translate the fixed point to the origin with the linear transformation x = s − ξ. Hence, the dynamics, equation 4.1, reduces to:

dx1/dt = (1/β)(1 − (x1 + ξ1)²) w123 x2 x3
dx2/dt = (1/β)(1 − (x2 + ξ2)²) w123 x1 x3
dx3/dt = (1/β)(1 − (x3 + ξ3)²) w123 x1 x2.   (4.4)
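The construction of lemma 4 is easy to verify numerically. In the sketch below, the second-order weights and β = 1 are illustrative assumptions; the biases are computed directly from the fixed-point condition neti = 0, and a small perturbation of ξ is integrated under equation 4.1 to exhibit the instability of the nonhyperbolic point.

```python
import numpy as np

beta = 1.0
w12, w13, w23, w123 = 0.3, -0.4, 0.5, 1.0   # illustrative, with |wij| < |w123|

# Equation 4.2: the candidate nonhyperbolic interior fixed point
xi = np.array([-w23 / w123, -w13 / w123, -w12 / w123])

# Biases chosen so that net_i(xi) = 0 (the characterization of equation 4.3)
b1 = w12 * xi[1] + w13 * xi[2] + w123 * xi[1] * xi[2]
b2 = w12 * xi[0] + w23 * xi[2] + w123 * xi[0] * xi[2]
b3 = w13 * xi[0] + w23 * xi[1] + w123 * xi[0] * xi[1]

def net(s):
    return np.array([w12*s[1] + w13*s[2] + w123*s[1]*s[2] - b1,
                     w12*s[0] + w23*s[2] + w123*s[0]*s[2] - b2,
                     w13*s[0] + w23*s[1] + w123*s[0]*s[1] - b3])

print(np.allclose(net(xi), 0.0))    # True: xi is a fixed point
# Off-diagonal derivatives of net vanish at xi (null Hessian), e.g. w12 + w123*xi[2] = 0:
print(np.allclose([w12 + w123*xi[2], w13 + w123*xi[1], w23 + w123*xi[0]], 0.0))

# Integrate equation 4.1 (forward Euler) from a small perturbation of xi
s = xi + 0.01
dt = 0.01
for _ in range(50000):
    s = s + dt * (1.0 - s**2) * net(s) / beta

print(np.linalg.norm(s - xi))    # large: the trajectory has escaped from xi
print(np.max(np.abs(s)) < 1.0)   # True: while remaining inside the hypercube
```

With this perturbation all three coordinates grow together, so the trajectory leaves the neighborhood of ξ and heads toward a vertex, in agreement with the instability proved below.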
Consider now the lowest-order terms, which completely determine the dynamical behavior of the network in a neighborhood of the origin:

dx1/dt ≈ (1/β)(1 − ξ1²) w123 x2 x3
dx2/dt ≈ (1/β)(1 − ξ2²) w123 x1 x3
dx3/dt ≈ (1/β)(1 − ξ3²) w123 x1 x2.   (4.5)

Incidentally, notice that this is nothing but the normal form on the center manifold (Guckenheimer & Holmes, 1997). Then, differentiating the first equation in equation 4.5 and substituting the other two results in:

d²x1/dt² = (w123/β)(1 − ξ1²) (x3 dx2/dt + x2 dx3/dt)
         = (w123/β)(1 − ξ1²) [(1/β)(1 − ξ2²) w123 x1 x3² + (1/β)(1 − ξ3²) w123 x1 x2²]
         = (w123²/β²) [(1 − ξ1²)(1 − ξ2²) x3² + (1 − ξ1²)(1 − ξ3²) x2²] x1.   (4.6)
Since the whole expression between brackets in the last line of equation 4.6 is positive, we conclude that the behavior of the equation in a small neighborhood of the origin is qualitatively similar to that of the simpler equation

d²x1/dt² = k x1,   (4.7)

where k is a positive constant. The general solution of equation 4.7 is

x1 = C1 e^{√k t} + C2 e^{−√k t},   (4.8)
which reveals that x1 is unstable at the origin. The same procedure can be applied to x2 and x3; hence, the second-order network with three neurons is unstable at interior points. Numerical simulations suggest that this also occurs in a more general setting, but a rigorous proof of this open conjecture is left for further research.

5 Conclusions and Future Directions

In this letter, a thorough study of continuous higher-order Hopfield neural networks has been performed, with the aim of determining their ability
to solve combinatorial optimization problems. The main characteristics of the continuous network are brought to light. The results are that the state is bounded by the unit hypercube and that hyperbolic interior equilibria, which are unfeasible points of the optimization problem, are unstable. These results had already been proved for first-order networks (Abe, 1993), by a methodology that does not trivially extend to higher-order networks. In this sense, this letter can be regarded as the extension to higher-order networks of known results, but our proofs are less involved. In view of these facts and the existence of a Lyapunov function, the continuous higher-order network can be regarded as an adequate generic technique for combinatorial optimization. On the other hand, nonhyperbolic equilibria are left out of the general picture, since their analysis faces additional difficulties. Numerical simulations support the conjecture that these points are unstable in general. Indeed, they deserve further research for at least two reasons apart from their theoretical interest:
- When the continuous system is discretized, complex bifurcations could branch from nonhyperbolic equilibria, leading to cycles or even strange attractors.
- Since hyperbolic interior equilibria are saddle points, their unstable manifolds separate the basins of attraction of stable vertices (Atencia, Joya, & Sandoval, 2003b). This analysis could be extended to nonhyperbolic fixed points.
The analysis of nonhyperbolic interior equilibria is accomplished in the case of second-order networks with three neurons, establishing the conditions for their existence and proving their instability. The results presented in the article provide some insight into the dynamical behavior of Hopfield networks in the Abe formulation and their relation to the practical solution of combinatorial optimization problems. In particular, the instability of interior points forces the network to provide a feasible point. Two important questions are left for further research: (1) the existence of local minima among feasible points and the size of the corresponding basins of attraction, when compared to those of global minima, and (2) the design of numerical methods that implement the continuous equation on a digital computer while preserving those favorable properties.

Acknowledgments

This work has been partially supported by Project No. TIC2001-1758, funded by the Spanish Ministerio de Ciencia y Tecnología and FEDER funds. Thanks are due to F. R. Villatoro, P. Guerrero, and R. Riaza for pointing out some useful references. The comments of the reviewers and editor have greatly contributed to enhancing the letter.
1818
M. Atencia, G. Joya, and F. Sandoval
References

Abe, S. (1989). Theories on the Hopfield neural networks. In Proc. IEEE International Joint Conference on Neural Networks (Vol. 1, pp. 557–564). Washington, DC: IEEE.
Abe, S. (1993). Global convergence and suppression of spurious states of the Hopfield neural networks. IEEE Trans. on Circuits and Systems–I, 40(4), 246–257.
Abe, S. (1996). Convergence acceleration of the Hopfield neural network by optimizing integration step sizes. IEEE Trans. on Systems, Man and Cybernetics B, 26(1), 194–201.
Abe, S., & Gee, A. H. (1995). Global convergence of the Hopfield neural network with nonzero diagonal elements. IEEE Trans. on Circuits and Systems–II, 42(1), 39–45.
Algaba, A., Merino, M., Freire, E., Gamero, E., & Rodríguez-Luis, A. (2003). Some results on Chua's equation near a triple-zero linear degeneracy. International Journal of Bifurcation and Chaos, 13(3), 583–608.
Atencia, M. A., Joya, G., & Sandoval, F. (2003a). Hopfield neural networks for parametric identification of a robotic system. In Proc. Engineering Applications of Neural Networks Conference (EANN 2003) (pp. 311–318). Málaga, Spain: Departamento de Ingeniería de Sistemas y Automática, Universidad de Málaga.
Atencia, M. A., Joya, G., & Sandoval, F. (2003b). Spurious minima and basins of attraction in higher-order Hopfield networks. In J. Mira & J. R. Álvarez (Eds.), Computational methods in neural modeling. Berlin: Springer.
Bharitkar, S., Tsuchiya, K., & Takefuji, Y. (1999). Microcode optimization with neural networks. IEEE Trans. on Neural Networks, 10(3), 698–703.
Bronshtein, I., Semendyayev, K., Musiol, G., & Muehlig, H. (2004). Handbook of mathematics. Berlin: Springer.
Chen, T., & Amari, S.-I. (2001). New theorems on global convergence of some dynamical systems. Neural Networks, 14, 251–255.
Ciarlet, P., & Lions, J. (2002). Handbook of numerical analysis. Amsterdam: Elsevier.
Demmel, J. (2000). Hermitian eigenproblems. In Z. Bai, J. Demmel, J. Dongarra, A. Ruhe, & H. van der Vorst (Eds.), Templates for the solution of algebraic eigenvalue problems: A practical guide (pp. 11–14). Philadelphia: SIAM.
Feng, J., & Brown, D. (1998). Fixed-point attractor analysis for a class of neurodynamics. Neural Computation, 10(1), 189–213.
Golub, G. H., & van Loan, C. F. (1996). Matrix computations. Baltimore, MD: Johns Hopkins University Press.
Guckenheimer, J., & Holmes, P. (1997). Nonlinear oscillations, dynamical systems, and bifurcations of vector fields. Berlin: Springer.
Hirsch, M. W., & Smale, S. (1974). Differential equations, dynamical systems, and linear algebra. Orlando, FL: Academic Press.
Hopfield, J. (1984). Neurons with graded response have collective computational properties like those of two-state neurons. Proc. Natl. Acad. Sci. USA, 81, 3088–3092.
Horn, R. A., & Johnson, C. R. (1985). Matrix analysis. Cambridge: Cambridge University Press.
Joya, G., Atencia, M. A., & Sandoval, F. (1997). Associating arbitrary-order energy functions to an artificial neural network: Implications concerning the resolution of optimization problems. Neurocomputing, 14, 139–156.
Joya, G., Atencia, M. A., & Sandoval, F. (2002). Hopfield neural networks for optimization: Study of the different dynamics. Neurocomputing, 43(1–4), 219–237.
Juang, J.-C. (1999). Stability analysis of Hopfield-type neural networks. IEEE Trans. on Neural Networks, 10(6), 1366–1374.
Matsuda, S. (1998). Optimal Hopfield network for combinatorial optimization with linear cost function. IEEE Trans. on Neural Networks, 9(6), 1319–1330.
Matsuda, S. (2002). "Optimal" neural representation of higher order for traveling salesman problems. Electronics and Communications in Japan, Part 2, 85(9), 32–42.
Pardalos, P. M., & Resende, M. G. (2002). Handbook of applied optimization. New York: Oxford University Press.
Riaza, R., & Zufiria, P. J. (2002). Discretization of implicit ODEs for singular root-finding problems. Journal of Computational and Applied Mathematics, 140, 695–712.
Samad, T., & Harper, P. (1990). High-order Hopfield and Tank optimization networks. Parallel Computing, 16, 287–292.
Schropp, J. (1995). Using dynamical systems methods to solve minimization problems. Applied Numerical Mathematics, 18, 321–335.
Schropp, J. (2000). One-step and multistep procedures for constrained minimization problems. IMA Journal of Numerical Analysis, 20(1), 135–152.
Sejnowski, T. J. (1986). Higher-order Boltzmann machines. In J. S. Denker (Ed.), Neural networks for computing (pp. 398–403). College Park, MD: American Institute of Physics.
Smith, K. A. (1999). Neural networks for combinatorial optimization: A review of more than a decade of research. INFORMS Journal on Computing, 11(1), 15–34.
Stuart, A., & Humphries, A. (1996). Dynamical systems and numerical analysis. Cambridge: Cambridge University Press.
Takefuji, Y., & Lee, K. C. (1991). Artificial neural networks for four-coloring map problems and K-colorability problems. IEEE Trans. on Circuits and Systems, 38(3), 326–333.
Tank, D., & Hopfield, J. (1985). "Neural" computation of decisions in optimization problems. Biological Cybernetics, 52, 141–152.
Tiňo, P., Horne, B., & Giles, C. (2001). Attractive periodic sets in discrete-time recurrent networks (with emphasis on fixed-point stability and bifurcations in two-neuron networks). Neural Computation, 13(6), 1379–1414.
Vidyasagar, M. (1995). Minimum-seeking properties of analog neural networks with multilinear objective functions. IEEE Trans. on Automatic Control, 40(8), 1359–1375.
Wilson, G., & Pawley, G. (1988). On the stability of the travelling salesman problem algorithm of Hopfield and Tank. Biological Cybernetics, 58(1), 63–70.
Received May 10, 2004; accepted January 10, 2005.
LETTER
Communicated by Morris Hirsch
Some Generalized Sufficient Convergence Criteria for Nonlinear Continuous Neural Networks

Jito Vanualailai
[email protected] Department of Mathematics and Computing Science, University of the South Pacific, Suva, Fiji
Shin-ichi Nakagiri
[email protected] Department of Applied Mathematics, Faculty of Engineering, Kobe University, Kobe, Japan
A reason for applying the direct method of Lyapunov to artificial neural networks (ANNs) is to design dynamical neural networks so that they exhibit global asymptotic stability. Lyapunov functions that frequently appear in the ANN literature include the quadratic function, the Persidskii function, and the Lur'e-Postnikov function. This contribution revisits the quadratic function and shows that, via Krasovskii-like stability criteria, it is possible to have a very simple and systematic procedure to obtain not only new and generalized results but also well-known sufficient conditions for convergence established recently by non-Lyapunov methods, such as the matrix measure and the nonlinear measure.

1 Introduction

The direct method of Lyapunov, which uses energy-like functions called Lyapunov functions, is now a well-entrenched technique in the qualitative analysis of mathematical systems governed by differential equations. A flurry of activity by mathematicians, particularly between the early 1940s and the late 1960s, extended the work of Lyapunov to produce results that are now indispensable in many applications. (A modern review of the Lyapunov method and its many applications is by Sastry, 1999.) This letter is motivated to a large extent by modern applications of the Lyapunov method, especially those in artificial neural networks. Historically, the rigorous application of the Lyapunov method to artificial neural networks can be traced back to the pioneering work of the mathematician Grossberg, who in the 1970s initiated Lyapunov function- and Lyapunov functional-based methods for classifying the dynamical behaviors of a wide variety of competitive dynamical systems. By 1988, he had accumulated sufficient important and fundamental results, which he

Neural Computation 17, 1820–1835 (2005)
© 2005 Massachusetts Institute of Technology
Nonlinear Continuous Neural Networks
1821
then summarized in an excellent review (Grossberg, 1988). Between the late 1970s and late 1980s, the physicist Hopfield concentrated on a particular form of dynamical systems that were being considered in general by Grossberg. Hopfield (1984) published the landmark paper that popularized the term Hopfield or Hopfield-type (artificial) neural network. Hopfield designed his network using a Lyapunov function that is now recognized as a special form of the Lyapunov function that was proposed a year earlier by Cohen and Grossberg (1983) for a more general system. Then Hopfield and Tank (1985, 1986) applied Hopfield’s earlier findings to firmly establish the role of Hopfield-type neural networks as standard models that perform some computational task, such as recognition and association, on a given key pattern via interaction between a number of interconnected units having simple functions. The key pattern presented to a Hopfield-type neural network is an initial state of the network. Then the network must be designed (using the Lyapunov method, for example) such that the network’s state settles ultimately to an equilibrium that depends on only the key pattern. In this letter, we also consider this model and discuss recent results. We start by considering the autonomous system of the form x˙ ≡
dx/dt = g(x),  x(t0) = x0,  t ≥ t0 ≥ 0.   (1.1)
Guided by the well-known 1954 result of Krasovskii, we seek to portray a simple method of generating the quadratic Lyapunov function for equation 1.1. The Lyapunov function guarantees global exponential stability. If system 1.1 is further perturbed by time-varying external inputs, then we can prove, by an appropriate extension of the quadratic Lyapunov function, that the perturbed system yields convergence of solutions to the equilibrium points of equation 1.1 when the perturbation is either L 2 or bounded and decays with time. We end by applying the stability and convergence criteria to artificial neural networks of the form of equation 1.1 and to Hopfield-type neural networks. Throughout the letter, we suppose that in equation 1.1, the function g = (g1 , . . . , gn )T is smooth enough to guarantee the existence, uniqueness, and continuous dependence of solutions x(t) = x(t; x0 ), with x = (x1 , . . . , xn )T . It is assumed that readers are familiar with the various standard definitions of Lyapunov stability. (We will use those in Sastry, 1999.) Thus, without loss of generality, we carry the assumption that g(0) = 0, so that 0 is the equilibrium point of equation 1.1.
2 Convergence Criteria

In 1954, Krasovskii established an asymptotic stability criterion that avoided the linearization principle. He assumed that g ∈ C¹[Rⁿ, Rⁿ] and g(0) = 0.
1822
J. Vanualailai and S. Nakagiri
Then system 1.1 can be written as

ẋ = ∫₀¹ J(sx) x ds,  x(t0) = x0,
where J is the Jacobian matrix

J(x) = (∂g/∂x)(x) = [Jij(x)]_{n×n},  where Jij(x) = (∂gi/∂xj)(x).
The following result by Krasovskii is a fundamental one in control theory.

Theorem 1 (Krasovskii, 1954). Let g ∈ C¹[Rⁿ, Rⁿ] and g(0) = 0. If there exists a constant positive definite symmetric matrix P such that xᵀ[P J(x) + Jᵀ(x) P]x is a negative definite function, then the zero solution of equation 1.1 is globally asymptotically stable.

We next state two different versions of Krasovskii's theorem that are applicable to artificial neural networks. The first result is applicable to continuous artificial neural networks with the general form of equation 1.1. The second result is applicable to a specific case of equation 1.1. Both results explicitly use each component of system 1.1. Thus, defining

D(x) = [dij(x)]_{n×n},  where dij(x) = ∫₀¹ Jij(sx) ds,   (2.1)

and given that g(0) = 0, we can write system 1.1 as ẋ = D(x)x, x(t0) = x0, the ith component of which is

ẋi = dii(x) xi + Σ_{j=1, j≠i}^{n} dij(x) xj.
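The factorization g(x) = D(x)x, which follows from g(0) = 0 and the fundamental theorem of calculus applied to s ↦ g(sx), can be checked numerically. The sketch below uses an illustrative Hopfield-type right-hand side (not taken from the letter) and approximates the integrals of equation 2.1 by a simple trapezoidal rule.

```python
import numpy as np

W = np.array([[0.0, 0.2, -0.1],
              [0.2, 0.0, 0.2],
              [-0.1, 0.2, 0.0]])

def g(x):
    # Illustrative right-hand side with g(0) = 0 (an assumption for this demo)
    return -x + W @ np.tanh(x)

def jacobian(x):
    # J_ij(x) = dg_i/dx_j = -delta_ij + w_ij sech^2(x_j)
    return -np.eye(3) + W * (1.0 / np.cosh(x) ** 2)

def D(x, m=2001):
    # Equation 2.1: d_ij(x) = integral over s in [0,1] of J_ij(sx),
    # approximated by the trapezoidal rule on m sample points
    ss = np.linspace(0.0, 1.0, m)
    Js = np.array([jacobian(s * x) for s in ss])
    h = ss[1] - ss[0]
    return h * (0.5 * Js[0] + Js[1:-1].sum(axis=0) + 0.5 * Js[-1])

x = np.array([0.7, -1.2, 0.4])
print(np.allclose(D(x) @ x, g(x), atol=1e-6))   # True: g(x) = D(x) x
```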
Our first result is given next.

Theorem 2. Let g ∈ C¹[Rⁿ, Rⁿ] and g(0) = 0, and define

βi(x) = dii(x) + (1/2) Σ_{j=1, j≠i}^{n} |dij(x) + dji(x)|.   (2.2)
Suppose there are constants $c_i > 0$ such that $\beta_i(x) \le -c_i < 0$ for $i = 1, \dots, n$ and $x \in \mathbb{R}^n$. Then the zero solution of equation 1.1 is globally exponentially stable.

Proof. Consider, as a tentative Lyapunov function for system 1.1,
$$V(x) = \frac{1}{2}\sum_{i=1}^n x_i^2.$$
For $c = \min\{c_1, \dots, c_n\} > 0$, we have, along a solution of equation 1.1,
$$\begin{aligned}
\dot V_{(1.1)} &= \sum_{i=1}^n x_i \dot x_i = \sum_{i=1}^n x_i\Big(d_{ii}(x)x_i + \sum_{\substack{j=1\\ j\neq i}}^n d_{ij}(x)x_j\Big)\\
&= \sum_{i=1}^n \Big(d_{ii}(x)x_i^2 + \sum_{\substack{j=1\\ j\neq i}}^n d_{ij}(x)x_j x_i\Big)\\
&= \sum_{i=1}^n \Big(d_{ii}(x)x_i^2 + \frac{1}{2}\sum_{\substack{j=1\\ j\neq i}}^n [d_{ij}(x) + d_{ji}(x)]\,x_j x_i\Big)\\
&\le \sum_{i=1}^n \Big(d_{ii}(x)x_i^2 + \frac{1}{4}\sum_{\substack{j=1\\ j\neq i}}^n |d_{ij}(x) + d_{ji}(x)|\,(x_j^2 + x_i^2)\Big)\\
&= \sum_{i=1}^n \Big(d_{ii}(x) + \frac{1}{2}\sum_{\substack{j=1\\ j\neq i}}^n |d_{ij}(x) + d_{ji}(x)|\Big)x_i^2\\
&= \sum_{i=1}^n \beta_i(x)\,x_i^2 \le -c\sum_{i=1}^n x_i^2 = -2cV,
\end{aligned}$$
so that $V(x(t)) \le V(x_0)\,e^{-2c(t-t_0)}$, $t \ge t_0 \ge 0$. Hence, the conclusion of theorem 2 follows, and $V$ is a Lyapunov function for system 1.1.
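The criterion of theorem 2 is easy to check numerically. The following sketch is illustrative and not part of the letter: the two-dimensional system $g$, the sample grid, and the tolerances are all assumptions. It approximates $d_{ij}(x)$ of equation 2.1 by a midpoint rule, evaluates $\beta_i(x)$ of equation 2.2 on a grid, and then confirms the predicted decay with a crude Euler simulation.

```python
import math

# Illustrative system (an assumption, not from the letter):
#   x1' = -2*x1 + tanh(x2),   x2' = -tanh(x1) - 2*x2
def g(x):
    return [-2*x[0] + math.tanh(x[1]), -math.tanh(x[0]) - 2*x[1]]

def jacobian(x):
    s1 = 1 - math.tanh(x[0])**2          # d/dx tanh(x) = sech^2(x)
    s2 = 1 - math.tanh(x[1])**2
    return [[-2.0, s2], [-s1, -2.0]]

def d_matrix(x, steps=200):
    # d_ij(x) = integral_0^1 J_ij(s*x) ds  (equation 2.1), midpoint rule
    d = [[0.0, 0.0], [0.0, 0.0]]
    for k in range(steps):
        s = (k + 0.5) / steps
        J = jacobian([s*x[0], s*x[1]])
        for i in range(2):
            for j in range(2):
                d[i][j] += J[i][j] / steps
    return d

def beta(x):
    # beta_i(x) = d_ii + (1/2) * sum_{j != i} |d_ij + d_ji|  (equation 2.2)
    d = d_matrix(x)
    return [d[i][i] + 0.5*sum(abs(d[i][j] + d[j][i])
                              for j in range(2) if j != i) for i in range(2)]

# beta_i stays below -1.5 on a sample grid, so theorem 2 applies with c = 1.5.
worst_beta = max(b for p in range(-5, 6) for q in range(-5, 6)
                 for b in beta([float(p), float(q)]))

# Euler simulation: the norm should decay at least like exp(-c*(t - t0)).
x, dt = [3.0, -2.0], 0.01
for _ in range(2000):                    # integrates from t = 0 to t = 20
    dx = g(x)
    x = [x[0] + dt*dx[0], x[1] + dt*dx[1]]
final_norm = math.hypot(x[0], x[1])
```

For this particular $g$, one finds $\beta_i(x) \le -1.5$ everywhere, so the simulated trajectory collapses to the origin at least as fast as $e^{-1.5(t-t_0)}$ in norm.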
To state our second result, we see that if we define
$$\tau_i(x) = J_{ii}(x) + \frac{1}{2}\sum_{\substack{j=1\\ j\neq i}}^n |J_{ij}(x) + J_{ji}(x)|, \tag{2.3}$$
and assume that $\tau_i(x) \le -c_i < 0$, then the time derivative of our Lyapunov function can be continued as follows:
$$\begin{aligned}
\dot V_{(1.1)} &\le \sum_{i=1}^n \Big(d_{ii}(x) + \frac{1}{2}\sum_{\substack{j=1\\ j\neq i}}^n |d_{ij}(x) + d_{ji}(x)|\Big)x_i^2\\
&= \sum_{i=1}^n \Big(\int_0^1 J_{ii}(sx)\,ds + \frac{1}{2}\sum_{\substack{j=1\\ j\neq i}}^n \Big|\int_0^1 [J_{ij}(sx) + J_{ji}(sx)]\,ds\Big|\Big)x_i^2\\
&\le \sum_{i=1}^n \Big(\int_0^1 \tau_i(sx)\,ds\Big)x_i^2 \le -\sum_{i=1}^n c_i x_i^2 \le -c\sum_{i=1}^n x_i^2 = -2cV.
\end{aligned} \tag{2.4}$$
We thus have the following result:

Theorem 3. Let $g \in C^1[\mathbb{R}^n, \mathbb{R}^n]$ and $g(0) = 0$, and define
$$\tau_i(x) = J_{ii}(x) + \frac{1}{2}\sum_{\substack{j=1\\ j\neq i}}^n |J_{ij}(x) + J_{ji}(x)|.$$
Suppose there are constants $c_i > 0$ such that $\tau_i(x) \le -c_i < 0$ for $i = 1, \dots, n$ and $x \in \mathbb{R}^n$. Then the zero solution of equation 1.1 is globally exponentially stable.

Our final result in this section recognizes that certain continuous artificial neural networks, such as the Hopfield-type networks, may be perturbed by external inputs. If the inputs are constants, then such neural networks can be recast into the form 1.1 and their stability analyzed accordingly. If the external inputs are time varying, then we need to consider a perturbed form of equation 1.1 as follows:
$$\dot x = g(x) + h(t), \quad x(t_0) = x_0, \tag{2.5}$$
where $h(t) = (h_1(t), \dots, h_n(t))^T$, a vector function of $t$, need not be continuous on $[0, \infty)$. Even for a simple perturbed system such as equation 2.5, establishing the asymptotic stability of the zero solution, assuming it is the equilibrium point of the system without $h(t)$, could be challenging without some simplifying assumptions on $h(t)$. A recent result that guarantees at least the convergence of solutions to the zero solution is as follows:
Theorem 4 (Vanualailai, Soma, & Nakagiri, 2002). Let the conditions of theorem 2 or theorem 3 hold. Then all solutions of equation 2.5 tend to zero if either one of the following conditions holds:

i. $\sum_{i=1}^n [h_i(t)]^2$ is bounded on $[0, \infty)$ and $\sum_{i=1}^n [h_i(t)]^2 \to 0$ as $t \to \infty$.

ii. $h_i(\cdot) \in L^2[t_0, \infty)$ for $i = 1, \dots, n$.
A simpler result that embraces the two conditions i and ii of theorem 4 is as follows:

Theorem 5. Let the conditions of theorem 2 or theorem 3 hold. If
$$\sum_{i=1}^n \int_t^{t+1} [h_i(s)]^2\,ds \to 0 \quad \text{as } t \to \infty, \tag{2.6}$$
then all solutions of equation 2.5 tend to zero.

It is clear that if condition i or condition ii of theorem 4 holds, then condition 2.6 of theorem 5 is satisfied. To prove theorem 5, we need the following lemma, which is a straightforward consequence of theorem A in Hara (1975):

Lemma 1 (Hara, 1975). Suppose that there exists a Lyapunov function $V(t, x)$ of equation 2.5, continuously differentiable on $[t_0, \infty) \times \mathbb{R}^n$, satisfying the following conditions:

i. $a(\|x\|) \le V(t, x) \le b(\|x\|)$, where $a(r) \in \mathrm{CIP}$ (the family of continuous and increasing positive definite functions), $a(r) \to \infty$ as $r \to \infty$, and $b(r) \in \mathrm{CIP}$.

ii. $\dot V_{(2.5)}(t, x) \stackrel{\mathrm{def}}{=} \limsup_{h \to 0^+} \frac{1}{h}\{V(t + h,\, x + h[g(x) + h(t)]) - V(t, x)\} \le -cV(t, x) + \lambda(t)[1 + V(t, x)]$, where $c > 0$ is a constant and $\lambda(t) \ge 0$ satisfies $\int_t^{t+1} \lambda(s)\,ds \to 0$ as $t \to \infty$.
Then the solution $x(t)$ of equation 2.5 satisfies $x(t) \to 0$ as $t \to \infty$.

Proof. Utilizing the function
$$V(t, x) = \frac{1}{2}\sum_{i=1}^n x_i^2,$$
and continuing from equation 2.4, we have, along a solution of system 2.5, for $c = \min\{c_1, \dots, c_n\} > 0$ and for $\epsilon > 0$,
$$\begin{aligned}
\dot V_{(2.5)} &\le -2cV + \sum_{i=1}^n x_i h_i(t)\\
&\le -c\sum_{i=1}^n x_i^2 + \epsilon\sum_{i=1}^n x_i^2 + \frac{1}{4\epsilon}\sum_{i=1}^n [h_i(t)]^2\\
&= -(c - \epsilon)\sum_{i=1}^n x_i^2 + \frac{1}{4\epsilon}\sum_{i=1}^n [h_i(t)]^2\\
&= -2(c - \epsilon)V + \frac{1}{4\epsilon}\sum_{i=1}^n [h_i(t)]^2.
\end{aligned}$$
If we take $\epsilon = c/2$ and $\lambda(t) = \frac{1}{2c}\sum_{i=1}^n [h_i(t)]^2 \ge 0$, then
$$\dot V_{(2.5)} \le -cV + \lambda(t)(1 + V),$$
noting that $V$ is nonnegative. Hence, the conclusion of theorem 5 follows immediately from lemma 1.
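Condition 2.6 is a sliding-window $L^2$ condition and is simple to test numerically. The sketch below is illustrative only (the two perturbations are made up): it approximates $\int_t^{t+1}[h(s)]^2\,ds$ by a midpoint rule and shows that the window energy of a decaying perturbation vanishes, while that of a persistent perturbation does not.

```python
import math

def window_energy(h, t, steps=2000):
    # midpoint-rule approximation of \int_t^{t+1} h(s)^2 ds  (condition 2.6)
    return sum(h(t + (k + 0.5)/steps)**2 for k in range(steps)) / steps

decaying = lambda s: math.exp(-s)    # in L^2[0, inf): condition 2.6 holds
persistent = lambda s: 1.0           # violates 2.6: window energy stays at 1

tail = window_energy(decaying, 10.0)     # roughly (e^-20 - e^-22)/2, tiny
flat = window_energy(persistent, 10.0)   # equals 1 for every window
```

Here theorem 5 (through corollary 7 below for Hopfield-type networks) guarantees convergence for the decaying input but is silent for the persistent one.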
3 Application to Artificial Neural Networks

Certain artificial neural networks can be considered as dynamical systems for which the convergence of system trajectories to equilibrium states is a necessity. Moreover, for these networks, it is best to guarantee exponential convergence, since this implies that the rate of convergence to an equilibrium state can be measured, an important aspect in the stability analysis of neural networks (Yi, Heng, & Fu, 1999). A modern overview of the concepts, both mathematical and biological, associated with neural networks is given in Arbib (2002). In the first part of this section, we consider a continuous neural network that is described thoroughly in Hirsch (1989) and provide a convergence criterion using theorem 2. In the second part, we consider the Hopfield-type neural network and provide convergence criteria using theorem 3 if the external inputs are constants, and theorem 5 if the external inputs are time varying.

3.1 An Artificial Neural Network of the Type $\dot x = g(x)$. Following the description by Hirsch (1989), we consider a net that has $n$ units. To the $i$th unit, we associate its activation state at time $t$, a real number $x_i = x_i(t)$; an output function $\mu_i$; a fixed bias $\theta_i$; and an output signal $R_i = \mu_i(x_i + \theta_i)$. The
synaptic weight or connection strength on the line from unit $j$ to unit $i$ is a fixed real number $W_{ij}$. When $W_{ij} = 0$, there is no transmission from unit $j$ to unit $i$. The incoming signal from unit $j$ to unit $i$ is $S_{ij} = W_{ij} R_j$. In addition, there can be a vector $I$ of any number of external inputs feeding into some or all units, so that we may write $I = (I_1, \dots, I_m)^T$. A neural network with fixed weights is a dynamical system: given initial values of the activation of all units, the future activations can be computed. The future activation states are assumed to be determined by a system of $n$ differential equations, the $i$th equation of which is
$$\dot x_i = G_i(x_i, S_{i1}, \dots, S_{in}, I) = G_i(x_i, W_{i1}R_1, \dots, W_{in}R_n, I) = G_i\big(x_i;\, W_{i1}\mu_1(x_1 + \theta_1), \dots, W_{in}\mu_n(x_n + \theta_n);\, I_1, \dots, I_m\big). \tag{3.1}$$
With $W_{ij}$, $\theta_i$, $I_k$, and some initial value $x_i(t_0)$, $t_0 \ge 0$, assumed known, we can write equation 3.1 as $\dot x_i = g_i(x_1, \dots, x_n)$, $x_i(t_0) = x_{i0}$, which is the $i$th component of the system
$$\dot x = g(x), \quad x(t_0) = x_0, \tag{3.2}$$
where $g$ is a vector field on Euclidean space $\mathbb{R}^n$. We assume that $g$ is continuously differentiable and satisfies the usual theorems on existence, uniqueness, and continuous dependence of solutions. Thus, since $g \in C^1[\mathbb{R}^n, \mathbb{R}^n]$, we can define $D(x)$ as in equation 2.1 but using $g$ in equation 3.2. Hence, if $g(0) = 0$, then system 3.2 can be written as
$$\dot x = D(x)x, \quad x(t_0) = x_0, \tag{3.3}$$
the $i$th component of which is
$$\dot x_i = d_{ii}(x)x_i + \sum_{\substack{j=1\\ j\neq i}}^n d_{ij}(x)x_j.$$
First, we state a comparable result by Lakshmikantham, Matrosov, and Sivasundaram (1991), who used the concept of vector Lyapunov functions.

Theorem 6 (Lakshmikantham et al., 1991). Let $g \in C^1[\mathbb{R}^n, \mathbb{R}^n]$ and $g(0) = 0$. Define
$$\beta_i(x) = d_{ii}(x) + \sum_{\substack{j=1\\ j\neq i}}^n |d_{ij}(x)|.$$
Suppose that $\beta_i(x) < 0$ whenever $x_i^2 \ge x_j^2$, for $i, j = 1, \dots, n$ and $x \in \mathbb{R}^n$, $x \neq 0$. Then the zero solution of equation 3.2 is globally asymptotically stable.

If we apply theorem 2, we obtain a new convergence criterion.

Corollary 1. Let $g \in C^1[\mathbb{R}^n, \mathbb{R}^n]$ and $g(0) = 0$. Define
$$\beta_i(x) = d_{ii}(x) + \frac{1}{2}\sum_{\substack{j=1\\ j\neq i}}^n |d_{ij}(x) + d_{ji}(x)|.$$
Suppose there are constants $c_i > 0$ such that $\beta_i(x) \le -c_i < 0$ for $i = 1, \dots, n$ and $x \in \mathbb{R}^n$. Then the zero solution of equation 3.2 is globally exponentially stable.

Remark 1. For the case where $d_{ij}(x) = -d_{ji}(x)$, $i \neq j$, corollary 1 is more effective than theorem 6 in predicting the stability of system 3.2, in that if $d_{ii}(x) < 0$ for $i = 1, \dots, n$ and $x \in \mathbb{R}^n$, then network 3.2 is globally asymptotically stable no matter how large the $|d_{ij}(x)|$ are. If $d_{ij}(x) = -d_{ji}(x)$ for all $i, j = 1, \dots, n$, then the matrix $D$ in equation 3.3 is skew symmetric. In this extreme case, network 3.2 is stable by corollary 1. We have therefore obtained a result similar to Matsuoka's well-known result (1992), which says that if the synaptic weight matrices in Hopfield-type neural networks are skew symmetric, then the networks are at least stable; our result, however, applies to the more general network 3.2, of which the Hopfield network is a specific case. The next section shows that Matsuoka's result is indeed but a special case of theorem 3.

3.2 The Hopfield-Type Neural Network. The Hopfield-type neural network is a specific case of network 3.2 and therefore of 3.1. It is modeled by the nonlinear differential equation,
$$\dot x_i = -a_i x_i + \sum_{j=1}^n W_{ij}\,\mu_j(x_j + \theta_j) + I_i(t) = -a_i x_i + \sum_{j=1}^n W_{ij}\,\nu_j(x_j) + I_i(t), \tag{3.4}$$
where $a_i > 0$ is the constant decay rate, $I_i(t)$ is the time-varying external input (to the $i$th neuron) defined almost everywhere on $[0, \infty)$, and $\nu_i$ is the suppressed notation for $\mu_i$ with the fixed bias $\theta_i$ incorporated
into $\nu_i$. The function $\nu_i$ is called the neuron activation function. Now define $A = \mathrm{diag}(-a_1, \dots, -a_n)$, $x = (x_1, \dots, x_n)^T$, $v(x) = (\nu_1(x_1), \dots, \nu_n(x_n))^T$, $W = [W_{ij}]_{n\times n}$, and $h(t) = (I_1(t), \dots, I_n(t))^T$. Then equation 3.4 is the $i$th component of the system
$$\dot x = Ax + Wv(x) + h(t), \quad x(t_0) = x_0. \tag{3.5}$$
3.2.1 Hopfield-Type Neural Networks with Constant External Inputs. Let us first look at the case of a constant external input vector, $h(t) \stackrel{\mathrm{def}}{=} k = (I_1, \dots, I_n)^T$. Let us assume the unique existence of the equilibrium point $x = x^*$ when the input vector is constant, so that $Ax^* + Wv(x^*) + k = 0$. Introduce the vector $u(t) = x(t) - x^*$. Then (on suppressing $t$),
$$\begin{aligned}
\dot u &= A[u + x^*] + Wv(u + x^*) + k\\
&= A[u + x^*] + Wv(u + x^*) + k - [Ax^* + Wv(x^*) + k]\\
&= A[u + x^* - x^*] + W[v(u + x^*) - v(x^*)]\\
&\stackrel{\mathrm{def}}{=} Au + Wr(u) \stackrel{\mathrm{def}}{=} \tilde g(u), \quad u(t_0) = u_0,
\end{aligned} \tag{3.6}$$
the $i$th component of which is
$$\dot u_i = -a_i u_i + \sum_{j=1}^n W_{ij}\,[\nu_j(u_j + x_j^*) - \nu_j(x_j^*)] = -a_i u_i + \sum_{j=1}^n W_{ij}\,r_j(u_j),$$
where $r(u) = (r_1(u_1), \dots, r_n(u_n))^T$. It is clear that $\tilde g(0) = 0$, so that theorem 3 is applicable to the zero solution of equation 3.6, and therefore to the equilibrium point $x = x^*$ of equation 3.5 with constant external inputs. Now, the Jacobian matrix yields
$$J_{ii}(u) = \frac{\partial \tilde g_i}{\partial u_i}(u) = -a_i + W_{ii}\,\nu_i'(u_i + x_i^*) = -a_i + W_{ii}\,r_i'(u_i)$$
and
$$J_{ij}(u) = \frac{\partial \tilde g_i}{\partial u_j}(u) = W_{ij}\,\nu_j'(u_j + x_j^*) = W_{ij}\,r_j'(u_j), \quad i \neq j.$$
Theorem 3 immediately gives the following new result.

Corollary 2. Let the neuron activation functions $r_i(u_i)$ be of $C^1$-class and $r_i(0) = 0$. Define
$$\tau_i(u) = -a_i + W_{ii}\,r_i'(u_i) + \frac{1}{2}\sum_{\substack{j=1\\ j\neq i}}^n |W_{ij}\,r_j'(u_j) + W_{ji}\,r_i'(u_i)|.$$
Suppose there are constants $c_i > 0$ such that $\tau_i(u) \le -c_i$ for $i = 1, \dots, n$ and $u \in \mathbb{R}^n$. Then the zero solution of equation 3.6 is globally exponentially stable.

We shall now show that corollary 2 is an important encompassing result because it is a generalization of two well-known results. First, it is a generalization of a result by Matsuoka (1992). Corollary 2 immediately gives the following:

Corollary 3. Let the neuron activation functions $r_i(u_i)$ be of $C^1$-class and $r_i(0) = 0$. Assume there exist constants $\rho_i > 0$ such that $0 \le r_i'(u_i) \le \rho_i$ for $i = 1, \dots, n$ and $u \in \mathbb{R}^n$. If
$$W_{ii}\rho_i + \frac{1}{2}\sum_{\substack{j=1\\ j\neq i}}^n |W_{ij}\rho_j + W_{ji}\rho_i| < a_i \quad \text{for } 1 \le i \le n, \tag{3.7}$$
then the zero solution of equation 3.6 is globally exponentially stable.

Matsuoka's result (corollary 2, p. 498) is the case where $\rho_i = 1$ and $a_i = 1$. Also, it is noted that Matsuoka improved a result by Hirsch (1989), who had $\frac{1}{2}\sum_{j=1, j\neq i}^n (|W_{ij}| + |W_{ji}|)$ in the second term of equation 3.7.
The first generalization of Matsuoka's result appeared in 2001, in Liang and Si (2001). However, their derivation employs a full Lur'e-Postnikov function and is unnecessarily difficult compared to the derivation (and hence the statement) of corollary 3. In this respect, corollary 3 is a significant contribution of this article. Corollary 2 is also a generalization of a result by Fang and Kincaid (1996). Via the inequality
$$|a + b| \le \max\{|a + b|, |a|, |b|\} \le |a| + |b|, \qquad a, b \in \mathbb{R},$$
we can use corollary 2 to obtain two practically useful results:

Corollary 4. Let the neuron activation functions $r_i(u_i)$ be of $C^1$-class and $r_i(0) = 0$. Assume there exist constants $\rho_i > 0$ such that $0 \le r_i'(u_i) \le \rho_i$ for $i = 1, \dots, n$ and $u \in \mathbb{R}^n$. Define $y^+ = \max\{y, 0\}$ for all real numbers $y$ and
$$\psi_i = -a_i + W_{ii}^+\rho_i + \frac{1}{2}\sum_{\substack{j=1\\ j\neq i}}^n \overline W_{ij}, \qquad \text{where } \overline W_{ij} = \max\big\{|W_{ij}\rho_j + W_{ji}\rho_i|,\ |W_{ij}|\rho_j,\ |W_{ji}|\rho_i\big\}.$$
If $\psi_i < 0$ for $i = 1, \dots, n$, then the zero solution of equation 3.6 is globally exponentially stable.

Corollary 5. Let the neuron activation functions $r_i(u_i)$ be of $C^1$-class and $r_i(0) = 0$. Assume there exist constants $\rho_i > 0$ such that $0 \le r_i'(u_i) \le \rho_i$ for $i = 1, \dots, n$ and $u \in \mathbb{R}^n$. Define $y^+ = \max\{y, 0\}$ for all real numbers $y$ and
$$\psi_i = -a_i + W_{ii}^+\rho_i + \frac{1}{2}\sum_{\substack{j=1\\ j\neq i}}^n (|W_{ij}|\rho_j + |W_{ji}|\rho_i).$$
If $\psi_i < 0$ for $i = 1, \dots, n$, then the zero solution of equation 3.6 is globally exponentially stable.

Note that we can obtain corollary 5 directly from theorem 3, given that $|J_{ij}(x) + J_{ji}(x)| \le |J_{ij}(x)| + |J_{ji}(x)|$. Now, in corollary 4, if the quantity $\overline W_{ij} = \max\{|W_{ij}\rho_j + W_{ji}\rho_i|, |W_{ij}|\rho_j, |W_{ji}|\rho_i\}$ is replaced by $|W_{ij}|\rho_j$, then we obtain a slight improvement of Fang and Kincaid (1996, theorem 3.8 ii-b). Corollary 5 corresponds to Fang and Kincaid (1996, theorem 3.8 ii-d). As the following example shows, corollary 4 is easier to apply:

Example. Consider the system
$$\dot x_1 = -x_1 + 0.90\,\nu_2(x_2) + I_1, \qquad \dot x_2 = -x_2 - 0.55\,\nu_1(x_1) + I_2, \tag{3.8}$$
where $\nu_1(r) = \tanh 2r$, $\nu_2(r) = \tanh r$, and $I_1$ and $I_2$ are arbitrary inputs. We have
$$A = \begin{pmatrix} -1 & 0\\ 0 & -1 \end{pmatrix}, \qquad W = \begin{pmatrix} 0 & 0.90\\ -0.55 & 0 \end{pmatrix},$$
and $\rho_1 = 2$, $\rho_2 = 1$. Applying corollary 4, we see that
$$\psi_1 = -a_1 + W_{11}^+\rho_1 + \frac{1}{2}\max\big\{|W_{12}\rho_2 + W_{21}\rho_1|,\ |W_{12}|\rho_2,\ |W_{21}|\rho_1\big\} = -1 + \frac{1}{2}\,(|{-0.55}|\cdot 2) < 0,$$
and
$$\psi_2 = -a_2 + W_{22}^+\rho_2 + \frac{1}{2}\max\big\{|W_{21}\rho_1 + W_{12}\rho_2|,\ |W_{21}|\rho_1,\ |W_{12}|\rho_2\big\} = -1 + \frac{1}{2}\,(|{-0.55}|\cdot 2) < 0.$$
Hence, every equilibrium point $(x_1^*, x_2^*)$ of equation 3.8 corresponding to every constant external input vector $k = (I_1, I_2)^T$ is globally exponentially stable.

Remark 2. On a historical note, the result by Hirsch (1989) was obtained by the application of the Gershgorin criterion, a non-Lyapunov method. The first major generalization and improvement of Hirsch's result was made when Matsuoka (1992) used a Persidskii-type Lyapunov function to replace the term $|W_{ij}| + |W_{ji}|$ by $|W_{ij} + W_{ji}|$. Hence, Matsuoka's results were effective in predicting the stability of systems in which $W_{ij} = -W_{ji}$. After Matsuoka, the Persidskii-type Lyapunov function and the Lur'e-Postnikov Lyapunov function were used in the results of many researchers. The frequently cited results include those by Forti and Tesi (1995), Arik and Tavsanoglu (1998), Juang (1999), Yi et al. (1999), Guan, Chen, and Qin (2000), Kaszkurewicz and Bhaya (2000), Liang and Si (2001), and Chen, Lu, and Amari (2002). These results are essentially on establishing the existence of equilibrium solutions, sufficient conditions for exponential stability, rates of exponential stability of Hopfield-type neural networks, and generalized results. One may also note the elegant results of Pakdaman, Malta, and Grotta-Ragazzo (1999), who used the quadratic Lyapunov function as a contraction map to prove both the existence and global stability of the equilibrium points of Hopfield-type neural networks. The major non-Lyapunov methods that have produced similar results (such as corollary 5) and other sufficient conditions for convergence include the matrix measure technique by Fang and Kincaid (1996) and the nonlinear measure technique formulated by Qiao, Peng, and Xu (2001).
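As a numerical check of the example above (this sketch is illustrative and not part of the letter; the constant inputs and tolerances are assumptions), the code below recomputes $\psi_1$ and $\psi_2$ and verifies global exponential stability by integrating system 3.8 from two widely separated initial states: both trajectories should collapse to the same equilibrium.

```python
import math

# System 3.8 with a constant input, chosen here as (I1, I2) = (0.3, -0.2).
I1, I2 = 0.3, -0.2
def f(x):
    return [-x[0] + 0.90*math.tanh(x[1]) + I1,       # nu2(r) = tanh r
            -x[1] - 0.55*math.tanh(2*x[0]) + I2]     # nu1(r) = tanh 2r

# psi_i of corollary 4 with rho1 = 2, rho2 = 1: both equal -0.45 < 0.
psi1 = -1 + 0.5*max(abs(0.90*1 + (-0.55)*2), abs(0.90)*1, abs(-0.55)*2)
psi2 = -1 + 0.5*max(abs(-0.55*2 + 0.90*1), abs(-0.55)*2, abs(0.90)*1)

def euler(x0, dt=0.01, steps=6000):
    # crude forward-Euler integration of system 3.8 up to t = steps*dt
    x = list(x0)
    for _ in range(steps):
        dx = f(x)
        x = [x[0] + dt*dx[0], x[1] + dt*dx[1]]
    return x

p = euler([5.0, -4.0])
q = euler([-3.0, 6.0])
gap = math.hypot(p[0] - q[0], p[1] - q[1])   # distance between end states
```

Both runs settle at the same point $(x_1^*, x_2^*)$, as corollary 4 predicts for any constant input vector.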
However, apart from Liang and Si (2001) and this letter, none of the comparable results that appeared after 1992 seems to have reproduced Matsuoka's superior result, which also conclusively established global asymptotic stability in networks with skew-symmetric synaptic weight matrices. The approach by Liang and Si (2001), which used a Lur'e-Postnikov Lyapunov function, is unnecessarily more difficult than the approach in this article.

3.2.2 Hopfield-Type Neural Networks with Time-Varying External Inputs. Let us next consider the case of time-varying external inputs. First, we look at the autonomous system,
$$\dot x = Ax + Wv(x), \quad x(t_0) = x_0. \tag{3.9}$$
For this, we assume that $x^* = (x_1^*, \dots, x_n^*)^T$ is the unique equilibrium point, so that $Ax^* + Wv(x^*) = 0$. Again, using the variable $u(t) = x(t) - x^*$, we have, from equation 3.6 with $k = 0$,
$$\dot u = Au + Wr(u) = \tilde g(u), \quad u(t_0) = u_0, \tag{3.10}$$
so that we can apply theorem 3 to the zero solution of equation 3.10 and hence to the equilibrium point of equation 3.9. Since either corollary 2, 3, 4, or 5 guarantees stability given any constant external input vector, including 0, we have the following result:

Corollary 6. Let the conditions of corollary 2, corollary 3, corollary 4, or corollary 5 hold. Then the zero solution of equation 3.10 is globally exponentially stable.

Let us now perturb equation 3.10 with the time-varying external input vector $h(t)$ to get the following system:
$$\dot u(t) = \tilde g(u) + h(t), \quad u(t_0) = u_0. \tag{3.11}$$
We can thus apply theorem 5, giving the convergence of all solutions of equation 3.11 to the zero solution, and hence convergence of all solutions of equation 3.5 to the solution $x^*$.

Corollary 7. Let the conditions of corollary 6 hold. If
$$\sum_{i=1}^n \int_t^{t+1} [h_i(s)]^2\,ds \to 0 \quad \text{as } t \to \infty,$$
then all solutions of equation 3.11 tend to zero.

Remark 3. Corollary 7 is new and deals with the time-dependent inputs that maintain convergence. It considerably improves the most recent result that first deals with strictly time-varying external inputs, namely, theorem 2 of Vanualailai et al. (2002), by proposing more practically useful results in the form of corollary 6 and corollary 7. If $|h_i(t)| \le k_i$ for some constants $k_i > 0$, then it is a mistake to use the well-known theorem of Malkin to conclude global stability or stability under persistent disturbances, since $h$ does not depend on $x$. In fact, to conclude this, it is best to use a practical stability criterion proposed by Lakshmikantham, Leela, and Martynyuk (1990), since it gives more than a mere statement of the existence of the bounds of the disturbances and initial conditions that maintain bounded outputs. Practical stability was first applied to Hopfield-type neural networks by Köksal and Sivasundaram (1993), but their proposed practical stability criteria were restrictive and were improved recently by Vanualailai et al. (2002).

Acknowledgments

We thank T. Hara for helpful discussions and the referees for their constructive remarks. J. V. acknowledges the French government and the French embassy, Fiji, for providing funds and assistance to allow him to spend
his sabbatical leave (December 2002–March 2003) at the Laboratoire des Signaux et Systèmes, Supélec, France, where the first draft of this article was written.

References

Arbib, M. A. (2002). The handbook of brain theory and neural networks (2nd ed.). Cambridge, MA: MIT Press.

Arik, S., & Tavsanoglu, V. (1998). A comment on "Comments on necessary and sufficient condition for absolute stability of neural networks." IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications, 45, 595–596.

Chen, T., Lu, W., & Amari, S. (2002). Global convergence rate of recurrently connected neural networks. Neural Computation, 14, 2947–2957.

Cohen, M. A., & Grossberg, S. (1983). Absolute stability of global pattern formation and parallel memory storage by competitive neural networks. IEEE Transactions on Systems, Man and Cybernetics, SMC-13, 815–826.

Fang, Y., & Kincaid, T. G. (1996). Stability analysis of dynamical neural networks. IEEE Transactions on Neural Networks, 7, 996–1005.

Forti, M., & Tesi, A. (1995). New conditions for global stability of neural networks with applications to linear and quadratic programming problems. IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications, 42, 354–366.

Grossberg, S. (1988). Nonlinear neural networks: Principles, mechanisms and architectures. Neural Networks, 1, 17–61.

Guan, Z., Chen, G., & Qin, Y. (2000). On equilibria, stability and instability of Hopfield neural networks. IEEE Transactions on Neural Networks, 11, 534–540.

Hara, T. (1975). On the asymptotic behavior of solutions of certain non-autonomous differential equations. Osaka Journal of Mathematics, 12, 267–282.

Hirsch, M. W. (1989). Convergent activation dynamics in continuous time networks. Neural Networks, 2, 331–349.

Hopfield, J. J. (1984). Neurons with graded response have collective computational properties like those of two-state neurons. Proceedings of the National Academy of Sciences, 81, 3088–3092.

Hopfield, J. J., & Tank, D. W. (1985). "Neural" computation of decisions in optimization problems. Biological Cybernetics, 52, 141–152.

Hopfield, J. J., & Tank, D. W. (1986). Computing with neural circuits: A model. Science, 233, 625–633.

Juang, J. (1999). Stability analysis of Hopfield-type neural networks. IEEE Transactions on Neural Networks, 10, 1366–1374.

Kaszkurewicz, E., & Bhaya, A. (2000). Matrix diagonal stability in systems and computation. Boston: Birkhäuser.

Köksal, S., & Sivasundaram, S. (1993). Stability properties of the Hopfield-type neural networks. Dynamics and Stability of Systems, 8, 181–187.

Krasovskii, N. N. (1954). On the stability in the large of a system of nonlinear differential equations. Prikladnaya Matematika i Mekhanika, 18, 735–737.

Lakshmikantham, V., Leela, S., & Martynyuk, A. A. (1990). Practical stability of nonlinear systems. Singapore: World Scientific.
Lakshmikantham, V., Matrosov, V. M., & Sivasundaram, S. (1991). Vector Lyapunov functions and the stability analysis of nonlinear systems. Norwood, MA: Kluwer.

Liang, X., & Si, J. (2001). Global exponential stability of neural networks with globally Lipschitz continuous activations and its application to linear variational inequality problem. IEEE Transactions on Neural Networks, 12, 349–359.

Matsuoka, K. (1992). Stability conditions for nonlinear continuous neural networks with asymmetric connection weights. Neural Networks, 5, 495–500.

Pakdaman, K., Malta, C. P., & Grotta-Ragazzo, C. (1999). Asymptotic behaviour of irreducible excitatory networks of analog graded-response neurons. IEEE Transactions on Neural Networks, 10, 1375–1381.

Qiao, H., Peng, J., & Xu, Z. (2001). Nonlinear measure: A new approach to exponential stability analysis of Hopfield-type neural networks. IEEE Transactions on Neural Networks, 12, 360–370.

Sastry, S. (1999). Nonlinear systems: Analysis, stability and control. New York: Springer-Verlag.

Vanualailai, J., Soma, T., & Nakagiri, S. (2002). Convergence of solutions and practical stability of Hopfield-type neural networks with time-varying external inputs. Nonlinear Studies, 9, 109–122.

Yi, Z., Heng, P. A., & Fu, A. W. C. (1999). Estimate of exponential convergence rate and exponential stability for neural networks. IEEE Transactions on Neural Networks, 10, 1487–1493.
Received December 18, 2003; accepted November 2, 2004.
LETTER
Communicated by Alan Yuille
Estimation and Marginalization Using the Kikuchi Approximation Methods

Payam Pakzad
[email protected]

Venkat Anantharam
[email protected]

Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA 94720, U.S.A.
In this letter, we examine a general method of approximation, known as the Kikuchi approximation method, for finding the marginals of a product distribution, as well as the corresponding partition function. The Kikuchi approximation method defines a certain constrained optimization problem, called the Kikuchi problem, and treats its stationary points as approximations to the desired marginals. We show how to associate a graph to any Kikuchi problem and describe a class of local message-passing algorithms along the edges of any such graph, which attempt to find the solutions to the problem. Implementation of these algorithms on graphs with fewer edges requires fewer operations in each iteration. We therefore characterize minimal graphs for a Kikuchi problem, which are those with the minimum number of edges. We show with empirical results that these simpler algorithms often offer significant savings in computational complexity, without suffering a loss in the convergence rate. We give conditions for the convexity of a given Kikuchi problem and the exactness of the approximations in terms of the loops of the minimal graph. More precisely, we show that if the minimal graph is cycle free, then the Kikuchi approximation method is exact, and the converse is also true generically. Together with the fact that in the cycle-free case, the iterative algorithms are equivalent to the well-known belief propagation algorithm, our results imply that, generically, the Kikuchi approximation method can be exact if and only if traditional junction tree methods could also solve the problem exactly. 1 Introduction In its most general form, the problem of finding the marginals of a product distribution is encountered frequently in various branches of science and engineering. An important special case, the probabilistic inference problem (Cowell, Dawid, Lauritzen, & Spiegelhalter 1999), is to infer the most probable scenarios, given a collection of observations. 
Under a Bayesian Neural Computation 17, 1836–1873 (2005)
© 2005 Massachusetts Institute of Technology
Estimation and Marginalization Using Kikuchi Methods
1837
causality model, this is equivalent to finding the marginals of a joint probability distribution in the form of the product of certain conditional probability functions. Applications of the probabilistic inference problem range from medical diagnosis to speech recognition and error-correcting codes.

Example 1. Consider the soft decoding of an $(n, k)$ binary linear code with $(n-k)\times n$ parity check matrix $H$ (see, e.g., Wicker, 1995). Let $P(x; y^*)$ represent the joint a posteriori probability density of bits of a code word $x := (x_1, \dots, x_n)$, with noisy observations $y^* := (y_1^*, \dots, y_n^*)$ over a binary memoryless channel. Then $P(x; y^*)$ can be represented as the product of some indicator functions representing the parity checks between the bits of the code words, as well as conditional probabilities representing the noisy observations,
$$P(x; y^*) = \frac{1}{Z}\prod_{i=1}^{n-k} \mathbf{1}\Big(\sum_{j=1}^{n} H_{i,j}\,x_j = 0\Big)\prod_{i=1}^{n} P(x_i)\,P(y_i^* \mid x_i),$$
where the summation is modulo 2, and $\mathbf{1}(\cdot)$ is the indicator function, taking values 1 or 0 depending on whether its argument is true or false; $Z$ is a normalizing constant called the partition function. In this case, the marginal $P_i(x_i; y^*)$ is used to find the most probable value of the $i$th bit.

In some applications, one is interested mainly in calculating the partition function:

Example 2. In a circuit-switched network, one is interested in finding the invariant distribution of calls in progress along routes of the network. It can be shown (see, e.g., Walrand & Varaiya, 1996) that the invariant distribution has the form
$$\pi(x_1, \dots, x_M) = \frac{1}{Z}\,q_1(x_1)\cdots q_M(x_M)\prod_{j=1}^{L} \mathbf{1}\Big(\sum_{i\in R_j} x_i < n_j\Big).$$
Here, $M$ is the total number of routes, $x_i$ is the number of calls along route $i$, $q_i(x_i)$ is the (known) invariant distribution of $x_i$ if the links had an infinite number of circuits, $L$ is the number of links in the network, $n_j$ is the capacity of link $j$, and $R_j \subset \{1, \dots, M\}$ is the index set of routes that use link $j$.
Finally, $Z$ is the partition function, defined by
$$Z := \sum_{x_1, \dots, x_M}\ \prod_{i=1}^{M} q_i(x_i)\prod_{j=1}^{L} \mathbf{1}\Big(\sum_{i\in R_j} x_i < n_j\Big).$$
1838
P. Pakzad and V. Anantharam
Therefore, in order to calculate the invariant distribution, one needs only to calculate the partition function $Z$. As another example, in physics, one can derive various thermodynamical properties of a system, such as the average energy and entropy, if the partition function is known as a function of the temperature (see, e.g., Kittel & Kroemer, 1980). Although the general marginalization problem can be exponentially complex, scientists and engineers have long explored ways to reduce the computational complexity of the calculations required to find the marginals, either exactly or approximately (see, e.g., Pearl, 1988; Aji & McEliece, 2000; Morita, 1994; Yedidia, Freeman, & Weiss, 2001; Luby, 2002; Pakzad & Anantharam, 2004). Most approaches use a graphical model to represent the interdependence of variables in the factor functions and use message-passing algorithms on this graph to localize the calculations. Belief propagation (Pearl, 1988) is one such algorithm. The success of low-density parity check (LDPC) codes (Gallager, 1963; MacKay & Neal, 1995) and turbo codes (Berrou, Glavieux, & Thitimajshima, 1993), which are decoded using instances of the belief propagation algorithm on a loopy graph (McEliece, MacKay, & Cheng, 1998), motivated many communications engineers to look more closely at belief propagation and junction graphs. So far, however, a general characterization of the quality of approximation and convergence properties of loopy belief propagation has not been discovered, despite a number of excellent partial results, which have considerably increased our understanding of the dynamics of such algorithms (see, e.g., Richardson & Urbanke, 2001; Weiss, 2000; Richardson, Shokrollahi, & Urbanke, 2001; Divsalar, Jin, & McEliece, 1998; Richardson, 2000; MacKay & Neal, 1995). Yedidia et al. (2001) showed recently that there is a close connection between loopy belief propagation and certain approximations to the variational free energy in statistical physics.
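For intuition, the marginals and partition function can always be obtained by brute-force summation over all configurations, which is exactly the exponential cost that the graphical methods discussed here avoid. The following minimal sketch is in the spirit of example 1; the tiny code, the channel numbers, and the fixed observation are all made-up assumptions, not from the article.

```python
from itertools import product

# Toy length-3 code with parity checks x1 + x2 = 0 and x2 + x3 = 0 (mod 2),
# so the only codewords are 000 and 111. Channel likelihoods are invented:
# like[i] = (P(y_i* | x_i = 0), P(y_i* | x_i = 1)) for a fixed observation y*.
like = [(0.9, 0.1), (0.2, 0.8), (0.7, 0.3)]

def weight(x):
    # product of parity-check indicators and per-bit likelihoods
    # (uniform prior P(x_i) absorbed into the normalization)
    if (x[0] ^ x[1]) or (x[1] ^ x[2]):
        return 0.0
    w = 1.0
    for i, xi in enumerate(x):
        w *= like[i][xi]
    return w

# Brute force over all 2^3 configurations: partition function and marginals.
Z = sum(weight(x) for x in product((0, 1), repeat=3))
marginals = [sum(weight(x) for x in product((0, 1), repeat=3)
                 if x[i] == 1) / Z for i in range(3)]
```

Only the two codewords contribute, so $Z = 0.9\cdot 0.2\cdot 0.7 + 0.1\cdot 0.8\cdot 0.3 = 0.15$ and each bit has posterior $P_i(1) = 0.024/0.15 = 0.16$. For $n$ in the hundreds, this enumeration over $2^n$ configurations is hopeless, which is what motivates the message-passing approximations studied in this article.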
Specifically, as we will also discuss in this article, the fixed points of the belief propagation algorithm were shown to coincide with the stationary points of the Bethe free energy subject to consistency constraints. Here, the Bethe free energy is an approximation to the variational free energy. The Bethe approximation is only a special case of a more general class of approximations called the Kikuchi approximations (Kikuchi, 1951). A class of iterative message-passing algorithms was introduced in Yedidia et al. (2001), which attempt to find the stationary points of the Kikuchi free energy. Using such message-passing algorithms, one is expected to obtain better approximations to the marginals and the partition function than the ones given by loopy belief propagation. Building on the generalized region-based approach introduced in Pakzad and Anantharam (2002a, 2002b), in this article, we explore a wide range of ideas related to the Kikuchi approximation method. In particular, we discuss necessary conditions for uniqueness of the minimizers of the Kikuchi free energy, introduce graphical representations for the problem, and define
minimal graphical representations, which result in iterative solutions that are often significantly less complex than the algorithms discussed in Yedidia, Freeman, and Weiss (2001, 2002), and McEliece and Yildrim (2003). Furthermore, we will show that for generic problems, the Kikuchi approximation yields the exact marginals if and only if this minimal graphical representation of the Kikuchi problem is loop free.1 We will also address the more general problem of approximating the entropy of a product distribution in terms of the entropies of its marginals. Yedidia et al. (2002) independently developed a similar general framework based on region graphs. As such, there is some intersection between our work and theirs; in particular, the presentation in sections 2 and 3 should be compared with those in Yedidia et al. (2002), as well as Aji and McEliece (2001) and McEliece and Yildrim (2003). However, the major contributions of this work—the discussion of the graphical representations of Kikuchi problems, the convexity conditions, the necessary and sufficient conditions for exactness of the approximation, and the low-complexity extension of the generalized belief propagation algorithm—have not been previously addressed. We point out these similarities and differences throughout the article. Other researchers have developed various techniques based on related ideas, each with specific advantages over traditional loopy belief propagation. Yuille (2002) derives a double-loop, free-energy-minimizing algorithm that is guaranteed to converge, unlike loopy belief propagation. Welling and Teh (2001) formulate an algorithm of gradient-descent type, which is guaranteed to find a fixed point of the Bethe free energy. Wainwright and Jordan (2003) discuss convex relaxations of the variational principle, resulting in efficient algorithms that yield upper bounds to the partition function. The outline of this letter is as follows. 
We define the marginalization problem and set up some necessary notation in section 2. In section 3, we review the connection with methods in statistical physics, define the Kikuchi approximation method as one that approximates the desired marginals as the constrained fixed points of an appropriately defined free energy functional, and show that there are iterative message-passing algorithms whose fixed points correspond to the stationary points of the Kikuchi functional. Sufficient conditions for convexity of the Kikuchi functional are also provided. The restriction of these conditions to the Bethe case implies the well-known result on the convergence of loopy belief propagation on graphs with a single loop (see, e.g., Weiss, 2000). In section 4 we introduce the notion of graphical representations for a Kikuchi problem, establish the connection with junction trees, and prove results on the exactness of the Kikuchi approximation. In section 5 we derive
1 By Kikuchi problem, we mean the problem of minimizing the Kikuchi free energy, subject to some consistency constraints.
P. Pakzad and V. Anantharam
the generalized belief propagation (GBP) algorithm of Yedidia et al. (2001) on any arbitrary graphical representation of a Kikuchi problem. This is an extension of the results in Yedidia et al. (2002) and McEliece and Yildrim (2003). Interested readers are referred to Pakzad (2004) for further technical discussions of these topics. Some experimental results are reported in section 6, comparing the convergence properties of the low-complexity GBP algorithm derived here with the algorithms described in Yedidia et al. (2002) and McEliece and Yildrim (2003).

2 Problem Setup

Let x := (x_0, . . . , x_{N−1}), where for each i ∈ [N] := {0, . . . , N − 1}, x_i is a variable taking values in [q_i] := {0, . . . , q_i − 1}, with q_i ≥ 2. Let R be a collection of subsets of [N]; we call each r ∈ R a region. We assume that each variable index i ∈ [N] appears in at least one region r ∈ R. Associated with each region r ∈ R is a nonnegative kernel function, α_r(x_r), depending only on the variables that appear in r. Then the corresponding R-decomposable (Boltzmann) product distribution is defined as

    B(x) := (1/Z) ∏_{r∈R} α_r(x_r).        (2.1)
Here Z is the normalizing constant, known as the partition function. For a subset s ⊂ [N], we denote by B_s(x_s) := ∑_{x_{[N]\s}} B(x) the s-marginal of B(x).

Problem. The problem considered in this letter is that of finding one or more of the B_r(x_r)'s for r ∈ R, and/or the partition function Z.

The methods developed in this article to solve this problem are best described in the language of partially ordered sets or posets (see, e.g., Stanley, 1986). Specifically, the collection R of regions can be viewed as a poset with set inclusion as its partial ordering relation. This is because inclusion is reflexive (∀ r ∈ R, r ⊆ r), antisymmetric (r ⊆ s and s ⊆ r implies r = s), and transitive (r ⊆ s and s ⊆ t implies r ⊆ t). We write r ⊂ t to denote strict inclusion. We say t covers u in R, and write u ≺ t, if u, t ∈ R, u ⊂ t, and there is no v ∈ R s.t. u ⊂ v ⊂ t.

Definition 1. Given a poset R, its Hasse diagram G_R is a directed acyclic graph (DAG)2 whose vertices are the elements of R and whose edges correspond to cover relations in R; that is, an edge (t → u) exists in G_R iff u ≺ t.
2 Traditionally the Hasse diagram is drawn as an undirected graph, with an implied upward direction (see Stanley, 1986). This is indeed equivalent to a DAG, which will be the view used in this letter.
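As an illustration of definition 1, the cover relations of a small poset of regions can be computed by brute force. This is only a sketch (the example regions and function name are ours, not from the text), but it makes the Hasse-diagram construction concrete:

```python
def hasse_edges(regions):
    """Directed edges (t -> u) of the Hasse diagram G_R: u is covered by t,
    i.e. u is a strict subset of t with no region strictly in between."""
    R = [frozenset(r) for r in regions]
    return [(t, u) for t in R for u in R
            if u < t and not any(u < v < t for v in R)]

# A small poset: two maximal regions and their shared subset.
R = [{0, 1}, {1, 2}, {1}]
E = hasse_edges(R)
# Both maximal regions cover {1}: no region lies strictly between them.
assert len(E) == 2
assert all(u == frozenset({1}) for _, u in E)
```

For realistic collections of regions one would precompute the inclusion order, but the cubic scan above suffices for examples of the size used in this article.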
It follows that for any two distinct nodes r, s ∈ R, we have r ⊂ s iff there is a directed path from s to r in G_R. Throughout this letter, we will need the following definitions. Let R be a poset of subsets of [N] with the partial ordering of inclusion. For each subset r ⊆ [N], we define:

    Ancestors:    A(r) := {s ∈ R : r ⊂ s}
    Descendants:  D(r) := {s ∈ R : s ⊂ r}
    Forebears:    F(r) := {s ∈ R : r ⊆ s}

Further, for r ∈ R we define:

    Parents:   P(r) := {s ∈ R : r ≺ s}
    Children:  C(r) := {s ∈ R : s ≺ r}

Note that in each of these definitions, the collection of subsets being defined comprises regions, even though the argument r of A(r), D(r), and F(r) need not be a region itself. For a collection S of subsets of [N], we define F(S) := ∪_{s∈S} F(s). A subset T of R that is of the form T = F(S) for some S ⊆ R, viewed as a sub-poset of R (with the same partial ordering), is called an up-set of R. Finally, we define the depth of each region r ∈ R as

    d(r) := 0 if r is maximal;  d(r) := 1 + max_{s∈P(r)} d(s) otherwise.
3 Kikuchi Approximation Method

3.1 Connection with Statistical Physics. In the setup described in section 2, we can view x_i as the spin of the particle at position i in a system of N particles. Let b(x) denote a probability distribution on the configuration of spins, and consider a function E(x) called the energy function. Suppose the energy function is R-decomposable, that is, E(x) = ∑_{r∈R} E_r(x_r) for certain functions {E_r(x_r), r ∈ R}. In statistical physics, one defines the (Helmholtz) variational free energy as the following functional of the distribution:

    F(b(x)) := U(b(x)) − H(b(x)),        (3.1)

where U := ∑_x b(x)E(x) is the average energy and H := −∑_x b(x) log(b(x)) is the entropy of the system. We make the connection with the problem
formulation of section 2 by setting E_r(x_r) := −log(α_r(x_r)). We can then write

    E(x) = ∑_{r∈R} E_r(x_r)
         = −∑_{r∈R} log(α_r(x_r))
         = −log( (1/Z) ∏_{r∈R} α_r(x_r) ) − log(Z)
         = −log(B(x)) − log(Z),

where B(x) is the Boltzmann distribution of equation 2.1. Then the variational free energy can be rewritten as follows:

    F(b) = ∑_x b(x)(−log(B(x)) − log(Z)) + ∑_x b(x) log(b(x))
         = ∑_x b(x) log( b(x) / B(x) ) − log(Z)
         = KL(b||B) − log(Z),

where KL(b||B) is the Kullback-Leibler divergence between b(x) and B(x) (see, e.g., Cover & Thomas, 1991). It is then clear that F(b) is uniquely minimized when b(x) equals the Boltzmann distribution B(x) of equation 2.1, and we have

    F_0 := min_{b(x)} F(b(x)) = F(B(x)) = −log(Z).        (3.2)
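The identity F(b) = KL(b||B) − log(Z) of equation 3.2 is easy to verify numerically. The following sketch (a two-variable system with arbitrarily chosen kernels, purely for illustration) evaluates the variational free energy and checks that it attains −log(Z) at b = B and exceeds it elsewhere:

```python
import math
from itertools import product

# Two binary variables; regions {0} and {0,1} with arbitrary positive kernels.
alpha_0 = {(0,): 1.0, (1,): 2.0}
alpha_01 = {(0, 0): 1.0, (0, 1): 3.0, (1, 0): 2.0, (1, 1): 1.0}

def unnorm(x):
    return alpha_0[(x[0],)] * alpha_01[x]

states = list(product((0, 1), repeat=2))
Z = sum(unnorm(x) for x in states)
B = {x: unnorm(x) / Z for x in states}
E = {x: -math.log(unnorm(x)) for x in states}   # E(x) = -log B(x) - log Z

def F(b):
    """Variational free energy U - H of equation 3.1."""
    U = sum(b[x] * E[x] for x in states)
    H = -sum(b[x] * math.log(b[x]) for x in states if b[x] > 0)
    return U - H

uniform = {x: 0.25 for x in states}
assert abs(F(B) - (-math.log(Z))) < 1e-9   # F attains -log Z at b = B
assert F(uniform) >= F(B)                  # any other b gives a larger value
```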
As mentioned in section 1, equation 3.2 is of great interest in science and engineering. Physicists are interested in finding the log-partition function F_0, as a function of a temperature variable, which we have omitted here, since thermodynamical properties of physical systems can be derived from it. In estimation problems in engineering, one is interested in finding the marginals of the Boltzmann distribution B(x). This is called the probabilistic inference problem. However, equation 3.2, viewed as an optimization problem, does not prescribe a practical way for computing these quantities, as it involves minimization over the exponentially large domain of distributions b(x).

Given that the energy function is R-decomposable, to simplify the minimization problem 3.2, one may try to reformulate it in a way that is, loosely speaking, also R-decomposable. A natural way to do this is to try to represent the free energy as a functional of the R-marginals of the distribution b(x).

Definition 2. We will call a collection {b_r(x_r), r ∈ R} of probability functions, which may or may not be the marginals of a single distribution, a collection of
R-pseudomarginals. A collection of R-pseudomarginals that are also the marginals of a probability distribution b(x) are called the R-marginals of b(x).

Define Δ_R to be the family of the R-marginals of all probability distributions on x; that is, a collection {b_r(x_r), r ∈ R} belongs to Δ_R if and only if there exists a distribution b(x) s.t. ∀ r ∈ R, b_r(x_r) = ∑_{x_{[N]\r}} b(x). Then we can rewrite equation 3.2 as

    F_0 = min_{{b_r(x_r)}∈Δ_R} F_R({b_r(x_r)}),
    {b*_r(x_r)} = arg min_{{b_r(x_r)}∈Δ_R} F_R({b_r(x_r)}),        (3.3)

where

    F_R({b_r}) := min_{b(x) : {b_r} are the R-marginals of b} F(b(x)).
Since E(x) is R-decomposable, the average energy decomposes as

    U(b(x)) = ∑_{r∈R} ∑_{x_r} b_r(x_r) E_r(x_r),        (3.4)
where the b_r(x_r)'s are the marginals of the distribution b(x). In general, however, the entropy term in the free energy, equation 3.1, cannot be decomposed in terms of the R-marginals of b(x). The key component of the Kikuchi approximation method is to use an approximation of the form

    H(b(x)) ≈ ∑_{r∈R} k_r H_r(b_r(x_r)),        (3.5)

where H_r(b_r(x_r)) := −∑_{x_r} b_r(x_r) log(b_r(x_r)) is the regional entropy associated with a region r ∈ R, and the k_r's are suitable constants to be determined.

As discussed in Pakzad and Anantharam (2002a), we may view R ∪ {[N]} as a poset with the partial ordering of inclusion. For each r ∈ R, define c_r := −µ(r, [N]), where µ(·, ·) is the Möbius function. Then the Möbius inversion formula (see, e.g., Stanley, 1986) shows that the c_r's are defined uniquely by the following equations:

    c_r = 1 − ∑_{s∈A(r)} c_s,        (3.6)

where A(r) is the set of ancestors of r, as defined in section 2. Following Yedidia et al. (2001), we call the c_r's defined in this manner the overcounting factors. As it turns out, the c_r's are the natural choice for the constants {k_r}
in equation 3.5, as we show in proposition 1 (see Pakzad, 2004, for the proof).

Proposition 1. The only choice of factors {k_r} that can result in exactness of equation 3.5 for all R-decomposable Boltzmann distributions—that is, distributions b(x) := (1/Z) ∏_{r∈R} α_r(x_r) for all choices of {α_r, r ∈ R}—is the overcounting factors {c_r}.

In fact, the original choice of {k_r} in the Kikuchi approximation method (Kikuchi, 1951) was also {k_r} = {c_r}. It will also be shown that this exactness happens if and only if the collection R of regions is loop free in an appropriate sense, which will be defined in section 4.1. The Kikuchi approximation method, which will be defined more formally in section 3.2, proposes to solve a constrained minimization problem of the following form (cf. equation 3.3):

    {B_r(x_r)} ≈ {b*_r(x_r)} := arg min_{{b_r(x_r)}∈Δ_R^K} F_R^K({b_r(x_r)}).        (3.7)
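Equation 3.6 yields the overcounting factors by a simple top-down recursion over the poset, starting from the maximal regions (which have c_r = 1). A minimal sketch on a small hypothetical poset (function and region names are ours):

```python
def overcounting_factors(regions):
    """c_r = 1 - sum of c_s over strict supersets s of r (equation 3.6),
    computed top-down from the maximal regions."""
    R = [frozenset(r) for r in regions]
    c = {}
    for r in sorted(R, key=len, reverse=True):   # ancestors come first
        c[r] = 1 - sum(c[s] for s in R if r < s)
    return c

# Chain-like poset: maximal regions {0,1}, {1,2} and their intersection {1}.
c = overcounting_factors([{0, 1}, {1, 2}, {1}])
assert c[frozenset({0, 1})] == 1 and c[frozenset({1, 2})] == 1
assert c[frozenset({1})] == -1            # 1 - (1 + 1)

# Möbius identity: the c_r over the forebears F(s) sum to 1 for every region.
for s in c:
    assert sum(c[r] for r in c if s <= r) == 1
```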
Here F_R^K({b_r}), known as the Kikuchi free energy3 (see, e.g., Kikuchi, 1951), is defined as follows:

    F_R^K({b_r(x_r)}) := ∑_{r∈R} ∑_{x_r} b_r(x_r) E_r(x_r) + ∑_{r∈R} ∑_{x_r} c_r b_r(x_r) log(b_r(x_r)),        (3.8)

and Δ_R^K is a set of constraints to enforce consistency between the b_r's, defined as

    Δ_R^K := { {b_r(x_r), r ∈ R} : ∀ t, u ∈ R s.t. t ⊂ u, ∑_{x_{u\t}} b_u(x_u) = b_t(x_t),
               and ∀ u ∈ R, ∑_{x_u} b_u(x_u) = 1 }.        (3.9)
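For a loop-free collection of regions, evaluating the Kikuchi free energy of equation 3.8 at the true marginals recovers −log(Z) exactly, consistent with proposition 1. The following sketch (kernel values chosen arbitrarily) checks this for a three-variable chain with regions {0,1}, {1,2}, {1}, whose overcounting factors are 1, 1, and −1:

```python
import math
from itertools import product

# Chain of three binary variables; kernels on the maximal regions only.
a01 = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 0.5, (1, 1): 1.5}
a12 = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 3.0}

states = list(product((0, 1), repeat=3))
w = {x: a01[x[0], x[1]] * a12[x[1], x[2]] for x in states}
Z = sum(w.values())
B = {x: w[x] / Z for x in states}   # Boltzmann distribution of equation 2.1

# True marginals on the regions {0,1}, {1,2}, and {1}.
b01 = {(i, j): sum(B[i, j, k] for k in (0, 1)) for i in (0, 1) for j in (0, 1)}
b12 = {(j, k): sum(B[i, j, k] for i in (0, 1)) for j in (0, 1) for k in (0, 1)}
b1 = {(j,): sum(B[i, j, k] for i in (0, 1) for k in (0, 1)) for j in (0, 1)}

def H(p):
    """Regional entropy H_r of equation 3.5."""
    return -sum(v * math.log(v) for v in p.values() if v > 0)

# Kikuchi free energy (3.8) with c_{01} = c_{12} = 1 and c_{1} = -1;
# E_r = -log(alpha_r), and the kernel on the region {1} is identically 1.
U = -sum(b01[s] * math.log(a01[s]) for s in b01) \
    - sum(b12[s] * math.log(a12[s]) for s in b12)
FK = U - (H(b01) + H(b12) - H(b1))
assert abs(FK - (-math.log(Z))) < 1e-9   # exact on this loop-free poset
```

The exactness here reflects the chain structure: B factorizes so that H(B) = H_{01} + H_{12} − H_1 holds exactly, which is precisely the entropy approximation 3.5 with the overcounting factors as coefficients.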
Note that in general, the constraints of Δ_R^K are not enough to guarantee that every collection of pseudomarginals {b_r, r ∈ R} ∈ Δ_R^K is in fact the collection of the marginals of a single distribution function b(x). A collection may very well satisfy all the consistency constraints of equation 3.9 and yet not be the marginals of any distribution. In section 4 we discuss conditions on R that guarantee that the free energy F(b) can be viewed as a functional of the marginals of b(x), that is, {b_r, r ∈ R},
3 Yedidia et al. (2002) call this functional the region graph free energy.
and, as such a functional, equals the Kikuchi functional F_R^K. Further, we discuss conditions on R under which the constraint set Δ_R^K equals the family of R-marginals Δ_R.

3.2 Kikuchi Approximation Method. In this section we formulate the Kikuchi approximation method for solving the marginalization problem posed in section 2. We will further describe conditions on the collection of regions R that are expected to improve the quality of the approximations.

Let R0 be a collection of regions and {α_r^0(x_r), r ∈ R0} be a collection of kernel functions. We are interested in solving the marginalization problem posed in section 2 for R0 and {α_r^0}. Let R be another collection of regions obtained from R0 in such a way that4 ∀ r ∈ R0, ∃ r' ∈ R s.t. r ⊆ r'. Then one can always form a collection of R-kernels {α_r(x_r), r ∈ R} so that −∑_{r∈R} log(α_r(x_r)) = −∑_{r∈R0} log(α_r^0(x_r)) =: E(x).
Now for each r ∈ R, define β_r(x_r) := ∏_{s∈R: s⊆r} α_s(x_s). Then the Boltzmann distribution of equation 2.1 takes the following product forms:

    B(x) = (1/Z) ∏_{r∈R0} α_r^0(x_r) = (1/Z) ∏_{r∈R} α_r(x_r) = (1/Z) ∏_{r∈R} β_r(x_r)^{c_r},        (3.10)

where the last equality follows from the fact that, by equation 3.6, ∑_{r∈F(s)} c_r = 1 for all s ∈ R. Using approximations 3.8 and 3.9, we are now interested in solving the following:

Problem (Kikuchi Approximation).

    F* := min_{{b_r(x_r)}∈Δ_R^K} F_R^K({b_r(x_r)}) ≈ −log(Z)

and

    {b*_r(x_r)} := arg min_{{b_r(x_r)}∈Δ_R^K} F_R^K({b_r(x_r)}) ≈ {B_r(x_r)}.        (3.11)
Note now that by equation 3.3, if F(b(x)) = F_R^K({b_r(x_r)}) for all b(x), and Δ_R = Δ_R^K, then the minimizer collection {b*_r(x_r)} of equation 3.11 would correspond exactly to the collection of the marginals of the product function B(x) of equation 3.10. Hence, if the Kikuchi approximate free energy F_R^K({b_r}) is close to F(b), and the local consistency constraint set Δ_R^K is also close to Δ_R, the minimizers {b*_r} of equation 3.11 are expected to be close approximations to these marginals. Our focus in the rest of this article shifts to the above problem and to the relation between the solution to this problem and the original one in section 2. An important question, which we address in detail, is when the

4 Note, however, that the way this assignment is done can have an impact on the quality of the approximations to equation 3.3 provided by equation 3.11.
b*_r's are equal to the marginals B_r of the Boltzmann distribution. We also address in detail, in sections 4 and 5, message-passing algorithms on graphs that solve equation 3.11. These algorithms are similar in nature to the GBP of Yedidia et al. (2002) and the poset belief propagation of McEliece and Yildrim (2003), although, as we will show, the extensions presented here are often considerably more efficient than those algorithms.

The collection R of regions effectively specifies both the Kikuchi approximation, equation 3.8, and the constraint set, equation 3.9. It is also evident that equation 3.11 as an approximation method can be applied for any given F_R^K and Δ_R^K; better choices of R simply result in better approximations. Therefore, we can define the Kikuchi approximation method as the general class of constrained minimization problems given by equation 3.11, which are parameterized by the poset5 R of regions and the local kernel functions α_r(x_r) for each r ∈ R. It remains to specify which choices of R yield good approximations of the marginals.

In the remainder of this article, we consider only collections of regions R that have the same maximal regions as R0. Expansion of the maximal regions corresponds to clustering methods as discussed in Pearl (1988). The techniques developed here to derive low-complexity message-passing algorithms to solve the Kikuchi approximation problem can also be applied after clustering.

It certainly seems that minimization with more local consistency constraints on {b_r(x_r)} should result in better approximations, since the true marginals would satisfy all such constraints. At the same time, the entropy approximations of the type given in equation 3.5 are also expected to improve if more regions are included.
Therefore, one might conclude that for a given collection of maximal regions of R0, augmenting them by introducing additional subregions to form R, where the α_r's corresponding to the augmented subregions are taken to be 1, should improve the approximation (at the expense of increasing the complexity of the underlying minimization).

Let G be a labeled graph whose vertices are identified with subsets of [N]. We define the following connectivity conditions on G:

    A1: ∀ i ∈ [N], the subgraph of G consisting of the regions in F({i}) is connected.

Generalizing this, we can devise condition An on G, for each n ∈ {1, . . . , N}, as follows:

    An: ∀ s ⊂ [N] with |s| ≤ n, the subgraph of G on the regions in F(s) is connected.
5 Note that although inclusion is certainly the most natural partial ordering for R, the problem is well defined for any arbitrary partial ordering.
We say a poset R has property An iff its Hasse diagram G_R satisfies condition An. Note that in the context of the Kikuchi problem, equation 3.11, property An guarantees that the beliefs at all regions will be consistent at the level of any subset of the variables of cardinality up to n. It is therefore natural to require that R satisfies at least condition A1. We call a poset R satisfying An for all n a totally connected poset.

Inspired by Aji and McEliece (2001), one might insist that acceptable approximations of the entropy term 3.5 are those in which each variable x_i appears the same number of times on the two sides of the equality sign:

    B1: ∑_{r∈F({i})} c_r = 1 for each i = 0, . . . , N − 1.

We can extend this condition as follows:

    Bn: ∑_{r∈F(s)} c_r = 1 for each s ⊂ [N], |s| ≤ n, s.t. F(s) ≠ ∅.
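Conditions An and Bn can be verified exhaustively for small posets. A sketch (all helper names are ours) that checks the connectivity of F(s) within the Hasse diagram and the balance of the overcounting factors for every subset s:

```python
from itertools import combinations

def hasse_edges(regions):
    R = [frozenset(r) for r in regions]
    return [(t, u) for t in R for u in R
            if u < t and not any(u < v < t for v in R)]

def is_connected(vertices, edges):
    """BFS connectivity test on the undirected version of the subgraph."""
    vertices = set(vertices)
    if len(vertices) <= 1:
        return True
    adj = {v: set() for v in vertices}
    for t, u in edges:
        if t in vertices and u in vertices:
            adj[t].add(u)
            adj[u].add(t)
    start = next(iter(vertices))
    seen, stack = {start}, [start]
    while stack:
        for w in adj[stack.pop()] - seen:
            seen.add(w)
            stack.append(w)
    return seen == vertices

def totally_connected_and_balanced(regions, c, N):
    """Check conditions An and Bn of section 3.2 for every n up to N."""
    R = [frozenset(r) for r in regions]
    E = hasse_edges(R)
    for k in range(1, N + 1):
        for s in map(frozenset, combinations(range(N), k)):
            F = [r for r in R if s <= r]          # forebears F(s)
            if F and not is_connected(F, E):
                return False                      # condition Ak fails
            if F and sum(c[r] for r in F) != 1:
                return False                      # condition Bk fails
    return True

regions = [{0, 1}, {1, 2}, {1}]
c = {frozenset({0, 1}): 1, frozenset({1, 2}): 1, frozenset({1}): -1}
assert totally_connected_and_balanced(regions, c, N=3)
```

Dropping the region {1} from this example breaks condition A1: F({1}) would then consist of two incomparable maximal regions with no connecting path in the Hasse diagram.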
Conditions Bn are called the balance conditions, and we call a poset R satisfying Bn for all n a totally balanced poset. These conditions are expected to give progressively better approximate solutions, although they will not in general guarantee an exact solution.

The original cluster variation method of Kikuchi, as defined in Morita (1994) and Yedidia et al. (2001), in effect chooses R to be the smallest collection of regions including R0 that is closed under nonempty intersection of regions. The following proposition shows that the choice of R made in the cluster variation method is expected to give a reasonable Kikuchi approximation.

Proposition 2. Any collection of regions R that is closed under nonempty intersection of regions is totally connected and totally balanced.

Proof. See Pakzad (2004).

The special case when the Hasse diagram G_R has depth 2 (i.e., there are no distinct r, s, t ∈ R such that r ⊂ s ⊂ t) is called the Bethe case in this article. In this case, G_R can be viewed as a hypergraph in which the maximal regions of R are the vertices and the minimal regions are the hyperedges. Aji and McEliece (2001) consider the case when the hypergraph view of G_R is a graph, that is, the minimal elements of R are covered by at most two regions, so the hyperedges are in fact edges. It can be immediately verified that the junction graph condition given in Aji and McEliece (2001) is simply the intersection of conditions A1 and B1 above. On the other
Figure 1: Tanner graph of a linear code.
hand, the generalized notion of junction graphs defined in Yedidia et al. (2002) is essentially the same as our Bethe case, with the minor exception that junction graphs are defined to satisfy condition B1, whereas we do not a priori require that.

We now give an example to illustrate some of the notions defined in this section.

Example 3. Consider the (16, 8) linear code represented by the bipartite graph of Figure 1, where the top nodes correspond to parity checks and the bottom nodes correspond to symbol bits. This graph can be interpreted as the Hasse diagram of a two-level poset, where the regions associated with the bit nodes are {1}, {2}, . . . , {16}, respectively, and the region associated with each check node is the subset of {1, . . . , 16} corresponding to the bits that constitute that parity check. This is an example of the Bethe case, where the regions corresponding to the check nodes are maximal and those corresponding to the bit nodes are minimal. The overcounting factors corresponding to the check nodes are equal to 1, while those corresponding to the bit nodes equal one minus the number of check nodes connected to that bit node. In this case each bit node is connected to three check nodes, so the overcounting factors for all bit nodes equal 1 − 3 = −2. In this case, the GBP algorithm we discuss in section 5 will reduce to the original Gallager-Tanner decoding algorithm for LDPC codes (see Gallager, 1963; Tanner, 1981).

This poset has property A1 but not A2: for example, the regions corresponding to the first and third check nodes are {2, 6, 7, 14, 15, 16} and {2, 6, 7, 10, 12, 16}, respectively, both containing {2, 6}, but they are not connected through regions that contain {2, 6}. Also, this poset satisfies B1 but not B2: with s := {2, 6}, F(s) consists precisely of the first and third check-node regions, and ∑_{r∈F(s)} c_r = 1 + 1 = 2 ≠ 1.
On the other hand, one can throw in all the intersections of the check node regions to create the poset whose Hasse diagram is shown in Figure 2. Here, the nodes in the middle row correspond to the intersections of the check node regions, in the first row, to which they are connected; for example, the second node in the middle row corresponds to region {2, 6, 7, 16}, which is the intersection of the first and third check node regions.
Figure 2: Alternative poset for the linear code of example 3.
It is easy to verify that this poset is totally connected and totally balanced.

3.3 Lagrange Multipliers and Iterative Solutions. Lagrange's method can be used to solve the constrained minimization problem 3.11. We form the Lagrangian:

    L := ∑_{r∈R} ∑_{x_r} ( −b_r(x_r) log(α_r(x_r)) + c_r b_r(x_r) log(b_r(x_r)) )
       + ∑_{r∈R} ∑_{t≺r} ∑_{x_t} λ_{rt}(x_t) ( b_t(x_t) − ∑_{x_{r\t}} b_r(x_r) )
       + ∑_{r∈R} κ_r ( ∑_{x_r} b_r(x_r) − 1 ),        (3.12)
where the coefficients λ_{rt}(x_t) enforce the consistency constraints, the coefficients κ_r enforce the normalization constraints, and as before t ≺ r means that r covers t. Note that since the edge constraints of G_R are a sufficient representation of Δ_R^K, as discussed before, we need only define λ_{rt} for pairs r, t ∈ R with t ≺ r, that is, along the edges of G_R.

Setting the partial derivative ∂L/∂b_r(x_r) = 0 for each r ∈ R gives an equation for b_r(x_r) in terms of the λ_{ur}'s and λ_{rt}'s. The consistency constraints give update rules for each λ_{rt} in terms of the other λ multipliers. Once a set of messages m_{rt} (from r to t, for each edge (r → t) of G_R) has been defined in terms of the Lagrange multipliers λ_{rt}, these update rules define an iterative algorithm whose fixed points are the stationary points of the given constrained minimization problem. In section 5 we will give a detailed derivation for an important algorithm of this type, called the generalized belief propagation (GBP) algorithm (see also Yedidia et al., 2001), and we will also see that the belief propagation algorithm of Pearl (1988) is the restriction of the above algorithm in the Bethe case.

The version of the GBP algorithm presented here is an extension of the one originally introduced in Yedidia et al. (2001), in which we take advantage of certain systematic complexity-reducing transformations that will be
described in section 4. The resulting algorithm can be considerably less complex than the original GBP of Yedidia et al. (2001) and its generalizations in Yedidia et al. (2002) and McEliece and Yildrim (2003), constituting a major practical contribution of this work.

3.4 Convexity Conditions. In this section we describe our results regarding the convexity of the optimization problem 3.11, which we first reported in Pakzad and Anantharam (2002a). The Kikuchi free energy (see equation 3.8) constrained on {b_r} ∈ Δ_R^K is bounded below, and hence the constrained minimization problem of equation 3.11 always has a global minimum. Therefore, as discussed in section 3.3, the message-passing algorithms derived from the Lagrangian, equation 3.12, always possess at least one fixed point (see Yuille, 2002, for an algorithm that is guaranteed to find a minimum of F_R^K). The following result gives sufficient conditions on R for the problem of equation 3.11 to have precisely one minimum:

Theorem 1. The Kikuchi free energy functional, equation 3.8, is strictly convex on Δ_R^K (and hence the constrained minimization problem has a unique solution) if for all subcollections of regions S ⊆ R, the overcounting factors satisfy

    ∑_{s∈F(S)} c_s ≥ 0,        (3.13)
where, as defined in section 2, F(S) := ∪_{s∈S} F(s) = {r ∈ R : ∃ s ∈ S s.t. s ⊆ r} is the up-set of S.

Proof. See appendix A.

Corollary 1 (cf. theorem 3 in Aji & McEliece, 2001). In the Bethe case, the constrained minimization problem of equation 3.11 has a unique solution if the graphical representation G_R of R has at most one loop.

Proof. See Pakzad (2004).

Once we define a suitable notion of graphical representation for a general collection of regions in the next section, we will generalize the result of corollary 1.

4 Graphical Representations of the Kikuchi Approximation Problem

In this section, we define the notion of graphical representations for a Kikuchi approximation problem. The algorithms of the type discussed in section 3.3 can then be viewed as message-passing algorithms along the edges of such graphs. We will discuss this in detail in section 5.
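Returning to theorem 1, the sufficient condition 3.13 can be checked by enumerating all subcollections S ⊆ R, which is feasible for small collections of regions. A brute-force sketch (all names are ours):

```python
from itertools import combinations

def kikuchi_convexity_condition(regions, c):
    """Sufficient condition of theorem 1: for every nonempty subcollection S,
    the overcounting factors over the up-set F(S) sum to a nonnegative value."""
    R = list(regions)
    for k in range(1, len(R) + 1):
        for S in combinations(R, k):
            FS = {r for r in R if any(r >= s for s in S)}   # up-set of S
            if sum(c[r] for r in FS) < 0:
                return False
    return True

# Chain poset {0,1}, {1,2}, {1} with c = (1, 1, -1): every up-set contains
# at least as many +1 factors as -1 factors, so the condition holds.
regions = [frozenset({0, 1}), frozenset({1, 2}), frozenset({1})]
c = {regions[0]: 1, regions[1]: 1, regions[2]: -1}
assert kikuchi_convexity_condition(regions, c)
```

For this poset the Kikuchi functional is therefore strictly convex on the constraint set and the minimizer is unique.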
We will further introduce minimal graphical representations for a given collection R of regions, which are graphical representations with the fewest number of edges. Our motivation for introducing such minimal graphs is twofold. First, note that the results of section 3.4 refer to the uniqueness of the solution of the constrained minimization problem of equation 3.11. However, we are also interested in the conditions under which these solutions are the exact marginals of the product distribution of equation 3.10. As we will show in this section, the exactness of approximations obtained using equation 3.11 corresponds directly to the nonexistence of loops in the minimal graphs. In fact, we will show that in the loop-free case, this graph is a junction tree, and the message-passing algorithms of the type discussed in section 3.3 correspond to a variation of the junction tree algorithm. Second, as we will discuss in detail in section 5, the message-passing algorithms of the type mentioned in section 3.3 on the minimal graphs will be the most compact among all graphical representations of the same problem and can result in algorithms that are significantly less complex than the ones discussed in Yedidia et al. (2002) and McEliece and Yildrim (2003).

Let G be a directed acyclic graph with vertex set V(G) and edge set E(G). Parallel to our definitions in section 2, for each vertex r ∈ V(G), define:

    Ancestors:    A_G(r) := {s ∈ V : ∃ a directed path from s to r}
    Descendants:  D_G(r) := {s ∈ V : r ∈ A_G(s)}
    Parents:      P_G(r) := {s ∈ V : (s → r) ∈ E(G)}
    Children:     C_G(r) := {s ∈ V : (r → s) ∈ E(G)}
    Forebears:    F_G(r) := {r} ∪ A_G(r)

As in section 2, for a subset S ⊆ V(G), we define F_G(S) := ∪_{s∈S} F_G(s). Also define the depth of each vertex r ∈ V(G) as

    d_G(r) := 0 if P_G(r) = ∅;  d_G(r) := 1 + max_{s∈P_G(r)} d_G(s) otherwise.
Similarly we define the depth of each edge (t → u) of G as the depth of the child vertex u: dG (t → u) := dG (u). Note that given a poset R of regions, the above definitions for the Hasse diagram G R are consistent with the corresponding definitions for the poset from section 2 (i.e., for all r ∈ V(G R ), AG R (r ) = A(r ) and so on).
Back to the problem at hand, let R be a collection of regions as before, and let G be a directed acyclic graph whose nodes correspond to the regions r ∈ R. We will further assume that an edge (s → t) exists in G only if t ⊂ s.

Definition 3. The edge constraint for an edge (s → t) of G is defined as the following functional of the pseudomarginals {b_r, r ∈ R}:

    EC_(s→t)({b_r, r ∈ R}) := ∑_{x_{s\t}} b_s(x_s) − b_t(x_t).        (4.1)

When the arguments are clear from the context, we abbreviate this as EC_(s→t).

Definition 4. We call G a graphical representation of Δ_R^K if Δ_R^K can be represented using the edge constraints of G, that is,

    Δ_R^K = { {b_r(x_r), r ∈ R} : ∀ (s → t) ∈ E(G), EC_(s→t) = 0, and ∀ r ∈ R, ∑_{x_r} b_r(x_r) = 1 }.        (4.2)
As mentioned in the previous sections, a poset R is most naturally represented by its Hasse diagram G_R; a Hasse diagram uses the transitivity of the partial ordering to represent a poset in the most compact form. Note that our local consistency constraints also have the transitivity property: if (r → s), (s → t), and (r → t) are edges in the graph G, then

    (EC_(r→s) = 0) and (EC_(s→t) = 0)  =⇒  (EC_(r→t) = 0).

Therefore, the last edge (between "grandfather" and "grandchild") is redundant. This is why the Hasse diagram G_R is a graphical representation of Δ_R^K. On the other hand, local consistency relations satisfy a property other than transitivity, which can be used to further reduce the representation of Δ_R^K: suppose (r → s), (r → u), (s → t), and (u → t) are edges in the graph G; then

    (EC_(r→s) = 0) and (EC_(r→u) = 0) and (EC_(u→t) = 0)  =⇒  (EC_(s→t) = 0).
Figure 3: Equivalence of edges for removal. (A) In the Hasse diagram G_R depicted here, {(t → z), (u → z), (v → z)} and {(w → z), (x → z), (y → z)} are the EER classes of z. (B) A realization of S_R, the minimal graphical representation of R.
Then a graph obtained by removing the edge (s → t) of G is still a graphical representation of Δ_R^K, since the edge constraint of (s → t) is implied by the other edge constraints. We now make precise the reductions in the graphical representation that are implied by this property.

Definition 5. Edges (u → r) and (v → r) are said to be equivalent edges for removal (EER), denoted (u → r) ∼ (v → r), if there exists a sequence (t_0 → r), . . . , (t_k → r) of edges in G_R, with t_0 = u and t_k = v, with the property that ∀ i = 1, . . . , k, A(t_{i−1}) ∩ A(t_i) ≠ ∅.

It is easy to verify that this relation ∼ is reflexive, symmetrical, and transitive and is hence indeed an equivalence relation. Therefore, for each region r ∈ R, the collection of all the edges leading to r can be partitioned into equivalence classes of edges for removal (EER classes of region r). In the example of Figure 3A, (t → z) ∼ (u → z) ∼ (v → z) since A(t) ∩ A(u) = A(u) ∩ A(v) = {q}; also, (w → z) ∼ (x → z) ∼ (y → z) since A(w) ∩ A(x) = {r} and A(x) ∩ A(y) = {s}. But (v → z) ≁ (w → z), since no sequence of edges with the desired property can be found. It follows that {(t → z), (u → z), (v → z)} and {(w → z), (x → z), (y → z)} are the EER classes of z.

Definition 6. From each EER class, remove all but one (representative) edge from the Hasse diagram G_R. Denote the resulting graph by S_R.

Figure 3B shows one realization of S_R for the Hasse diagram of Figure 3A. Note that the graph S_R is not unique, since the representative edge of each equivalence class can be chosen arbitrarily. However, the number of edges in any choice of S_R is unique and equals the total number of EER classes over all regions. We further show in Pakzad (2004) that important
graph-theoretic properties, such as the number of loops and connected components, are invariant across all choices of S_R, and the subgraph of any choice of S_R on an up-set T of R is a version of S_T for T. Based on these justifications and the next proposition, we call S_R the minimal graphical representation, or the minimal graph, of R, and freely talk about the existence of loops in S_R as if S_R were unique. All results in the remainder of this article apply to every choice of S_R.

Proposition 3. S_R is indeed a minimal graphical representation of Δ_R^K; that is, a collection of pseudomarginals {b_r, r ∈ R} lies in Δ_R^K iff it satisfies all the edge constraints of S_R, and, further, removal of any of the edges of S_R results in a misrepresentation of Δ_R^K.

Proof. See Pakzad (2004).

As we have seen, to solve the constrained minimization problem, one forms the Lagrangian, introducing multipliers λ_{tr}(x_r) for each edge (t → r) of S_R. Since S_R has fewer edges than any other graphical representation of R, algorithms based on S_R require the fewest message updates per iteration.

4.1 Connection with Junction Trees. In this section we show that there is a close connection between the minimal graphical representation of a collection R of Kikuchi regions and the junction trees on R.

Definition 7. Let {r_1, . . . , r_M} be a collection of subsets of the index set [N]. A tree or forest G with vertices {r_1, . . . , r_M} is called a junction tree or forest if it satisfies condition A1 of section 3.2, that is, for each i ∈ [N], the subgraph consisting of all the vertices that contain i is connected.6

Although junction trees are traditionally defined as undirected trees, in the above definition we do not make a distinction between directed and undirected graphs; we call a directed graph a junction tree if replacing all the directed edges with undirected ones yields a junction tree in the usual sense. Let {r_1, . . . , r_M} be the maximal elements of R.
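The construction of S_R in definitions 5 and 6 can be sketched directly: partition the edges into each region (i.e., its parents) into EER classes by repeatedly merging parents whose ancestor sets intersect, and keep one representative edge per class. The example below is a hypothetical four-region poset of our own (not the one in Figure 3); its Hasse diagram has a loop that S_R breaks:

```python
def ancestors(R, r):
    return {s for s in R if r < s}

def eer_classes(R, edges, r):
    """EER classes of region r (definition 5): the transitive closure of the
    relation linking parents of r whose ancestor sets intersect."""
    parents = [t for (t, u) in edges if u == r]
    cls = {p: frozenset({p}) for p in parents}
    changed = True
    while changed:
        changed = False
        for u in parents:
            for v in parents:
                if cls[u] != cls[v] and ancestors(R, u) & ancestors(R, v):
                    merged = cls[u] | cls[v]
                    for w in merged:
                        cls[w] = merged
                    changed = True
    return set(cls.values())

def minimal_graph(R, edges):
    """One realization of S_R: one representative edge per EER class."""
    kept = []
    for r in {u for (_, u) in edges}:
        for group in eer_classes(R, edges, r):
            rep = min(group, key=lambda t: sorted(t))   # arbitrary but fixed
            kept.append((rep, r))
    return kept

# Hypothetical poset: q = {0,1,2,3} above t = {0,1} and u = {1,2},
# both of which cover z = {1}.
q, t, u, z = map(frozenset, ({0, 1, 2, 3}, {0, 1}, {1, 2}, {1}))
R = [q, t, u, z]
E = [(q, t), (q, u), (t, z), (u, z)]   # Hasse diagram, with a loop q-t-z-u
# (t -> z) ~ (u -> z) because A(t) and A(u) share the ancestor q, so one of
# the two edges into z is dropped: S_R has 3 edges and is loop free.
assert len(eer_classes(R, E, z)) == 1
assert len(minimal_graph(R, E)) == 3
```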
For the rest of this article, we assume that R is totally connected, that is, it has property An for all n = 1, ..., N. The following proposition links the concept of minimal graphs to the junction trees on the maximal elements of R:

Proposition 4. If SR has no loops, then it is a junction forest, and hence the maximal elements {r1, ..., rM} can be put on a junction tree. Conversely, if {r1, ..., rM} can be put on a junction tree, then SR has no loops.

6. Note that given that G has no loops, condition A1 implies An for all n.
Figure 4: (A) Hasse diagram of the poset R = {{1,2,3}, {2,3,4}, {3,4,5}, {2,3}, {3,4}, {3}}. (B) Junction tree on the maximal regions {1,2,3}, {2,3,4}, {3,4,5}.
Proof. See Pakzad (2004).

Recall that in section 3.2 we originally defined the Kikuchi problem in terms of a collection R0 of regions. We now define the concept of loopiness of a Kikuchi problem:

Definition 8. The original collection R0 of regions is called loop free if there exists a junction tree on its maximal elements and is called loopy if no such junction tree exists.

Now as before, let R be any totally connected poset of regions with the same maximal regions as R0. Then by proposition 4 and definition 8, R0 is loopy iff SR has a loop.

4.2 Necessary and Sufficient Conditions for Exactness of the Kikuchi Method. It is well known that the belief propagation algorithm converges to the exact marginals, in finite time, if the underlying graph is loop free (see, e.g., Pearl, 1988; Cowell et al., 1999). Likewise, the message-passing algorithms of the type discussed in section 3.3 will converge to yield the exact marginals if the Hasse diagram is loop free, and in fact the value of the Kikuchi functional equals the variational free energy. However, this is a rather weak result, since only very rarely will the Hasse diagram be loop free. In fact, many collections of regions that can be put on a junction tree result in Hasse diagrams that have loops. For example, the poset R = {{123}, {234}, {345}, {23}, {34}, {3}} has a loop in its Hasse diagram, as displayed in Figure 4A, but it can easily be handled as a junction tree, as in Figure 4B. Also, in the example of Figure 3, although the Hasse diagram has loops, the solution to the Kikuchi approximation problem equals the exact marginals. This is because not all the loops of G R are "bad" loops that cause trouble for the message-passing algorithm. In fact these "bad" loops are precisely the loops that cannot be broken when one creates SR. In the examples of both Figures 3 and 4, SR is loop free.
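The junction tree condition (A1) of definition 7 can be checked mechanically. The sketch below (illustrative code, not from the paper; the function name is mine) tests the running-intersection property for the maximal regions of the example poset above, arranged on a path as in Figure 4B:

```python
def is_junction_tree(nodes, edges):
    """Check the running-intersection property (condition A1): for every
    variable i, the nodes containing i must induce a connected subgraph."""
    variables = set().union(*nodes)
    for i in variables:
        containing = [n for n in nodes if i in n]
        if len(containing) <= 1:
            continue
        # Adjacency restricted to nodes containing i, then a connectivity check.
        adj = {n: set() for n in containing}
        for a, b in edges:
            if a in adj and b in adj:
                adj[a].add(b)
                adj[b].add(a)
        seen, stack = set(), [containing[0]]
        while stack:
            n = stack.pop()
            if n in seen:
                continue
            seen.add(n)
            stack.extend(adj[n])
        if len(seen) != len(containing):
            return False
    return True

# Maximal regions of the example poset, arranged on a path as in Figure 4B.
r1, r2, r3 = frozenset({1, 2, 3}), frozenset({2, 3, 4}), frozenset({3, 4, 5})
print(is_junction_tree([r1, r2, r3], [(r1, r2), (r2, r3)]))  # True: loop free
print(is_junction_tree([r1, r2, r3], [(r1, r3), (r3, r2)]))  # False: variable 2 split
```

Swapping the middle node breaks the property because the two regions containing variable 2 are no longer adjacent.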
One can therefore run a message-passing algorithm on SR, which converges to the solution of the Kikuchi approximation problem, equation 3.11, identical to the exact
marginals. In fact, in these examples, the Kikuchi free energy functional equals the variational free energy. The results in this subsection make these observations precise. The following theorem states sufficient conditions for the Kikuchi approximate free energy and the consistency constraint set of pseudomarginals to be exact:

Theorem 2 (Exactness of the Kikuchi approximations, $K_R$ and $F^K_R$).

A. $K_R = \mathcal{M}_R$ if $S_R$ is loop free, where $\mathcal{M}_R$ denotes the set of collections of pseudomarginals that are the true marginals of some distribution.

B. Let $b(x)$ be a distribution with marginals $b_r(x_r)$. Then $F^K_R(\{b_r, r \in R\}) = F(b)$ if $b(x) = \prod_{r \in R} b_r(x_r)^{c_r}$.

Proof. Part A: Suppose $S_R$ is loop free, and let $\{b_r(x_r), r \in R\} \in K_R$. Since $S_R$ is a junction forest, there is a distribution
$$b(x) := \frac{\prod_{r \in R} b_r(x_r)}{\prod_{(t \to u) \in E(S_R)} b_u(x_u)}$$
that marginalizes to $\{b_r(x_r), r \in R\}$. This is a well-known result on junction trees, which can be verified by marginalizing $b(x)$ in stages, from the leaves (of the undirected version of $S_R$) toward an arbitrary region $r$ as the root, where at each step, by local consistency, there will be cancellation. Therefore, $\{b_r(x_r), r \in R\} \in \mathcal{M}_R$, and so $K_R \subseteq \mathcal{M}_R$. But clearly $\mathcal{M}_R \subseteq K_R$, since the true marginals of any distribution are locally consistent. Therefore, $K_R = \mathcal{M}_R$.

Part B: From the discussion of section 3.1, $F^K_R(\{b_r, r \in R\}) = F(b)$ if the entropy approximation of equation 3.5 is exact. Now
$$b(x) = \prod_{r \in R} b_r(x_r)^{c_r}$$
$$\Longrightarrow \mathrm{KL}\Big(b(x) \,\Big\|\, \prod_{r \in R} b_r(x_r)^{c_r}\Big) = 0$$
$$\Longrightarrow \sum_x b(x)\Big(\log b(x) - \log \prod_{r \in R} b_r(x_r)^{c_r}\Big) = 0$$
$$\Longrightarrow \sum_x b(x) \log b(x) - \sum_{r \in R} c_r \sum_x b(x) \log b_r(x_r) = 0$$
$$\Longrightarrow \sum_x b(x) \log b(x) = \sum_{r \in R} c_r \sum_{x_r} b_r(x_r) \log b_r(x_r)$$
$$\Longrightarrow F^K_R(\{b_r(x_r), r \in R\}) = F(b(x)).$$
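Part B can be checked numerically on a small example. In the sketch below (illustrative code with names of my choosing), a three-variable Markov chain factors as $b = b_{12} b_{23} / b_2$, that is, overcounting factors $c_{12} = c_{23} = 1$ and $c_2 = -1$, so the combination $\sum_r c_r H_r$ matches $H(b)$ exactly; for a generic loopy pairwise model on a triangle, the analogous Bethe combination does not:

```python
import numpy as np

rng = np.random.default_rng(0)

def H(p):
    """Shannon entropy of an array of probabilities (natural log)."""
    p = p.ravel()
    p = p[p > 0]
    return -np.sum(p * np.log(p))

# Chain: b(x1,x2,x3) = b12(x1,x2) * b23(x2,x3) / b2(x2); c12 = c23 = 1, c2 = -1.
p12 = rng.random((2, 2)); p12 /= p12.sum()
p3_given_2 = rng.random((2, 2)); p3_given_2 /= p3_given_2.sum(axis=1, keepdims=True)
b = np.einsum('ij,jk->ijk', p12, p3_given_2)      # joint over (x1, x2, x3)
b12, b23, b2 = b.sum(axis=2), b.sum(axis=0), b.sum(axis=(0, 2))
print(np.isclose(H(b), H(b12) + H(b23) - H(b2)))  # True: exact on the chain

# Triangle: generic loopy pairwise model; the Bethe combination is inexact.
a12, a23, a13 = (rng.random((2, 2)) for _ in range(3))
B = np.einsum('ij,jk,ik->ijk', a12, a23, a13)
B /= B.sum()
bethe = (H(B.sum(axis=2)) + H(B.sum(axis=0)) + H(B.sum(axis=1))
         - H(B.sum(axis=(1, 2))) - H(B.sum(axis=(0, 2))) - H(B.sum(axis=(0, 1))))
print(abs(H(B) - bethe) > 1e-6)                   # typically True: inexact
```

The triangle case previews theorem 4 below: for a loopy collection, exactness holds only on a measure-zero set of kernels.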
Corollary 2. If $R_0$ is loop free, then the constrained minimization problem of equation 3.11 has a unique solution. Further, the solution $\{b^*_r, r \in R\}$ is the exact marginal of the product function, and the minimum free energy equals the log-partition function, that is, $b^*_r(x_r) = B_r(x_r)$ and $F^* = -\log(Z)$.7

Proof. See Pakzad (2004).

The above results show that a sufficient condition for the exactness of the solutions of the Kikuchi approximation method of equation 3.11 is that $S_R$ be loop free. In the sequel, we address the necessary conditions for exactness. We first pose the following more abstract question about entropy approximations: Under what conditions is an entropy approximation in the form of equation 3.5 exact for all R-decomposable distributions? The following theorem attempts to answer this question.

Theorem 3. Let R be a totally connected poset of subsets of [N], and let $\{k_r, r \in R\}$ be a collection of constants. Then the following are equivalent:

1. $H(B) = \sum_{r \in R} k_r H_r(B_r)$ for all R-decomposable distributions $B(x)$.

2. $\sum_{r \in \cup_{i \in s} F(i)} k_r = 1$ for all $s \subseteq [N]$ such that $\cup_{i \in s} F(i)$ is connected in $S_R$.

3. $\sum_{r \in \cup_{s \in S} F(s)} k_r = 1$ for all $S \subseteq 2^{[N]}$ such that $\cup_{s \in S} F(s)$ is connected in $S_R$.

Further, given the poset R, there exists a collection $\{k_r, r \in R\}$ satisfying 1, 2, and 3 above iff $S_R$, the minimal graph of R, is loop free. If such a collection $\{k_r, r \in R\}$ exists, then it is unique and equals the (Möbius) overcounting factors $\{c_r, r \in R\}$.

Proof. See Pakzad (2004).

An immediate consequence of this theorem is, as stated in proposition 1, that the only collection of factors for which the Kikuchi approximate free energy is exact is the (Möbius) overcounting factors, $\{c_r, r \in R\}$, as defined in section 3.1. As mentioned in section 3.2, given a product distribution with kernels $\{\alpha^0_r(x_r), r \in R_0\}$, where the regions in $R_0$ cannot be put on a junction tree, it is expected that expanding the collection $R_0$ by adding subsets of $r \in R_0$ as further regions would improve the quality of the approximation obtained by iterative algorithms such as GBP. The following result, however, shows that it is improbable that exact solutions will be obtained.

7. In fact, when $R_0$ is loop free, iterative algorithms such as GBP, which we discuss in section 5, converge in finite time to the unique solutions $b^*_r$.
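The overcounting factors $\{c_r\}$ of section 3.1 follow the standard cluster-variation recursion $c_r = 1 - \sum_{s \in R,\, s \supsetneq r} c_s$. A small sketch (illustrative code; the function name is mine) computes them for the example poset of Figure 4:

```python
def overcounting_factors(regions):
    """Möbius overcounting factors: c_r = 1 - sum of c_s over strict supersets
    s of r in R. Process regions from largest to smallest so that all strict
    supersets are handled before their subsets."""
    c = {}
    for r in sorted(regions, key=len, reverse=True):
        supersets = [s for s in regions if r < s]  # strict supersets in R
        c[r] = 1 - sum(c[s] for s in supersets)
    return c

R = [frozenset(s) for s in [{1, 2, 3}, {2, 3, 4}, {3, 4, 5}, {2, 3}, {3, 4}, {3}]]
c = overcounting_factors(R)
# Maximal regions get c = 1, the separators {2,3} and {3,4} get c = -1,
# and the bottom region {3} gets c = 1 - (1 + 1 + 1 - 1 - 1) = 0.
print({tuple(sorted(r)): v for r, v in c.items()})
```

Note that every variable is then counted exactly once: for variable 3, the factors of the regions containing it sum to 1 + 1 + 1 - 1 - 1 + 0 = 1.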
Theorem 4. If $R_0$ is loopy, then except on a set of measure zero of choices of kernels $\{\alpha^0_r(x_r), r \in R_0\}$, the Kikuchi approximation method of equation 3.11 will not produce exact results for both $F_0$ and $\{B_r\}$.

Proof. Let T denote the set of $R_0$-kernels $\{\alpha^0_r(x_r)\}$ for which the Kikuchi entropy approximation of equation 3.5 is exact for the Boltzmann distribution. Specifically, define
$$T := \Big\{ \{\alpha^0_r(x_r), r \in R_0\} : \sum_x B(x) \log B(x) = \sum_{r \in R} c_r \sum_{x_r} B_r(x_r) \log B_r(x_r) \Big\}, \tag{4.3}$$
where, as in section 3.2, $B(x) = \frac{1}{Z} \prod_{r \in R_0} \alpha^0_r(x_r)$ and the $B_r$'s are its marginals. For each region $r \in R_0$, define $q_r := \prod_{i \in r} q_i$ to be the cardinality of the range of $x_r$, and let $l := \sum_{r \in R_0} q_r$. Let $f : \mathbb{R}^l_+ \to \mathbb{R}$ be the error function in approximating the entropy as in equation 3.5, that is,
$$f(\{\alpha^0_r(x_r)\}) := \sum_x B(x) \log B(x) - \sum_{r \in R} c_r \sum_{x_r} B_r(x_r) \log B_r(x_r)$$
$$= \sum_x \frac{\prod_{s \in R} \alpha^0_s(x_s)}{\sum_x \prod_{s \in R} \alpha^0_s(x_s)} \log \frac{\prod_{s \in R} \alpha^0_s(x_s)}{\sum_x \prod_{s \in R} \alpha^0_s(x_s)} - \sum_{r \in R} c_r \sum_{x_r} \frac{\sum_{x_{[N] \setminus r}} \prod_{s \in R} \alpha^0_s(x_s)}{\sum_x \prod_{s \in R} \alpha^0_s(x_s)} \log \frac{\sum_{x_{[N] \setminus r}} \prod_{s \in R} \alpha^0_s(x_s)}{\sum_x \prod_{s \in R} \alpha^0_s(x_s)}.$$
Then $T = f^{-1}(0)$. The function $f(\cdot)$ above is clearly analytic on its domain, $\mathbb{R}^l_+$. Then, as demonstrated in Federer (1969), either $T = \mathbb{R}^l_+$ or $\mu(T) = 0$, where $\mu(\cdot)$ is the Lebesgue measure. The first alternative requires that f be identically zero on $\mathbb{R}^l_+$. But from theorem 3, it is evident that if $R_0$ is loopy, then the entropy approximation cannot be exact. This completes the proof.

A stronger version of this result can also be derived. In Pakzad (2004), we show that the conditional measure of the set of kernels $\{\alpha^0_r(x_r), r \in R_0\}$, conditioned to be consistent with a predefined, weakly regulated pattern of zeros of the product distribution, for which the Kikuchi approximation method of equation 3.11 is exact, is zero. We conclude this section with a generalization of corollary 1 on sufficient conditions for convexity of the Kikuchi free energy. In particular, we make a
(purely set-theoretic) connection between the number of loops of SR and the sufficient conditions of theorem 5 for convexity of the Kikuchi free energy.

Definition 9. We call a collection R of Kikuchi regions normal if for all $S \subseteq R$, there exists a largest region (with respect to set inclusion) $m \in \bigcap_{s \in S} D(s)$, possibly the empty set, that contains any other region $u \in \bigcap_{s \in S} D(s)$.

Note that the notion of normality defined here puts minor restrictions on the choices of collections R of regions, and in most cases of interest, such as the cluster variation method (Yedidia et al., 2001) and the junction graph method (Aji & McEliece, 2001), the collections of regions are in fact normal. Normality simply ensures that the minimal graph will contain no small, bow-tie-shaped loops with precisely two maximal and two minimal elements.

Theorem 5. Let R be a normal collection of Kikuchi regions. Then the Kikuchi free energy functional $F^K_R(\{b_r\})$ is strictly convex if SR has zero or one loop. In particular, the Kikuchi free energy for the cluster variation method of Yedidia et al. (2001) is strictly convex if SR has zero or one loop.

Proof. See Pakzad (2004).

5 Generalized Belief Propagation Algorithm

We are now in a position to describe a class of iterative message-passing algorithms that try to solve the constrained minimization problem, equation 3.11. These algorithms can be viewed as extensions of the GBP algorithm of Yedidia et al. (2001, 2002) and the poset-BP algorithm of McEliece and Yildrim (2003). More precisely, the earlier algorithms are defined and derived only on the full Hasse diagram. The results derived in the earlier sections of this article on minimal graphs allow us to propose algorithms for solving equation 3.11 that are often substantially less complex than the ones proposed in Yedidia et al. (2002) and McEliece and Yildrim (2003), and that appear to have comparable convergence performance in the examples we have investigated and report in section 6.
Let R be the collection of regions for a Kikuchi approximation problem. In section 3.3 we described how the Lagrange multipliers method can be used to obtain an iterative, message-passing algorithm with fixed points that coincide with the stationary points of equation 3.11. Now let G be any graphical representation of KR as defined in section 4. Then the Lagrangian of equation 3.12 can be rewritten in terms of the edge constraints of G, in which case the “messages” of the resulting iterative algorithm can be identified precisely with the edges of G. This means that for each graphical representation of KR , there is a distinct message-passing algorithm along the edges of that graph. Clearly all such algorithms have
the same set of fixed points, although the dynamics of each algorithm may be different.

So far we have represented the constraint set $K_R$ using the edge constraints defined in section 4. Motivated by an observation made by Yedidia et al. (2001, 2002), we introduce an alternative but essentially equivalent set of edge constraints. We will then be able to use this alternative representation of the constraint set $K_R$ to derive an alternative message-passing algorithm to solve equation 3.11.

Definition 10. The YFW edge constraint8 for an edge (s → t) of G is defined as the following functional of the pseudomarginals $\{b_r, r \in R'\}$:
$$EC_{(s \to t)}(\{b_r, r \in R'\}) := \sum_{u \in F(t) \setminus F(s)} c_u \sum_{x_{u \setminus t}} b_u(x_u), \tag{5.1}$$
where $R' := \{r \in R : c_r \neq 0\}$ is the collection of regions with nonzero overcounting factors. Note that since $c_u = 0$ for $u \notin R'$, $EC_{(s \to t)}$ is a function of only $\{b_r, r \in R'\}$, as claimed in equation 5.1. When the arguments are clear from the context, we abbreviate these edge constraints as $EC_{(s \to t)}$.

8. We call these constraints YFW after Yedidia, Freeman, and Weiss.

Proposition 5. The collection of pseudomarginals represented by the YFW edge constraints is equal to the restriction of $K_R$ to $R'$. Namely, if we define
$$\widetilde{K}_{R'} := \Big\{ \{b_r(x_r), r \in R'\} : \forall (s \to t) \in E(G),\ EC_{(s \to t)}(\{b_r, r \in R'\}) = 0 \ \text{and}\ \forall r \in R',\ \sum_{x_r} b_r(x_r) = 1 \Big\}, \tag{5.2}$$
then $\widetilde{K}_{R'} = K_R|_{R'}$, where $K_R|_{R'}$ is the restriction of $K_R$ to $R'$; that is, a collection $\{b_r(x_r), r \in R'\}$ lies in $K_R|_{R'}$ iff it has an extension $\{b_r(x_r), r \in R\} \in K_R$.

Proof. See appendix B.

Remark. Note from equations 3.8 and 3.10 that the Kikuchi free energy can be rewritten as follows:
$$F^K_R(\{b_r(x_r)\}) = \sum_{r \in R} \sum_{x_r} \big( -c_r b_r(x_r) \log(\beta_r(x_r)) + c_r b_r(x_r) \log(b_r(x_r)) \big). \tag{5.3}$$
From equation 5.3, it is apparent that $F^K_R(\{b_r, r \in R\})$ depends only on $\{b_r, r \in R'\}$, since the terms involving the pseudomarginals corresponding to the regions with zero overcounting factors are multiplied by zero. Therefore, equation 3.11 can be rewritten as
$$\min_{\{b_r, r \in R\} \in K_R} F^K_R(\{b_r, r \in R\}) = \min_{\{b_r, r \in R'\} \in \widetilde{K}_{R'}} F^K_R(\{b_r, r \in R'\})$$
and
$$\arg \min_{\{b_r, r \in R\} \in K_R} F^K_R(\{b_r, r \in R\}) \Big|_{R'} = \arg \min_{\{b_r, r \in R'\} \in \widetilde{K}_{R'}} F^K_R(\{b_r, r \in R'\}). \tag{5.4}$$
In other words, the central constrained minimization problem of equation 3.11 is reduced to the following: $\min_{\{b_r, r \in R'\} \in \widetilde{K}_{R'}} F^K_R(\{b_r, r \in R'\})$. This observation is also made in Yedidia et al. (2002). We will now write the Lagrangian for equation 5.4 using the YFW edge constraints:
$$L := \sum_{r \in R'} \sum_{x_r} c_r b_r(x_r) \log \frac{b_r(x_r)}{\beta_r(x_r)} + \sum_{(r \to t) \in E(G)} \sum_{x_t} \lambda_{rt}(x_t) \sum_{u \in F(t) \setminus F(r)} c_u \sum_{x_{u \setminus t}} b_u(x_u) + \sum_{r \in R'} \kappa_r \Big( \sum_{x_r} b_r(x_r) - 1 \Big). \tag{5.5}$$
Setting the partial derivative $\partial L / \partial b_r(x_r) = 0$ for each region r and each value of $x_r$, and identifying messages, for each edge (p → r), as $m_{pr}(x_r) := e^{-\lambda_{pr}(x_r)}$, we obtain
$$b_r(x_r) = k\, \beta_r(x_r) \prod_{p \in P_G(r)} m_{pr}(x_r) \prod_{d \in D(r)} \prod_{p' \in P_G(d) \setminus (\{r\} \cup D(r))} m_{p'd}(x_d), \tag{5.6}$$
where the constant k is chosen to normalize $b_r$ so that it sums to 1, and message $m_{pr}$ is updated to satisfy the original edge constraint $\sum_{x_{p \setminus r}} b_p(x_p) - b_r(x_r) = 0$:
$$m_{pr}(x_r) = k\, \frac{\sum_{x_{p \setminus r}} \beta_p(x_p) \prod_{s' \in P_G(p)} m_{s'p}(x_p) \prod_{d \in D(p)} \prod_{s' \in P_G(d) \setminus (\{p\} \cup D(p))} m_{s'd}(x_d)}{\beta_r(x_r) \prod_{s' \in P_G(r) \setminus \{p\}} m_{s'r}(x_r) \prod_{d \in D(r)} \prod_{p' \in P_G(d) \setminus (\{r\} \cup D(r))} m_{p'd}(x_d)}, \tag{5.7}$$
where k is any convenient constant. Note that the common terms in the numerator and denominator of equation 5.7 can be cancelled, but to avoid even more complex expressions, we do not write the explicit form here.
The fixed points of equations 5.6 and 5.7 set all the derivatives of the Lagrangian equal to zero, and hence are precisely the stationary points of the Kikuchi free energy F RK subject to constraint set KR . The algorithm of equations 5.6 and 5.7 is defined on any graphical representation of KR and has as many messages as the edges of the underlying graph. From the results of section 4, using SR , the minimal graphical representation, yields the least complex such algorithm in this sense. In fact, in most cases, the algorithm on SR is substantially less complex than the full version implemented on the Hasse diagram G R . Remark. Note that the GBP algorithm presented here is not fundamentally different from its original form as presented in Yedidia et al. (2001, 2002); in fact, McEliece and Yildrim (2003) described an algorithm called Poset-BP, which is equivalent to the restriction of our results when G is the Hasse diagram. However, it is important to note that our results show that in general, there are iterative algorithms with strictly fewer messages, and potentially simpler dynamics, that have the same fixed points. In particular, the messages corresponding to the edges of the Hasse diagram that are removed in forming a more compact graphical representation can be set to 1 in the entire algorithm. In addition, the update rules for the remaining messages are also less complex, since they depend on fewer edges. It is also worth mentioning that the proofs of the correctness for the GBP and Poset-BP algorithms given in Yedidia et al. (2002) and McEliece and Yildrim (2003) both presume that the poset is first simplified by removing the regions with zero overcounting factors. We note, however, that removing the regions with zero overcounting factors can in general alter the problem. This is because a region with zero overcounting factor may still serve to ensure consistency between the pseudomarginals at other regions (see, for example, the poset in Figure 5). 
We have avoided this restriction by proving the results for a general poset.

Figure 5: [Hasse diagram with regions {1,5}, {1,2,3}, {1,3,4}, {1,2,4}, {1,3}, {1,2}, {1,4}, {1}.] The overcounting factor for region 1 is zero, but it cannot be removed.

Consider now the restriction of the above algorithm to the Bethe case, that is, when each region in R is either maximal or minimal with respect to inclusion. Then P(r) = ∅ for a maximal region r and D(s) = ∅ for a minimal region s. To demonstrate the connection with the belief propagation algorithm, in
addition to the messages $m_{pr}(x_r)$ for $r \subset p$, we also define messages from a child to a parent as follows:
$$n_{rp}(x_r) := \beta_r(x_r) \prod_{s \in P(r) \setminus \{p\}} m_{sr}(x_r). \tag{5.8}$$
Then, by equation 5.6, for a maximal region $p \in R$,
$$b_p(x_p) = k\, \beta_p(x_p) \prod_{d \in D(p)} n_{dp}(x_d). \tag{5.9}$$
Similarly, for a minimal region $r \in R$,
$$b_r(x_r) = k\, \beta_r(x_r) \prod_{d \in P(r)} m_{dr}(x_r). \tag{5.10}$$
The update equation 5.7 for messages $m_{pr}$ can then be rewritten as
$$m_{pr}(x_r) = k\, \frac{\sum_{x_{p \setminus r}} \beta_p(x_p) \prod_{d \in D(p)} n_{dp}(x_d)}{\beta_r(x_r)\, n_{rp}(x_r)} = k \sum_{x_{p \setminus r}} \frac{\beta_p(x_p)}{\beta_r(x_r)} \prod_{d \in D(p) \setminus \{r\}} n_{dp}(x_d). \tag{5.11}$$
It is now easy to see that equations 5.8 to 5.11 precisely define the conventional belief propagation algorithm of Pearl (1988) applied on $G_R$.
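The equivalence with ordinary belief propagation can be seen on a tiny example (an illustrative sketch, not the paper's code): two maximal regions {1,2} and {2,3} over binary variables share the minimal region {2}. Since this region graph is loop free, the belief at the shared variable produced by factor-to-variable messages must equal the exact marginal:

```python
import numpy as np

rng = np.random.default_rng(1)
f12 = rng.random((2, 2))   # kernel beta on maximal region {1,2}
f23 = rng.random((2, 2))   # kernel beta on maximal region {2,3}

# Exact marginal of x2 under the product distribution.
joint = np.einsum('ij,jk->ijk', f12, f23)
joint /= joint.sum()
exact_m2 = joint.sum(axis=(0, 2))

# BP: variable 2 receives one message from each factor; the messages from the
# leaf variables 1 and 3 are uniform and are therefore omitted.
m_f12_to_2 = f12.sum(axis=0)   # marginalize out x1
m_f23_to_2 = f23.sum(axis=1)   # marginalize out x3
b2 = m_f12_to_2 * m_f23_to_2
b2 /= b2.sum()
print(np.allclose(b2, exact_m2))  # True
```

On a loopy region graph the same updates would be iterated to a fixed point instead of giving the answer in one pass.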
6 Experimental Results

In the previous section, we proved that the fixed points of GBP algorithms on any graphical representation of a poset R coincide with the solutions to the Kikuchi approximation problem of section 3.2. We further argued that the algorithm on the minimal graph SR has the smallest complexity per iteration. Two important questions have not been addressed in this article so far: How close are the Kikuchi approximations to the true marginals? How does the convergence behavior of the GBP algorithm on the minimal graph SR compare to that on the full Hasse graph G R? In this section, we address these questions with some experiments. We considered the three simple loopy posets below. In each case, all the variables were binary. For each run of the experiment for a given poset, we first generated a random collection of kernels {αr (xr )}, where each value αr (xr ) was chosen independently and uniformly in the interval [0, 1]. Next we calculated the product distribution B(x) = ∏r ∈R αr (xr ) together with its true
marginals $B_r(x_r)$. The GBP algorithm of section 5 was then run on each of the two graphs $G_R$ and $S_R$ for that poset. Further, two different schedules were used to update the messages for each algorithm: parallel and serial. With the parallel schedule, all messages were updated together at each iteration. With the serial schedule, the messages were updated one after another, in an order chosen so as to minimize the number of edges that are updated before their requisite set of edges has been updated. Each message is updated exactly once during each iteration. To ensure convergence of some algorithms, we used damping in the update rule for the messages. The quantity w reported for each algorithm is the damping factor. In particular, we used $m^{n+1}_{pr}(x_r) = w\, F(\{m^n\}) + (1-w)\, m^n_{pr}(x_r)$, where $m^n_{pr}$ is the message at iteration n and $F(\{m^n\})$ is the pure update rule of equation 5.7. The value of w is always between 0 and 1, with w = 1 corresponding to equation 5.7. For each case, we decreased w gradually to ensure that the algorithm converged.

For each poset, we report the savings in complexity per iteration of GBP on the minimal graph compared to that on the Hasse graph. To compute these savings, we calculated the total arithmetic complexity, that is, the number of additions, multiplications, and divisions involved in the update rules of equation 5.7 for both algorithms. Note that this is not simply the fraction of edges that are removed in forming the minimal graph, since the update rules for the messages that remain in the minimal graph are less complex than the ones on the Hasse graph.

To summarize the performance of each algorithm, at each iteration we calculated a measure of distance between the beliefs $\{b_r\}$ and the true marginals $\{B_r\}$. We define the distance function
$$D(b_r, B_r) := \max_{x_r} \frac{|b_r(x_r) - B_r(x_r)|}{B_r(x_r)}$$
as the measure of distance from the belief $b_r$ to the marginal $B_r$; this is a normalized maximum point-wise difference between the two distributions. The closer D is to 0, the closer the belief $b_r(x_r)$ is to the true marginal $B_r(x_r)$ at all configurations of $x_r$. At each iteration we then calculate the maximum distance $\max_{r \in R} D(b_r, B_r)$ and the mean distance $\frac{1}{|R|} \sum_{r \in R} D(b_r, B_r)$. For each poset, the averages of these quantities over 200 runs are reported.

Poset 1. The Hasse diagram of this poset has one loop, but the minimal graph is loop free (see Figure 6). There is a saving of 35.7% per iteration of GBP on the minimal graph compared to that on the Hasse graph. As expected, the Kikuchi approximations coincide with the true marginals in this loop-free case. The serial algorithms converge to the fixed points after one iteration, because we use an optimal schedule for activating the messages. The parallel algorithm on the minimal graph takes four iterations (equal to the girth of the graph). The parallel algorithm on the Hasse graph requires damping and converges much more slowly. Note that in this case, the algorithm on the minimal graph both gives better performance at each iteration and has less complexity per iteration (see Figure 7).
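The damping rule and the distance measure just described are straightforward to implement; the following sketch (illustrative names, not the authors' code) shows both:

```python
import numpy as np

def damped_update(m_old, m_proposed, w):
    """Damped message update: m^(n+1) = w * F(m^n) + (1 - w) * m^n."""
    return w * m_proposed + (1 - w) * m_old

def distance_D(b_r, B_r):
    """Normalized maximum point-wise distance from a belief to a marginal."""
    return np.max(np.abs(b_r - B_r) / B_r)

B = np.array([0.5, 0.5])   # true marginal
b = np.array([0.6, 0.4])   # belief
print(distance_D(b, B))    # approximately 0.2
print(damped_update(1.0, 2.0, 0.5))  # 1.5: halfway between old and proposed
```

With w = 1 the damped update reduces to the pure update rule, and smaller w slows the dynamics to suppress oscillations.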
Figure 6: Poset 1. (A) Hasse graph G R. (B) Minimal graph SR.

Figure 7: GBP performance on Poset 1. (A) Mean distance. (B) Maximum distance. (The curves compare the serial and parallel schedules on G R and SR over 10 iterations, with the damping factor w for each run indicated in the legend.)
Poset 2. The Hasse diagram of this poset has five loops; all but one of these loops are broken in the minimal graph (see Figure 8). There is a saving of 46.2% per iteration of GBP on the minimal graph compared to that on the Hasse graph. The Kikuchi approximations are at an average distance of about 0.05 from the true marginals, while the worst estimates are at a distance of about 0.13. Again, the serial algorithms converge very quickly, although the one on the Hasse graph requires slight damping. Comparing the parallel algorithms, the one on the minimal graph clearly outperforms the one on the full Hasse graph, even with equal damping factors (see Figure 9).

Figure 8: Poset 2. (A) Hasse graph G R. (B) Minimal graph SR.

Figure 9: GBP performance on Poset 2. (A) Mean distance. (B) Maximum distance.

Poset 3. The Hasse diagram of this poset has five loops, whereas the minimal graph has only two loops (see Figure 10). There is a saving of 28.5% per iteration of GBP on the minimal graph compared to that on the Hasse graph. The Kikuchi approximations are at an average distance of about 0.05 from the true marginals, while the worst estimates are at a distance of about 0.14. Once again, the serial algorithms converge very quickly, without the need for damping. The parallel algorithm on the minimal graph again outperforms that on the full Hasse graph, the latter requiring a damping factor w = 0.70 to avoid oscillations (see Figure 11).

Figure 10: Poset 3. (A) Hasse graph G R. (B) Minimal graph SR.

Figure 11: GBP performance on Poset 3. (A) Mean distance. (B) Maximum distance.

At least for the simple posets considered here, the less complex GBP algorithm on the minimal graph, developed in this article, seems to perform better than the full GBP on the Hasse graph, especially in the parallel versions of the algorithm. Considering that each iteration of the algorithm on the minimal graph is less complex than that on the full Hasse graph, this
suggests that there is a considerable saving in complexity to be gained by using the algorithm on the minimal graph.

Appendix A: Proof of Theorem 1

Note that the Kikuchi approximate free energy, as a functional of the pseudomarginals $\{b_r(x_r)\} \in K_R$, consists of an energy term, which is linear, and a linear combination of entropy terms, with both positive and negative coefficients. We will show that if the hypothesis of the theorem holds, there is a matching between the negative and the positive terms such that the overall entropy term will be a positive linear combination of Kullback-Leibler (KL) divergence terms that is strictly convex (see, e.g., Cover & Thomas, 1991). We will prove the existence of such a matching using results from bipartite graph theory. Form a bipartite graph $G(V^+, V^-, E)$ with vertex sets $V^+$ and $V^-$ and the edge set E as follows:

- For each $r \in R$ with $c_r < 0$, create $|c_r|$ nodes $\{v_r^1, \ldots, v_r^{|c_r|}\}$ in $V^-$.
- For each $s \in R$ with $c_s > 0$, create $c_s$ nodes $\{u_s^1, \ldots, u_s^{c_s}\}$ in $V^+$.
- To form the edge set E, connect each $v_r^i \in V^-$ to each $u_s^j \in V^+$ iff $r \subset s$.

For a subset $S \subseteq V^-$, denote by $N(S)$ the subset of nodes in $V^+$ that are connected to a node in S. Then the graph G has the following property:
$$\forall S \subseteq V^-, \quad |S| \leq |N(S)|. \tag{A.1}$$
To see this, let $S = \{v_s^i : (s, i) \in I\}$, where the index set I consists of pairs of the form (s, i) with $c_s < 0$ and $0 < i \leq |c_s|$. Now create another index set $\bar{I} := \{(s, j) : (s, i) \in I \text{ for some } i,\ 0 < j \leq |c_s|\}$, and let $\bar{S} := \{v_s^j : (s, j) \in \bar{I}\}$. Then clearly $S \subseteq \bar{S}$ and hence $|S| \leq |\bar{S}|$, but notice that $N(S) = N(\bar{S})$. Also note that $|\bar{S}| = -\sum_{t \in T} c_t$, where $T := \{t \in R : (t, 1) \in I\}$. Further,
$$\sum_{t \in A(T)} c_t = \sum_{t \in A(T);\, c_t > 0} c_t + \sum_{t \in A(T);\, c_t < 0} c_t$$
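Property A.1 is Hall's condition, which guarantees a matching saturating $V^-$. For small posets it can be verified by brute force; the sketch below uses a hypothetical two-separator example and is illustrative code, not the paper's construction:

```python
from itertools import combinations

def hall_condition_holds(neighbors):
    """Brute-force check of Hall's condition |S| <= |N(S)| for every subset S
    of the left vertex set, given as {left_vertex: set_of_right_neighbors}."""
    left = list(neighbors)
    for k in range(1, len(left) + 1):
        for S in combinations(left, k):
            NS = set().union(*(neighbors[v] for v in S))
            if len(S) > len(NS):
                return False
    return True

# One negative node per unit of |c_r| for c_r < 0, connected to the positive
# nodes of its strict supersets (hypothetical tiny example).
neighbors = {('23', 1): {('123', 1), ('234', 1)},
             ('34', 1): {('234', 1), ('345', 1)}}
print(hall_condition_holds(neighbors))  # True
```

A left vertex with no neighbors immediately violates the condition, which is what the theorem's hypothesis rules out.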
Since the data points are taken independently along each dimension, we theoretically have that MI ≡ 0. We again consider N = 1000 data points and perform 1000 runs. We estimate each of the entropy terms in the expression $MI(v) = \sum_i H(v_i) - H(v)$ with our Edgeworth-based method ($MI_{Edgeworth}$) and the nearest-neighbor method ($MI_{nn}$). Furthermore, we compare with the direct method of Kraskov et al. (2004), in which case we take the $I^{(2)}$ algorithm (k = 1), since it was claimed to be more accurate in the higher-dimensional case. We will not consider MI estimation based on
the Parzen window density estimator because of the inferior results obtained with it in the previous section and its unreasonable computational cost. The results are shown in Figure 1C. We immediately observe that the $MI_{nn}$ approach quickly diverges: seemingly, the errors in estimating the entropy terms do not cancel but quickly accumulate. (This could explain why, to the best of our knowledge, MI estimation from nearest-neighbor entropy estimators has not been reported in the literature.) We observe that our $MI_{Edgeworth}$ approach performs much better in this regard. Note also that the spread of the $I^{(2)}$ results is an order of magnitude larger and fully encompasses the spread of our $MI_{Edgeworth}$ results, at least over the d-range shown. The computational complexity of the $I^{(2)}$ algorithm is on the order of $O(N^2 d)$ (a nearest-neighbor search for each data point), whereas ours is $O(N d^3)$. Hence, although our third-order Edgeworth approximation has a bias in the entropy estimates (see Figure 1B), the bias cancels in the MI estimation (cf. the standardized cumulants). To verify this, we took different rate parameters λ for the exponential distributions along the different dimensions: λ(i) = i, i = 1, ..., d. The results are similar (not shown). Finally, we took the benchmark case of Kraskov et al. (2004): a 2D gaussian with unit variances along the x- and y-axes and a covariance r, for which the theoretical $MI = -\frac{1}{2} \log(1 - r^2)$. We again use N = 1000 and 1000 runs. The results are given in Figure 1D. We observe that the Edgeworth-based method performs best.

4 Conclusion

We have introduced a parametric differential entropy estimator based on the multivariate Edgeworth expansion of a gaussian kernel. We have shown that it can outperform the nonparametric nearest-neighbor method in the higher-dimensional case and that it can be used for mutual information estimation.
The latter application could be an interesting starting point for developing a test of independence between the components found with standard ICA algorithms. Finally, besides Shannon entropy, Rényi and other forms of entropy have been introduced in the past, and estimators have been developed for them. According to Principe and coworkers, the Parzen window density estimator is particularly suited for Rényi entropy estimation (Erdogmus, Hild, & Principe, 2004). However, Shannon entropy is the only one that possesses all the desired properties of an information measure, so its efficient and accurate estimation is of prime importance.

Acknowledgments

I thank N. Leonenko, Cardiff School of Mathematics, Cardiff University, and J. Beirlant, Universitair Centrum voor Statistiek, K.U. Leuven, for
helpful discussions. I am supported by research grants received from the Belgian Fund for Scientific Research–Flanders (G.0248.03 and G.0234.04); the Interuniversity Attraction Poles Programme, Belgian Science Policy (IUAP P5/04); the Flemish Regional Ministry of Education, Belgium (GOA 2000/11); and the European Commission (IST-2001-32114, IST-2002-001917, and NEST-2003-012963).

References

Ahmad, I. A., & Lin, P. E. (1976). A nonparametric estimation of the entropy for absolutely continuous distributions. IEEE Trans. Information Theory, 22, 372–375.
Amari, S.-I., Cichocki, A., & Yang, H. (1996). A new learning algorithm for blind signal separation. In D. S. Touretzky, M. Mozer, & M. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 757–763). Cambridge, MA: MIT Press.
Barndorff-Nielsen, O. E., & Cox, D. R. (1989). Inference and asymptotics. London: Chapman and Hall.
Beirlant, J., Dudewicz, E. J., Györfi, L., & van der Meulen, E. C. (1997). Nonparametric entropy estimation: An overview. Int. J. Math. and Statistical Sciences, 6, 17–39.
Comon, P. (1994). Independent component analysis—a new concept? Signal Process., 36(3), 287–314.
Erdogmus, D., Hild, K. E., & Principe, J. C. (2004). Adaptive blind deconvolution of linear channels using Renyi's entropy with Parzen window estimation. IEEE Trans. on Signal Processing, 52(6), 1489–1498.
Friedman, J. (1987). Exploratory projection pursuit. J. American Statistical Association, 82(397), 249–266.
Hall, P. (1984). Limit theorems for sums of general functions of m-spacings. Math. Proc. Camb. Phil. Soc., 96, 517–532.
Huber, P. (1985). Projection pursuit. Annals of Statistics, 13(2), 435–475.
Hyvärinen, A. (1998). New approximations of differential entropy for independent component analysis and projection pursuit. In M. Jordan, M. Kearns, & S. Solla (Eds.), Advances in neural information processing systems, 10 (pp. 273–279). Cambridge, MA: MIT Press.
Jones, M., & Sibson, R. (1987). What is projection pursuit? J. Royal Statistical Society A, 150, 1–36.
Kozachenko, L. F., & Leonenko, N. N. (1987). Sample estimate of the entropy of a random vector. Problems of Information Transmission, 23(2), 95–101.
Kraskov, A., Stögbauer, H., & Grassberger, P. (2004). Estimating mutual information. Phys. Rev. E, 69, 066138.
Kwak, N., & Choi, C.-H. (2002). Input feature selection by mutual information based on Parzen window. IEEE Trans. on Pattern Analysis and Machine Intelligence, 24(12), 1667–1671.
McCullagh, P. (1987). Tensor methods in statistics. London: Chapman and Hall.
Paninski, L. (2003). Estimation of entropy and mutual information. Neural Computation, 15, 1191–1253.
Shannon, C. E. (1948). A mathematical theory of communication. Bell System Tech. J., 27, 379–423.
1910
M. Van Hulle
Vapnik, V. N. (1995). The nature of statistical learning theory. New York: Springer-Verlag. Viola, P., Schraudolph, N. N., & Sejnowski, T. J. (1996). Empirical entropy manipulation for real-world problems. In D. S. Touretzky, M. Mozer, & M. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 851–857). Cambridge, MA: MIT Press.
Received September 27, 2004; accepted February 23, 2005.
NOTE
Communicated by Michael Carter
Perfect Fault Tolerance of the n-k-n Network Elko B. Tchernev
[email protected] Rory G. Mulvaney
[email protected] Dhananjay S. Phatak
[email protected] Computer Science and Electrical Engineering Department, University of Maryland Baltimore County, Baltimore, MD 21250, U.S.A.
It was shown in Phatak, Choi, and Koren (1993) that the neural network implementation of the n-k-n encoder-decoder has a minimal hidden-layer size of k = 2, and it was shown how to construct the network. A proof was given in Phatak and Koren (1995) that in order to achieve perfect fault tolerance by replicating the hidden layer, the required number of replications is at least three. In this note, exact lower bounds are derived for the number of replications as a function of n, for the n-2-n network and for the n-log2 n-n network.

1 Network Parameters of the n-2-n Network

The structure of the network and the assumptions are as follows. There are n input units (neurons) and n input sets. Each input set is a vector of zeros in which only one (the active) input is equal to 1. (This allows simple calculation of the hidden-layer weights.) The hidden and output units (neurons) have hyperbolic tangent squashing functions, with outputs in the ±1 range. There are two hidden units and n output units. Consider the x-y plane defined by the outputs of the two hidden units. Each pair of hidden-unit outputs for one of the n cases defines one point in that plane. The choice of the radius R (the distance from the origin to each point) is arbitrary, provided it is within the ±1 neuron output range. The points lie on a circle centered at the origin. The weights connecting the output units to the two hidden units define lines that bisect the plane; the intersection points of the lines that separate the classes of the output neurons form a polygon. Inside this polygon, no output neuron is active. Outside the polygon, exactly one output neuron is active, but only inside the isosceles triangle formed by one polygon side (its base) and the extensions of the two adjacent sides. Outside of these isosceles triangles, more than one output is active; thus, the desired operation of the network is to translate a valid input into a point inside these triangles. A sample configuration when the number of
Neural Computation 17, 1911–1920 (2005)
© 2005 Massachusetts Institute of Technology
Figure 1: Relationships in the plane formed by the outputs of the hidden units for a sample n-2-n problem configuration with n = 7.
inputs is n = 7 is shown in Figure 1. The following calculations determine the weights of the network to achieve this goal. The rotational angle α between the n points is α = 2π/n. The side lengths of the polygon (centered at (0,0)) are s = 2r sin(α/2) = 2r sin(π/n), where r is the polygon radius. The isosceles triangles that delineate the zone of just one active output neuron have base angles equal to the central angle α = 2π/n. Their inscribed radii r0 determine the maximum change in a hidden neuron output that leaves the total output correct. Using one of the standard formulas for the inscribed circle radius, r0 = (p − b) tan(α/2), where b is the length of the two equal sides and p is the half perimeter p = (2b + s)/2, we arrive at:

r0 = (s/2) tan(α/2) = r sin(α/2) tan(α/2) = r sin(π/n) tan(π/n).

The distance l from the origin to the decision line of any output neuron (a side of the polygon) is l = r cos(α/2) = r cos(π/n).
The distance R from the origin to the centers of the inscribed circles (i.e., to the ideal output values of the hidden neurons) is

R = l + r0 = r cos(π/n) + r sin(π/n) tan(π/n) = r (cos(π/n) + sin(π/n) tan(π/n)).

Each output neuron has a combined input z = ax + by − c, where x and y are the outputs of the two hidden neurons, respectively; a and b are the weights connecting them to the output; and c is the output's bias. Setting z = 0 corresponds to a line on the plane, coinciding with the decision line of the corresponding output neuron and with a side of the polygon. In normal form, c is the distance of the line to the origin, or l, and a = cos θ, b = sin θ, where θ is the angle between l and the x-axis. For any and all of the points, cos θ and sin θ range from −1 to +1, and l is subject to choice. Therefore, for the output layer, the maximum absolute weight (needed later for fault-tolerance estimation) is the bigger of 1 (the maximum of sin/cos) and l (the maximum for the bias weight). For the hidden layer, the maximum absolute input weight |w_i^hidden|_max occurs whenever only one of the x and y units is active, and is:

w_max^hidden = tanh⁻¹(R).

As R is the factor that determines the input-to-hidden neuron weight, it is better to express l in terms of R:

l = R cos²(π/n).

This way, since R ≤ 1 and cos² x ≤ 1, the maximum output-layer weight will be w_max^output = 1. Of course, it is possible to multiply all the terms of the line expression by some m and preserve the output unit function; this will make a = m cos θ, b = m sin θ, c = ml, and w_max^output = m.

The largest weight magnitude in the network overall will then be:

w_max = max(m, tanh⁻¹(R)).
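The chain of quantities just derived (α, s, r0, l, R, and the maximum weights) is easy to tabulate numerically. The following Python sketch is an illustration of the derivation, not code from the note; the polygon radius r = 0.8 is a hypothetical choice that keeps R inside the ±1 output range, and the identity l = R cos²(π/n) serves as a consistency check.

```python
import math

def n2n_geometry(n, r):
    """Geometric quantities of the hidden-output plane of the n-2-n network.

    Names follow the derivation in the text; r is the polygon radius
    (chosen here, hypothetically, so that R stays inside the +-1 range).
    """
    alpha = 2 * math.pi / n                                 # angle between points
    s = 2 * r * math.sin(math.pi / n)                       # polygon side length
    r0 = r * math.sin(math.pi / n) * math.tan(math.pi / n)  # inscribed-circle radius
    l = r * math.cos(math.pi / n)                           # origin-to-decision-line
    R = l + r0                                              # ideal hidden-output radius
    w_hidden_max = math.atanh(R)                            # max input-to-hidden weight
    return {"alpha": alpha, "s": s, "r0": r0, "l": l, "R": R,
            "w_hidden_max": w_hidden_max}

g = n2n_geometry(n=7, r=0.8)
assert g["R"] < 1                                   # needed for atanh(R) to exist
# Consistency with the text: l = R cos^2(pi/n)
assert abs(g["l"] - g["R"] * math.cos(math.pi / 7) ** 2) < 1e-12
```

The final assertion verifies the rewriting of l in terms of R used for the weight scaling.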
The expression for the combined correct input for an output-layer node then becomes:

z = m cos θ · R cos θ + m sin θ · R sin θ − mR cos²(π/n)
  = mR cos²θ + mR sin²θ − mR cos²(π/n)
  = mR (1 − cos²(π/n)) = mR sin²(π/n).

We are free to choose any m that maximizes the overall fault tolerance. The parameters subject to choice, then, become R and m.

2 Calculating the Redundancy

2.1 Fault 1. In this fault, a weight is disconnected (w = 0), or a hidden unit is stuck at 0. Because of the way the network is constructed, disconnection of input to hidden and hidden to output has the same effect. Disconnection of input to hidden will affect one input case; hidden to output will affect all cases. The effect will always be to move the effective combined input toward the origin, thereby reducing its absolute value. If the redundancy ratio is k, the combined (nonfaulty) input to any output unit is z_s = kax + kby − kc = kz. The faulty term will be (k − 1)ax or (k − 1)by. Hence, the faulty combined input will be z_f = (k − 1)ax + kby − kc. We need to choose k such that the fault preserves at least the sign; therefore, setting z_f = 0 will obtain the lower, noninclusive limit for k. Therefore,

k > max(ax/z) = max(ax)/z = mR / (mR(1 − cos²(π/n))) = 1/sin²(π/n).   (2.1)
2.2 Fault 2. In this fault, an input-to-hidden weight (hidden bias included) is stuck at ±∞, or a hidden unit is stuck at ±1. A stuck weight will affect only one input case. A stuck unit will affect all cases, but the end effect will be the same: one hidden unit stuck at ±1. This might be of the same polarity as it should be or of the opposite. We look at the two cases.

2.2.1 Opposite Polarity. The faulty combined input to an output unit will be z_f = a((k − 1)x − 1) + kby − kc. After subtracting from the regular expression, the maximum k is:

k > max(a(x + 1)/z) = max(2ax)/z = 2mR / (mR(1 − cos²(π/n))) = 2/sin²(π/n).   (2.2)

2.2.2 Same Polarity. This case is equivalent to Fault 1: k > 1/sin²(π/n).
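As a numerical illustration of bounds 2.1 and 2.2 (a sketch, not the authors' code): the opposite-polarity fault-2 bound is exactly twice the fault-1 bound for every n, and both grow roughly like n²/π² for large n.

```python
import math

def replication_bounds_n2n(n):
    """Strict lower bounds on the replication ratio k for the n-2-n network:
    bound 2.1 (fault 1; also fault 2, same polarity) and bound 2.2
    (fault 2, opposite polarity), which dominates."""
    s2 = math.sin(math.pi / n) ** 2
    return 1.0 / s2, 2.0 / s2

k1, k2 = replication_bounds_n2n(4)
# For n = 4, sin^2(pi/4) = 1/2, so the bounds are 2 and 4.
assert math.isclose(k1, 2.0) and math.isclose(k2, 4.0)
assert math.isclose(k2, 2 * k1)   # bound 2.2 is twice bound 2.1 for every n
```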
2.3 Fault 3. Here, the hidden-to-output weight is stuck at ±w_max (output bias excluded).

2.3.1 Opposite Polarity. The faulty combined input to an output unit will be z_f = a(k − 1)x − w_max x + kby − kc. After subtracting from the regular expression, the maximum k is:

k > max((ax + w_max x)/z) = (max(ax) + max(w_max x))/z = (mR + w_max R)/(mR(1 − cos²(π/n))) = 1/sin²(π/n) + w_max/(m sin²(π/n)),

that is,

k > (1 + w_max/m)/sin²(π/n).   (2.3)

2.3.2 Same Polarity. z_f = a(k − 1)x + w_max x + kby − kc:

max(|ax − w_max x|)/z < …

[…] √d − 1/√d. Either end of the range is not a good choice. Setting R = √d will mean saturated hidden units with a correspondingly large magnitude of the hidden weights, while setting R ≈ √d − 1/√d will place the activation too close to the decision plane. We can compromise by setting R = √d − 1/(2√d), and for this choice of R, each hidden unit output will be of magnitude O_h = 1 − 1/(2d). Figure 2 shows the network configuration when n = 4, d = 2, and R = √d − 1/(2√d).

3.2 Calculating the Redundancy. Consider k-factor replication; the combined input of each output-layer unit will be z_k = kz.
Figure 2: Geometry of the n-log2 n-n architecture, when n = 4, d = 2, and R = √d − 1/(2√d).
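The compromise choice of R can be checked in one line: dividing R = √d − 1/(2√d) by the per-coordinate scale √d gives the hidden-output magnitude O_h = 1 − 1/(2d) stated in the text (a sketch for illustration, not the authors' code).

```python
import math

def hidden_output_magnitude(d):
    """For the compromise radius R = sqrt(d) - 1/(2 sqrt(d)), each of the d
    hidden-unit outputs has magnitude R/sqrt(d) = 1 - 1/(2d) (= O_h)."""
    R = math.sqrt(d) - 1 / (2 * math.sqrt(d))
    return R, R / math.sqrt(d)

R, Oh = hidden_output_magnitude(2)         # the d = 2 case of Figure 2
assert math.isclose(Oh, 1 - 1 / (2 * 2))   # O_h = 3/4 for d = 2
assert R < math.sqrt(2)                    # strictly inside the saturation radius
```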
We skip Fault 1, weight disconnected or hidden unit stuck at 0, as it is not the worst case. Consider Fault 2: input-to-hidden weight (including hidden bias) stuck at ±∞ or hidden unit stuck at ±1. A stuck weight will affect only one input case, and a stuck unit will affect all cases, but the end effect will be the same: one hidden unit stuck at ±1. This might be of the same polarity as it should be or of the opposite. We look at the more severe opposite-polarity case:

z_f = z_k − ax − a = k(mR − m(d − 1)/√d) − mR/d − m/√d = m(kR√d³ + kd − kd² − R√d − d)/√d³.
Because this must be positive, after solving for k, we obtain

k > (R√d + d) / (d(R√d − d + 1)).

Substituting the two choices for R, we obtain:

R = √d ⇒ z = m/√d; z_k = km/√d; z_f = m(k − 2)/√d;
k > 2 ⇒ k = 3.   (3.1)

R = √d − 1/(2√d) ⇒ z = m/(2√d); z_k = km/(2√d); z_f = m(kd − 4d + 1)/(2√d³);
k > (4d − 1)/d ⇒ k = 4.   (3.2)
Finally, consider Fault 3: hidden-to-output weight stuck at ±w_max (output bias excluded). We look at the more severe opposite-polarity case:

z_f = z_k − ax − a·w_max = k(mR − m(d − 1)/√d) − mR/d − m·w_max/√d = m(kR√d³ + kd − kd² − R√d − d·w_max)/√d³.

Because this must be positive, after solving for k, we obtain

k > (R√d + d·w_max) / (d(R√d − d + 1)).

Substituting the two choices for R, we obtain:

R = √d ⇒ z = m/√d; z_k = km/√d; z_f = m(k − 1 − w_max)/√d;
k > 1 + w_max ⇒ k = 1 + w_max.   (3.3)

R = √d − 1/(2√d) ⇒ z = m/(2√d); z_k = km/(2√d); z_f = m(kd − 2d + 1 − 2d·w_max)/(2√d³);
k > 2 + 2w_max − 1/d ⇒ k = 2 + 2w_max − 1/d.   (3.4)
If w_max is considered to be the maximum existing weight in the network, it is clear that this is set to some hardware-imposed limit when R = √d, because the input-to-hidden-layer weights will be tanh⁻¹(1). On the other
hand, if we select R = √d − 1/(2√d), then the input-to-hidden-layer weights will be w = tanh⁻¹(1 − 1/(2d)). The largest weight in the network will be one of

a = m/√d;   c = m(d − 1)/√d;   w = tanh⁻¹(1 − 1/(2d)).

Since w depends only on d, it cannot be modified; we can take that as the maximum and scale a and c to make c equal to w:

m = w√d/(d − 1) = √d tanh⁻¹(1 − 1/(2d))/(d − 1);   c = w;   a = tanh⁻¹(1 − 1/(2d))/(d − 1).
The value of k will be the biggest of the values determined for the different faults by formulas 3.1 to 3.4. It is easy to see that the maximum is the result of formula 3.4 for the fault of a weight stuck at ±w_max (opposite-polarity case):

k = 2 + 2w − 1/d = 2 + 2 tanh⁻¹(1 − 1/(2d)) − 1/d.   (3.5)
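Formula 3.5 makes the comparison behind Figure 3 easy to reproduce. The sketch below is an illustration under the assumption d = ⌈log2 n⌉, with k left unrounded as in the text; it also contrasts the result with the n-2-n fault-2 bound 2.2.

```python
import math

def k_nlog2n(n):
    """Replication bound 3.5 for the n-log2(n)-n architecture and the
    resulting hidden-layer size k*d (k left as a real number, as in the text)."""
    d = math.ceil(math.log2(n))
    w = math.atanh(1 - 1 / (2 * d))   # largest (input-layer) weight
    k = 2 + 2 * w - 1 / d             # formula 3.5
    return d, k, k * d

d, k, nodes = k_nlog2n(16)
assert d == 4 and math.isclose(k, 2 + 2 * math.atanh(7 / 8) - 0.25)

# The n-2-n bound 2.2 grows like n^2, so its replicated hidden layer
# (2 units times k) quickly dwarfs the n-log2(n)-n one.
k_n2n = 2 / math.sin(math.pi / 16) ** 2
assert nodes < 2 * k_n2n
```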
The total number of hidden units in the log2 n configuration then becomes nodes = kd = k⌈log2 n⌉.

4 Conclusion

This study was undertaken in order to explore the relationship between size and perfect single-fault tolerance. Using replication, does the smallest initial network produce the smallest fault-tolerant network? The answer is important from the point of view of necessary resources and the required construction algorithm. The n-k-n encoder-decoder problem allowed us to answer this question analytically for two different network architectures. Figure 3 compares the minimum replication factors k and corresponding hidden node counts for worst-case faults w = w_max for the n-2-n and n-log2 n-n architectures, starting at n = 4. The x- and y-axes are in logarithmic scale. The unit line n is plotted for reference. The result demonstrates that a bigger initial seed network (of size log2 n) requires fewer replications and a smaller total number of units to achieve
Figure 3: Minimum number of replications k and number of nodes needed for complete fault tolerance of the n-2-n and n-log2 n-n neural network architectures implementing a solution to the n-k-n encoder-decoder problem.
perfect fault tolerance for the encoder-decoder problem than the minimal seed network of two hidden units, for all n ≥ 4.

Acknowledgments

This work was supported by NSF grants ECS-9875705 and ECS-0196362.

References

Phatak, D. S., Choi, H., & Koren, I. (1993). Construction of minimal n-2-n encoders for any n. Neural Computation, 5, 783–794.

Phatak, D. S., & Koren, I. (1995). Complete and partial fault-tolerance of feedforward neural nets. IEEE Transactions on Neural Networks, 6, 446–456.
Received October 15, 2004; accepted March 2, 2005.
NOTE
Communicated by Zoubin Ghahramani
On the Slow Convergence of EM and VBEM in Low-Noise Linear Models Kaare Brandt Petersen
[email protected] Ole Winther
[email protected] Lars Kai Hansen
[email protected] Informatics and Mathematical Modeling, Technical University of Denmark, Building 321, DK-2300 Kongens Lyngby, Denmark
We analyze the convergence of the expectation maximization (EM) and variational Bayes EM (VBEM) schemes for parameter estimation in noisy linear models. The analysis shows that both schemes are inefficient in the low-noise limit. The linear model with additive noise includes as special cases independent component analysis, probabilistic principal component analysis, factor analysis, and Kalman filtering. Hence, the results are relevant for many practical applications.

1 Introduction

The expectation maximization (EM) algorithm introduced by Dempster, Laird, and Rubin (1977) is widely used for maximum likelihood estimation in hidden variable models. More recently, a generalization of the EM algorithm, the so-called variational Bayes EM algorithm (VBEM), has been introduced (see, e.g., Attias, 1999), which allows more accurate modeling of parameter uncertainty. EM convergence is known to slow dramatically when the signal-to-noise ratio is high, and a natural question is then: Will the more accurate modeling of parameter variance in VBEM assist the convergence? Here we analyze both schemes and show that they are subject to slow convergence in the low-noise limit. We consider linear models with additive normal noise, x_t = A s_t + n_t, t = 1, . . . , N, where x_t ∈ R^m are N observed data vectors, s_t ∈ R^d unobserved hidden variables, and n_t ∼ N(0, Σ) white gaussian noise. For notational convenience, we construct the matrices X and S, which consist of the observed and unobserved data vectors as columns. The unobserved variables S are assumed to be distributed according to a prior p(S), which can be gaussian (factor
Neural Computation 17, 1921–1926 (2005)
analysis) or nongaussian (independent component analysis). The matrix A is referred to as the mixing matrix, and it can (but does not have to) be square (m = d). If m < d, we speak of the overcomplete case, while the opposite situation, d < m, is denoted overdetermined. In our discussion, the data are assumed prewhitened, that is, XXᵀ = NI, a mild condition that simplifies the notation.

2 Slow Convergence in EM

For parameter estimation (A, Σ) in the linear model, the main challenge is that the marginal likelihood involves an average over all possible configurations of the hidden variables, with a measure that depends on the unknown parameters themselves. EM algorithms break up this stalemate in two separate iterated steps. First, we find the posterior distribution of the hidden variables P(S|X, A, Σ) for fixed parameters, and then improve the parameters by maximizing the log likelihood averaged with regard to the approximate hidden variable posterior (for details, consult Dempster et al., 1977; McLachlan & Krishnan, 1997). Bermond and Cardoso (1999) made an important but seemingly little-known discovery about the convergence properties of the EM algorithm in the low-noise limit. Following their line of thought, and for simplicity considering the case Σ = σ²I, we can expand the moments of the posterior, ⟨S⟩ and ⟨SSᵀ⟩, in powers of the noise variance (see Figure 1) to obtain approximate expressions for the parameter updates. Using the notation Γ = X − AS, we get

A_{n+1} = X⟨S⟩ᵀ⟨SSᵀ⟩⁻¹ = A_n + σ_n² Ã_n + O(σ⁴)
σ²_{n+1} = (1/(mN)) Tr⟨ΓΓᵀ⟩ = σ²_bias + σ_n² z + O(σ⁴),
where A_n denotes the estimated mixing matrix in the nth iteration. In the square case, the noise update simplifies into σ²_{n+1} = σ_n² + O(σ⁴). In the overdetermined case, σ²_bias = 1 − rank(A)/m and z = rank(A)/m − 2Tr(U)/N, where U is a data- and prior-dependent matrix. (This result is discussed in Petersen & Winther, 2005a, to which readers are referred for details.) The result indeed explains the poor convergence properties experienced using EM in the low-noise limit. The EM algorithm "freezes," and an excessive number of iterations are needed for convergence of the mixing matrix. Moreover, for the square case, as also mentioned in Bermond and Cardoso (1999), the first-order correction Ã_n is proportional to the gradient of the noiseless model's likelihood, and thus the fixed point is to first order equivalent to the fixed point of the noiseless model (Bell & Sejnowski, 1995). The slow convergence of the EM algorithm has been debated for a while, and many suggestions for speeding it up have been proposed (McLachlan & Krishnan, 1997). One straightforward method is to use a gradient-based
Figure 1: This plot demonstrates that the fundamental Taylor expansion of the moments ⟨S⟩ and ⟨SSᵀ⟩ is reasonably accurate. A set of sources S_gen is generated using a mixture of gaussians (MoG) prior. From this, using a suitable 2 × 2 mixing matrix, the observed data X is constructed for each noise level. Since the source prior is a MoG, the exact posterior moments ⟨S⟩_exact, ⟨SSᵀ⟩_exact can be computed. The error is the mean squared difference of the true mean and the approximation, Err = (1/(dN)) Σ_it (⟨S_it⟩_exact − ⟨S_it⟩_est)², and correspondingly for the second moment. Note that the approximation is fairly accurate when the noise variance is small. As expected, the first-order approximation (triangles) is more accurate than the zeroth-order approximation (circles) in the low-noise regime. Only for noise variance larger than 10⁻² is it beneficial to use the generative sources (dots) as estimators for the posterior means. This is possible only for artificial data sets but is included for perspective. The mean field (MF) approximations to the posterior moments (squares) are also included for perspective (see Hojen-Sorensen, Winther, & Hansen, 2002, for details). The MF approach is performing very well indeed, especially when the so-called linear response (LR) correction is taken into account. This is an indicator that in the low-noise regime, ICA techniques such as mean field ICA may prove to be accurate approaches.
optimizer in the M-step. The gradient and the bound value are expressed in terms of the sufficient statistics, which are obtained in the E-step (Olsson, Lehn-Schiøler, & Petersen, 2005). Recently, another general technique, by Salakhutdinov and Roweis (2003), called adaptive overrelaxed EM, was proposed, leading to considerably faster convergence (Petersen & Winther, 2005b). The key idea of the adaptive, overrelaxed EM is to boost the update by a factor η ≥ 1. Combining this with the low-noise-limit analysis, we get

A_{n+1} = A_n + η(A^{EM}_{n+1} − A_n) = A_n + σ_n² η Ã_n + O(σ⁴).
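Both the freezing and the overrelaxation remedy can be reproduced in the gaussian-prior (factor analysis / probabilistic PCA) special case of the model. The sketch below is an illustrative reimplementation under that assumption, with hypothetical matrices, not the authors' code: one EM update of A barely moves when σ² is small, and an overrelaxed step simply rescales the move by η.

```python
import numpy as np

def em_step_A(X, A, sigma2):
    """One EM update of the mixing matrix for x = A s + n with a standard
    gaussian prior on s (the factor-analysis special case of the model)."""
    d = A.shape[1]
    M = A.T @ A + sigma2 * np.eye(d)                        # A^T A + sigma^2 I
    S = np.linalg.solve(M, A.T @ X)                         # posterior means <S>
    SS = X.shape[1] * sigma2 * np.linalg.inv(M) + S @ S.T   # <S S^T>
    return X @ S.T @ np.linalg.inv(SS)

rng = np.random.default_rng(0)
A_true = np.array([[2.0, 0.3], [0.1, 1.5]])     # hypothetical ground truth
X = A_true @ rng.standard_normal((2, 500))      # (near) noise-free data
A0 = np.array([[1.6, 0.7], [0.5, 1.1]])         # deliberately wrong start

step_low = np.linalg.norm(em_step_A(X, A0, 1e-6) - A0)
step_high = np.linalg.norm(em_step_A(X, A0, 1e-2) - A0)
assert step_low < step_high     # the A-step is O(sigma^2): EM "freezes"

eta = 100.0                     # overrelaxation rescales the move by eta
A_over = A0 + eta * (em_step_A(X, A0, 1e-6) - A0)
assert np.linalg.norm(A_over - A0) > step_low
```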
That is, adaptive, overrelaxed EM works because the step size factor η directly counters the small magnitude of the noise variance. The only downside is that there is no longer a guarantee of an increase in the likelihood, and a test step is introduced to remedy this.

3 Variational Bayes EM

In variational Bayes EM, we expand the model to include a distribution over the model parameters A and σ², treating them on the same footing as the hidden variables. (See Beal & Ghahramani, 2003, for an introduction to variational Bayes techniques.) The algorithm is aimed at maximizing a lower bound of the marginal log likelihood and allows convenient addition of prior information on the parameters. We choose a zero-mean gaussian prior for the mixing matrix, with covariance Σ_p, and an inverse gamma distribution with (hyper)parameters α_p and β_p:

p(A) ∝ exp(−(1/2) Tr(Aᵀ Σ_p⁻¹ A))
p(σ²) ∝ (σ²)^{−(α_p+1)} exp(−β_p/σ²).

Combining these priors with the observation model, we obtain the variational approximations for the posterior distributions, which have the moments that are updated sequentially in the VBEM algorithm. At convergence, we use the posterior mean of these variational distributions as estimators of the unknown model parameters. The statistic that determines the width of the posterior distribution of A is ⟨1/σ²⟩. Defining r² = 1/⟨1/σ²⟩, the low-noise limit corresponds to r² → 0, and we can expand the moments involved in updating the A pdf in powers of r²:

A_{n+1} = X⟨S⟩ᵀ(⟨SSᵀ⟩ + r² Σ_p⁻¹)⁻¹ = A_n + O(r²)
Var(A)_{n+1} = r²(⟨SSᵀ⟩ + r² Σ_p⁻¹)⁻¹ = 0 + O(r²),

and accordingly for the parameters α, β of the inverse-gamma distributed σ²:

α_{n+1} = α_p + mN/2 = α_n
β_{n+1} = β_p + (N/2) β^{bias}_{n+1} = β_n + N·O(r²)
r²_{n+1} = β_{n+1}/α_{n+1} = r_n² + O(r²),
where β^{bias}_{n+1} = 1 − rank(A_{n+1})/m. Hence, we find that the VBEM update for the mixing matrix and the crucial moment of the noise distribution is freezing exactly as in EM.
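The collapse of VBEM onto EM as r² → 0 can be seen in a few lines. The sketch below is illustrative, with the hidden-variable moments taken as given random stand-ins rather than computed by a full variational E-step: the prior term r²Σ_p⁻¹ and the posterior variance factor both vanish, leaving the plain EM M-step.

```python
import numpy as np

def vbem_A_update(X, S_mean, SS, r2, Sigma_p_inv):
    """VBEM-style statistics of the mixing-matrix posterior for given
    hidden-variable moments: the mean and the r^2-scaled covariance factor."""
    G = np.linalg.inv(SS + r2 * Sigma_p_inv)
    return X @ S_mean.T @ G, r2 * G

rng = np.random.default_rng(1)
X = rng.standard_normal((2, 200))
S = rng.standard_normal((2, 200))
SS = S @ S.T                                   # stands in for <S S^T>

A_em = X @ S.T @ np.linalg.inv(SS)             # plain EM M-step
A_vb, var_factor = vbem_A_update(X, S, SS, r2=1e-8, Sigma_p_inv=np.eye(2))

assert np.linalg.norm(A_vb - A_em) < 1e-6      # degenerates to the EM update
assert np.linalg.norm(var_factor) < 1e-6       # posterior variance -> 0
```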
4 Discussion

The analysis shows that for linear models with low gaussian noise, both the traditional EM algorithm and the variational Bayes extension, which practically degenerates back into an EM algorithm, have serious defects with respect to the rate of convergence. Experience from ICA problems furthermore indicates that the window in which the noise is large enough to make the convergence reasonable and yet not too large with respect to estimation of parameters is indeed very small. Furthermore, note that in Salakhutdinov, Roweis, and Ghahramani (2003), the convergence rate in a gaussian mixture model is demonstrated to be slow when the noise level is large, that is, when the mixtures have considerable overlap. The situation analyzed in this article, however, is a limit of low noise in which the problem intuitively should have an extraordinarily clear and well-defined solution. In that sense, this result is counterintuitive and different from some of the previous observations regarding the slowdown of the EM algorithm. Most likely the explanation is that there is more than one situation in which the EM algorithm becomes slow and that these different situations are not effects of the same underlying reason but rather truly different. Finally, it is crucial for the analysis that the observation model is linear, since we otherwise cannot get closed-form expressions in the M-step. But practical experience and preliminary analysis suggest that this is not the core of the convergence problem; we conjecture instead that it is indeed the low-noise limit that is the essence of the matter.

Acknowledgments

The research for this note was supported financially by Oticon Fonden.

References

Attias, H. (1999). Inferring parameters and structure of latent variable models by variational Bayes. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence (UAI). San Francisco: Morgan Kaufmann.

Beal, M. J., & Ghahramani, Z. (2003).
The variational Bayesian EM algorithm for incomplete data: With application to scoring graphical model structures. Bayesian Statistics, 7, 453–465. Bell, A. J., & Sejnowski, T. J. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7(6), 1129–1159.
Bermond, O., & Cardoso, J. F. (1999). Approximate likelihood for noisy mixtures. In Proceedings of the First International Workshop on Independent Component Analysis and Blind Source Separation, ICA '99. Aussois, France.

Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, 1–38.

Hojen-Sorensen, P., Winther, O., & Hansen, L. K. (2002). Mean-field approaches to independent component analysis. Neural Computation, 14, 889–918.

McLachlan, G. J., & Krishnan, T. (1997). The EM algorithm and extensions. New York: Wiley.

Olsson, R. K., Lehn-Schiøler, T., & Petersen, K. B. (2005). State-space models—from the EM algorithm to a gradient approach. Manuscript submitted for publication.

Petersen, K. B., & Winther, O. (2005a). Explaining slow convergence of EM in low noise linear mixtures (Tech. Rep. 2005-2). Kongens Lyngby: Informatics and Mathematical Modelling, Technical University of Denmark.

Petersen, K. B., & Winther, O. (2005b). The EM algorithm in independent component analysis. In IEEE International Conference on Acoustics, Speech, and Signal Processing. Piscataway, NJ: IEEE.

Salakhutdinov, R., & Roweis, S. (2003). Adaptive overrelaxed bound optimization methods. In Proceedings of the International Conference on Machine Learning, ICML. AAAI Press.

Salakhutdinov, R., Roweis, S., & Ghahramani, Z. (2003). Optimization with EM and expectation-conjugate-gradient. In Proceedings of the International Conference on Machine Learning, ICML. AAAI Press.
Received January 5, 2005; accepted March 2, 2005.
LETTER
Communicated by George Gerstein
Analyzing Functional Connectivity Using a Network Likelihood Model of Ensemble Neural Spiking Activity Murat Okatan
[email protected] Neuroscience Statistics Research Laboratory, Department of Anesthesia and Critical Care, Massachusetts General Hospital, Boston, MA 02114-2698, U.S.A.
Matthew A. Wilson
[email protected] Picower Center for Learning and Memory, Riken-MIT Neuroscience Research Center, Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA 02139, U.S.A.
Emery N. Brown
[email protected] Neuroscience Statistics Research Laboratory, Department of Anesthesia and Critical Care, Massachusetts General Hospital, and Division of Health Sciences and Technology, Harvard Medical School/Massachusetts Institute of Technology, Boston, MA 02114-2698, U.S.A.
Analyzing the dependencies between spike trains is an important step in understanding how neurons work in concert to represent biological signals. Usually this is done for pairs of neurons at a time using correlation-based techniques. Chornoboy, Schramm, and Karr (1988) proposed maximum likelihood methods for the simultaneous analysis of multiple pair-wise interactions among an ensemble of neurons. One of these methods is an iterative, continuous-time estimation algorithm for a network likelihood model formulated in terms of multiplicative conditional intensity functions. We devised a discrete-time version of this algorithm that includes a new, efficient computational strategy, a principled method to compute starting values, and a principled stopping criterion. In an analysis of simulated neural spike trains from ensembles of interacting neurons, the algorithm recovered the correct connectivity matrices and interaction parameters. In the analysis of spike trains from an ensemble of rat hippocampal place cells, the algorithm identified a connectivity matrix and interaction parameters consistent with the pattern of conjoined firing predicted by the overlap of the neurons' spatial receptive fields. These results suggest that the network likelihood model can be an efficient tool for the analysis of ensemble spiking activity.
Neural Computation 17, 1927–1961 (2005)
1 Introduction A major goal of neural data analysis is to characterize how neurons that are part of an ensemble interact with each other (Espinosa & Gerstein, 1988; Gochin, Gerstein, & Kaltenbach, 1990; Eggermont, 1991; Gochin, Miller, Gross, & Gerstein, 1991; Lindsey, Hernandez, Morris, & Shannon, 1992; Wilson & McNaughton, 1994; Vaadia et al., 1995; Skaggs & McNaughton, 1996; Li, Morris, Baekey, Shannon, & Lindsey, 1999; Schoenbaum, Chiba, & Gallagher, 2000; Shannon, Baekey, Morris, Li, & Lindsey, 2000; Louie & Wilson, 2001; Lee & Wilson, 2002). With the advent of the multielectrode recording technology, it is now possible to record the activity of several hundred neurons simultaneously (Wilson & McNaughton, 1993; Nicolelis et al., 2003). This has underlined the need for developing analysis methods that can process these data quickly and efficiently (Brown, Kass, & Mitra, 2004). Statistical dependencies between spike trains may be represented using cross-intensity functions (Cox & Lewis, 1972; Brillinger, 1976a, 1992), product densities, cumulant densities, cumulant spectra, method of moments (Bartlett, 1966; Brillinger, 1975a, 1975b), cross-correlation (Perkel, Gerstein, & Moore, 1967; Moore, Segundo, Perkel, & Levitan, 1970; Aertsen & Gerstein, 1985; Palm, Aertsen, & Gerstein, 1988; Brody, 1998, 1999a, 1999b), coherence (Brillinger, 1976b, 1992), and joint peristimulus time histogram (JPSTH) (Gerstein & Perkel, 1969, 1972; Aertsen, Gerstein, Habib, & Palm, 1989). These methods are usually applied to characterize the dependencies between pairs of neurons at a time, ignoring possible effects from other neurons, although techniques such as partial coherence (Brillinger, 1976b, 1992), conditional cross-correlogram (Eggermont, 1991), and JPSTH (Kristan & Gerstein, 1970) have been applied to the simultaneous analysis of triplets. 
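For concreteness, the classical pairwise baseline that the methods above implement, and that the network likelihood model is meant to go beyond, can be sketched in a few lines (an illustration with arbitrary bin and window choices, not code from the article):

```python
import numpy as np

def cross_correlogram(ref, target, bin_ms=1.0, window_ms=50.0):
    """Histogram of target-spike lags relative to each reference spike
    (the classical pairwise cross-correlogram; spike times in ms)."""
    edges = np.arange(-window_ms, window_ms + bin_ms, bin_ms)
    counts = np.zeros(len(edges) - 1)
    for t in ref:
        lags = target - t
        lags = lags[(lags >= -window_ms) & (lags < window_ms)]
        counts += np.histogram(lags, bins=edges)[0]
    return edges, counts

# Toy pair: neuron B fires reliably 5 ms after each spike of neuron A,
# so the correlogram peaks in the bin starting at +5 ms.
a = np.arange(0.0, 1000.0, 20.0)
b = a + 5.0
edges, counts = cross_correlogram(a, b)
assert edges[np.argmax(counts)] == 5.0
```

Note that such a pairwise estimate ignores the influence of all other neurons, which is precisely the limitation the ensemble likelihood approach addresses.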
Logical extension of these methods to the simultaneous analysis of more than three spike trains faces the problem of an exponentially increasing number of bins to calculate. This problem is avoided in methods that detect particular spike timing relations among multiple neurons. These include the gravitational clustering algorithm, which is used for the detection of synchronous cell assemblies (Gerstein & Aertsen, 1985; Gerstein, Perkel, & Dayhoff, 1985; Baker & Gerstein, 2000), and spike pattern classification methods, which are used to detect precisely timed spatiotemporal firing patterns or to search for a particular sequential firing pattern within the ensemble spiking activity (Abeles & Gerstein, 1988; Louie & Wilson, 2001; Pipa & Grün, 2003; Lee & Wilson, 2004). Statistical methods based on the maximum likelihood (ML) principle have been proposed as alternative techniques for the simultaneous analysis of neural interactions (Borisyuk, Borisyuk, Kirillov, Kovalenko, & Kryukov, 1985; van den Boogaard, 1986; Brillinger, 1988a, 1988b, 1992; Chornoboy, Schramm, & Karr, 1988). The applicability of the ML method to the analysis of spike trains of more than a few neurons has been limited due to
Network Likelihood Model
1929
the difficulties in computing the ML estimates in large models. Chornoboy et al. (1988) addressed the computability of the estimates for large models and proposed iterative estimation algorithms for a class of additive-linear and multiplicative conditional intensity function models. In this article, we explore the computational properties of the algorithm associated with the multiplicative model and use it to analyze both simulated and real neural data. The original formulation of the algorithm is in continuous time. Here we formulate a discrete-time implementation and describe a method that reduces the number of operations. We derive a principled method for computing a good initial parameter vector to start the algorithm. We also derive a data-specific stopping rule to terminate the iteration. We apply this algorithm to the analysis of simulated spike trains and show that it successfully recovers the true time course of the excitatory and inhibitory interactions among simulated neurons. We propose a method to determine optimal connectivity matrices for neural ensembles using a combination of the likelihood ratio test and the Akaike information criterion or the Bayesian information criterion (AIC or BIC; Box, Jenkins, & Reinsel, 1994). We show that this method discovers the correct connectivity matrices in simulations. We apply these methods to the analysis of simultaneously recorded spike trains of a population of rat hippocampal place cells. We determine an optimal connectivity matrix for these cells and analyze its relationship to the degree of overlap between their place fields.
2 Theory

In this section, we introduce the network likelihood model and the iterative algorithm for the ML estimation of the model parameters. We derive a method to compute a starting vector for the algorithm and a stopping rule to terminate the iteration. Finally, we describe how the ML estimates are used to determine a connectivity matrix that describes the data optimally.

2.1 Network Likelihood Model. Our method is based on the point process representation of spike trains (Brillinger, 1988a, 1992; Chornoboy et al., 1988; Brown, Barbieri, Ventura, Kass, & Frank, 2002; Brown, Barbieri, Eden, & Frank, 2003). Define an interval (0, T], and let 0 < u_{i,1} < u_{i,2} < \cdots […] > 0 for all parameters. If the duration of the spike counting window W is short relative to the average interspike interval in the data, the processes I_{j>0} are zero at
most time points. This allows computing the denominator in equation 2.12 by executing the summations and multiplications only at the time points where the I_j are nonzero. In the following, we develop the notation needed to achieve this simplification. We define the set Q_j of all time points k at which I_{j,k} > 0. With this definition, the denominator in equation 2.12 is given by \sum_{k \in Q_j} I_{j,k} \big[ \prod_{l=0}^{D} (\gamma_{i,l}^{(n)})^{I_{l,k}} \big]. This expression can be further simplified, since the products can be computed by multiplying only the parameters that have a nonzero exponent. This can be achieved by defining a set, at each time point, to index these parameters. We thus define the sets J_k = \{ l \mid 0 \le l \le D,\ I_{l,k} > 0 \} for 0 \le k \le K - 1. With these sets, equations 2.12 and 2.13 may be written as

\gamma_{i,j}^{(n+1)} = \gamma_{i,j}^{(n)} \left[ \frac{\sum_{k \in Q_j} I_{j,k} N_{i,k:k+1}}{\sum_{k \in Q_j} I_{j,k} P_{i,k}^{(n)}} \right]^{\beta_{i,j}},   (2.14)

P_{i,k}^{(n)} = \prod_{l \in J_k} \left( \gamma_{i,l}^{(n)} \right)^{I_{l,k}},   (2.15)

\beta_{i,j} = \frac{\sum_{k \in Q_j} I_{j,k} N_{i,k:k+1}}{\sum_{k \in Q_j} I_{j,k} \sum_{l \in J_k} I_{l,k} N_{i,k:k+1}}.   (2.16)
Only the denominator of equation 2.14 needs to be recomputed at each iteration. The product P_{i,k}^{(n)} is not computed anew for each parameter: it is computed once per iteration, and its values are looked up during the computation of the denominator for individual parameters. Precomputing the sets Q_j and J_k for later look-up reduces the computation time substantially, especially considering that the parameters are reestimated under several different contingencies during connectivity analyses. The ML estimate \hat{\alpha}_{i,j} of \alpha_{i,j} is computed as \hat{\alpha}_{i,j} = \log(\hat{\gamma}_{i,j}) using the invariance principle of ML estimation (Pawitan, 2001). For large T, the probability density function of the statistic \sqrt{T}(\hat{\alpha}_i - \alpha_i^0) is multivariate normal with mean 0 and covariance \Sigma^{-1}, where \alpha_i^0 is the true value of the parameter vector and \Sigma_{jl} = -\frac{1}{T} \frac{\partial^2 l_T(\alpha_i)}{\partial \alpha_{i,j} \partial \alpha_{i,l}} \big|_{\alpha_i = \alpha_i^0}. We use this relationship to compute the 100(1 - \alpha)\% confidence interval of \hat{\alpha}_{i,j} as \hat{\alpha}_{i,j} \pm \sqrt{\frac{1}{T} (\Sigma^{-1})_{jj}}\ \Phi^{-1}(1 - \frac{\alpha}{2}), where 0 \le \alpha \le 1 is the significance level, \Phi(\cdot) is the standard normal distribution function, and \Sigma is evaluated at \alpha_i = \hat{\alpha}_i.
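One sweep of the update in equations 2.14 through 2.16 can be sketched in a few lines of NumPy. This is a dense-array illustration only (the function name, array layout, and numerical guards are ours, not the paper's); a production implementation would iterate over the precomputed sparse sets Q_j and J_k rather than over full arrays.

```python
import numpy as np

def multiplicative_update(I, N, n_sweeps=200):
    """Sketch of the iterative ML update of equations 2.14-2.16.

    I : (D+1, K) covariate counts I_{l,k}; row 0 is all ones (baseline).
    N : (K,) spike counts N_{i,k:k+1} of the target neuron i.
    Returns the (D+1,) vector of multiplicative parameters gamma_i.
    """
    Dp1, K = I.shape
    nu = I @ N                                   # nu_j = sum_k I_{j,k} N_{i,k:k+1}
    # exponents beta_{i,j} of equation 2.16 (data-only, computed once)
    beta = nu / np.maximum((I * (I.sum(axis=0) * N)).sum(axis=1), 1e-12)
    gamma = np.full(Dp1, max(nu[0] / K, 1e-12))  # crude start: mean rate everywhere
    for _ in range(n_sweeps):
        # P_k = prod_l gamma_l^{I_{l,k}} (eq. 2.15), via logs for stability;
        # zero exponents contribute nothing, mimicking the J_k simplification
        P = np.exp(I.T @ np.log(gamma))
        denom = I @ P                            # sum_{k in Q_j} I_{j,k} P_k
        gamma = gamma * (nu / np.maximum(denom, 1e-12)) ** beta   # eq. 2.14
    return gamma
```

Here the loop simply runs for a fixed number of sweeps; section 2.3 below derives a principled starting vector and a data-specific stopping rule.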
1934
M. Okatan, M. Wilson, and E. Brown

2.3 Formulation of the Starting Vector and the Stopping Rule. It is desirable to start the iteration using a parameter vector that is as close to the solution as possible. Here, we present two equations that can be used to compute such a vector (see the appendix for details). In the following, the starting vector for neuron i is denoted by \gamma_i^{(1)}. For j > 0, the components \gamma_{i,j}^{(1)} of this vector are computed by solving the following equation:

p(\gamma_{i,j}^{(1)}) = \sum_{r=0}^{R_j} \mathrm{card}(Q_j^r)\,(r\,\nu_{i,0} - \nu_{i,j})\,(\gamma_{i,j}^{(1)})^r = 0.   (2.17)

Here, \nu_{i,j} = \sum_{k \in Q_j} I_{j,k} N_{i,k:k+1}, Q_j^r is the set of all time points k at which I_{j,k} = r, \mathrm{card}(Q_j^r) is the cardinality of Q_j^r, and R_j = \max_k(I_{j,k}). By Descartes' sign rule, at most one of the roots of the equation p(\gamma_{i,j}^{(1)}) = 0 is strictly positive (Struik, 1986). If this root exists, we use it as \gamma_{i,j}^{(1)}. If it does not exist, then some \gamma_{i,j} > 0 that minimizes |p(\gamma_{i,j})| may be used as \gamma_{i,j}^{(1)}. We compute \gamma_{i,0}^{(1)} as the average firing rate,

\gamma_{i,0}^{(1)} = \frac{\nu_{i,0}}{K}.   (2.18)
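A direct way to obtain the positive root of equation 2.17 numerically is to assemble the polynomial coefficients and call a standard root finder. The sketch below (variable names are ours; numpy.roots is used as the solver) assumes the counts I_{j,k} and N_{i,k:k+1} are available as arrays.

```python
import numpy as np

def starting_component(Ij, N):
    """Solve equation 2.17 for one component gamma_{i,j}^{(1)}.

    Ij : (K,) counts I_{j,k};  N : (K,) spike counts N_{i,k:k+1}.
    """
    nu0 = float(N.sum())                         # nu_{i,0}: total spike count
    nuj = float(np.sum(Ij * N))                  # nu_{i,j}
    Rj = int(Ij.max())                           # R_j = max_k I_{j,k}
    # coefficient of gamma^r: card(Q_j^r) * (r*nu0 - nuj), for r = 0..R_j
    coeffs = [np.sum(Ij == r) * (r * nu0 - nuj) for r in range(Rj + 1)]
    roots = np.roots(coeffs[::-1])               # numpy wants highest power first
    pos = [z.real for z in roots if abs(z.imag) < 1e-9 and z.real > 0]
    # Descartes' sign rule: at most one strictly positive root exists
    return pos[0] if pos else None               # caller falls back to minimizing |p|
```

The baseline component is then simply gamma_{i,0}^{(1)} = nu_{i,0} / K, per equation 2.18.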
We stop the iteration when the change in the parameter values is below a certain threshold. This is achieved by terminating the iteration when the following inequality is satisfied:

\delta\gamma_{i,n} = \max_j \left\{ \frac{\gamma_{i,j}^{(n+1)}}{\gamma_{i,j}^{(n)}} - 1,\ \frac{\gamma_{i,j}^{(n)}}{\gamma_{i,j}^{(n+1)}} - 1 \right\} < \theta.   (2.19)
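Equation 2.19 translates directly into code. A small helper (ours, not the paper's) that tests convergence of one sweep:

```python
import numpy as np

def converged(gamma_old, gamma_new, theta=1e-5):
    """Stopping rule of equation 2.19: the largest relative change of any
    parameter, measured in both directions, must fall below theta."""
    r = gamma_new / gamma_old                  # parameters are strictly positive
    return np.max(np.maximum(r, 1.0 / r) - 1.0) < theta
```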
When the iteration is stopped using the rule \delta\gamma_{i,n} < \theta, the change in the log likelihood is around \theta^2 \Lambda, where \Lambda = \sum_{j=0}^{D} \sum_{l=0}^{D} \int_0^T I_j(u) I_l(u)\, dN_i(u) (see the appendix for details). \Lambda is completely specified given the ensemble spike train data. Therefore, this relationship can be used to select a data-specific threshold \theta. \Lambda was smaller than 4.8 \times 10^4 for all place cells in our study. In our analysis, we used \theta = 10^{-5}, which resulted in a change in the log likelihood of less than 4.8 \times 10^{-6} at the end of the iterations.

2.4 Computing an Optimal Connectivity Matrix. We used a combination of the likelihood ratio test and the AIC or the BIC to determine the optimal order of the network model, which in this case corresponds to the optimal number and configuration of connections to include. Minimizing the AIC or the BIC strikes a balance between maximizing the log likelihood and reducing the number of parameters in the model (Box et al., 1994). We use these criteria and the likelihood ratio test as follows (see the appendix for details). We first compute the significance of each connection using the likelihood ratio test. This test determines whether the removal of one pair-wise connection from the network changes the likelihood significantly; it therefore computes this significance by taking into account all other pair-wise interactions. We obtain a new network configuration by discarding the
connections whose significance is lower than a variable threshold. We compute the AIC and the BIC for such configurations by reestimating the surviving model parameters. We find the thresholds that minimize these criteria. The optimal parameter vectors that minimize the AIC and the BIC are denoted by \alpha_{1:C}^{AIC} and \alpha_{1:C}^{BIC}, respectively. These vectors constitute optimal representations of the changes in the firing probability of neurons following a spike in any neuron in the network. They also imply the optimal connectivity matrices G^{AIC} and G^{BIC}. In these binary matrices, a 1 at the entry (i, c) indicates that the functional connection from neuron c to neuron i was kept in the optimal model. Here, the term connection refers to statistical dependencies between spike trains (neurons) and does not necessarily imply the existence of an anatomical connection between the corresponding neurons.

3 Applications

3.1 Simulation of Ensemble Spiking Activity. To demonstrate that the method successfully recovers inhibitory and excitatory interactions among neurons, we generated spike trains that had known interaction matrices using both types of interactions. We simulated the ensemble spiking activity as a multivariate point process using the discrete version of the time-rescaling method, with a temporal resolution of \Delta = 1 ms (Brown et al., 2002). The simulated ensembles consisted of 4, 8, and 20 model neurons, shown in Figures 1A, 4A, and 6A, respectively. The firing probability of each neuron depended on the activity of other neurons through inhibitory and excitatory interactions. Each neuron inhibited itself and had at least one inhibitory and one excitatory interaction with other neurons. These interactions are shown in Figure 1B and are described by the following functions,

\alpha^0(u) = -2 \sin(2\pi\, 0.08^{-1} u) \exp(-0.04^{-1} u),
\alpha^+(u) = 2 \sin(2\pi\, 0.06^{-1} u) \exp(-0.04^{-1} u),   (3.1)
\alpha^-(u) = -3 \sin(2\pi\, 0.12^{-1} u) \exp(-0.04^{-1} u),
where u is in seconds and \alpha^0(u), \alpha^+(u), and \alpha^-(u) represent the auto-inhibitory, excitatory, and cross-inhibitory interactions among neurons, respectively. In addition, each neuron had a spontaneous firing rate of 5 Hz. These functions are dimensionless and represent the logarithm of dimensionless multipliers of the spontaneous firing rate. They represent statistical dependencies between neurons without reference to physiological processes. To generate the simulated spike trains, we computed the firing probability of neuron i in the time interval (t, t + \Delta] as \lambda_i(t \mid \alpha_i, H_t)\Delta using equation 2.9 with a history duration of 120 ms and a spike counting window of W = 1 ms. This resulted in M = 120 parameters for each pair-wise
Figure 1: (A) The four-neuron network. Each neuron has a spontaneous firing rate of 5 Hz. Neurons interact through inhibitory (black) and excitatory connections. (B) The firing probability of each neuron is modulated by an auto-inhibitory interaction (\alpha^0(u)) in addition to the cross-inhibitory (\alpha^-(u)) and excitatory (\alpha^+(u)) interactions. (C) The interspike interval (ISI) distribution (top) and the power spectral density (PSD) of the spike train for neuron 1. The exponential function superimposed on the ISI distribution shows the distribution that would be obtained from a 5 Hz Poisson spike train.
interaction. We now explain how the parameters for neuron 1 were obtained from equation 3.1 in the four-neuron network shown in Figure 1A. In this network, each neuron excites its neighbor in the clockwise direction and inhibits its neighbor in the opposite direction. Using equation 2.8 with M = 120, the component \alpha_{1,(c-1)M+m} of the parameter vector \alpha_1 describes how the number of spikes fired by neuron c in the interval [t - mW, t - (m-1)W) influences the firing probability of neuron 1 at time t. In the present case, this interval contains at most one spike because W = 1 ms. When neuron 1 fires a spike at time t - mW, its effect on its own firing probability at time t is computed as \alpha^0(mW) using the auto-interaction function in equation 3.1. The effect of a spike that occurs within the interval [t - mW, t - (m-1)W) may be represented by the mean value of the
function \alpha^0(u) for u \in ((m-1)W, mW]. Since \alpha^0(u) does not change by a large amount within 1 ms intervals, we approximated this mean by the value of the function at the beginning of the interval. Therefore, the value of the parameter \alpha_{1,m} was computed as \alpha_{1,m} = \alpha^0([m-1]W) for 1 \le m \le 120. Similarly, for 1 \le m \le 120, the parameters that describe the influence of neurons 2 and 4 on neuron 1 were computed as \alpha_{1,M+m} = \alpha^-([m-1]W) and \alpha_{1,3M+m} = \alpha^+([m-1]W), respectively. On the other hand, the parameters \alpha_{1,2M+m} are all zero, since neuron 3 does not directly influence the firing of neuron 1. The parameter \alpha_{1,0} is equal to \log(5), since the spontaneous firing rate is 5 Hz for all neurons. Model parameters were obtained in an identical fashion for the other simulated neurons. The simulations were stopped after each neuron had fired at least 5000 spikes. An absolute refractory period of 1 ms was enforced by preventing spikes from being generated in adjacent time steps. We designed the simulated networks to have nontrivial connectivity while avoiding excessive visual complexity. These examples are intended to show that, for increasingly complex networks and under no a priori assumption about the connectivity, our algorithm has a high likelihood of identifying whatever connectivity is in the system. The network model generates both stationary and nonstationary spike trains, depending on whether its parameters satisfy certain constraints. Since our simulation and analysis methods do not impose any specific constraints on the parameters, the application of our method is not restricted to stationary spike trains.
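To make the construction above concrete, the kernels of equation 3.1 and the resulting parameter vector \alpha_1 of the four-neuron network can be written down directly. This is a sketch; the function and variable names are ours, not the paper's.

```python
import numpy as np

# Interaction kernels of equation 3.1 (u in seconds)
def alpha0(u):    # auto-inhibition, with an excitatory rebound near 56 ms
    return -2 * np.sin(2 * np.pi * u / 0.08) * np.exp(-u / 0.04)

def alpha_exc(u):  # excitation (alpha^+)
    return 2 * np.sin(2 * np.pi * u / 0.06) * np.exp(-u / 0.04)

def alpha_inh(u):  # cross-inhibition (alpha^-)
    return -3 * np.sin(2 * np.pi * u / 0.12) * np.exp(-u / 0.04)

def params_neuron1(W=0.001, M=120):
    """Parameter vector alpha_1 for neuron 1 of the four-neuron ring:
    self-inhibition, inhibition from neuron 2, nothing from neuron 3,
    excitation from neuron 4; each kernel is sampled at the left edge
    of its bin, alpha_{1,(c-1)M+m} = kernel((m-1) W)."""
    u = np.arange(M) * W            # (m-1)*W for m = 1..M
    return np.concatenate([
        [np.log(5.0)],              # alpha_{1,0}: log spontaneous rate (5 Hz)
        alpha0(u),                  # from neuron 1 (self)
        alpha_inh(u),               # from neuron 2
        np.zeros(M),                # from neuron 3 (no direct connection)
        alpha_exc(u),               # from neuron 4
    ])
```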
3.2 Analysis of the Simulated Ensemble Spiking Activity. We use the four-neuron network as a simple example to illustrate our method. Figure 1C shows the interspike interval (ISI) distribution (top) and the power spectral density (PSD) of the spike train for neuron 1. The exponential function superimposed on the ISI distribution shows the distribution that would be obtained from a 5 Hz Poisson spike train. These plots are representative of all neurons in this and the other simulated networks, due to the symmetry and the use of the same interaction functions in each network. It can be seen that the neuron has a certain tendency to fire with ISIs of about 60 ms, which shows up as a peak at around 16 Hz in the PSD. This structure can be explained in terms of the temporal distribution of excitation within the interaction functions (see Figure 1B). Namely, the auto-inhibition has an excitatory rebound that peaks at around 56 ms, and the excitation has two excitatory peaks separated by 60 ms. The former results in an interval of increased firing probability at around 56 ms after each firing of neuron 1, and the latter results in two intervals of increased firing probability, separated by about 60 ms, after each firing of an excitatory trigger neuron, which is neuron 4 in this case. For the parameter estimation, we used a history duration of 120 ms and W = 10 ms, which resulted in 12 parameters per connection. An optimal value of W can be found by trying different values and selecting the one
that minimizes the AIC or the BIC. This is done in the analysis of the place cell data in section 3.3. Here, the value of 10 ms was chosen intuitively to obtain a relatively small number of parameters per connection while keeping the temporal resolution high enough to capture the important features of the interaction functions. Figure 2A shows \hat{\alpha}_{1:4}, the ML estimate of the parameter vector \alpha_{1:4}, superimposed on the actual interaction functions. The ML estimates are shown by the black dots, and each estimate is plotted at the center of the 10 ms window that it is associated with. The error bars show the 99.99% confidence intervals of the estimates. The asterisks indicate the parameters that are significantly different from zero (p < 0.0001). The diagonal plots show the auto-inhibitory interaction estimates. The connectivity pattern implied by Figure 2A matches the actual diagram in Figure 1A. It can be seen that the temporal evolution of the interactions is captured accurately in the parameter values. The estimates of the spontaneous firing rates for each cell are shown in Figure 2B. Figure 2A shows that the parameters that represent the connections between the neuron pairs (1, 3) and (2, 4) are close to zero. As a result, these were the least significant connections in the network. Both the AIC and the BIC were minimized when these connections were removed from the model. The resulting optimal vectors \alpha_{1:4}^{AIC} and \alpha_{1:4}^{BIC} were identical. In these vectors, the parameters that correspond to the removed connections are zero. The other parameters were reestimated under this constraint, and their values were found to be very close to the values shown in Figure 2A. In other words, the optimal parameter vectors correctly identified the true matrix of excitatory and inhibitory interactions among the simulated neurons. The connectivity matrices G^{AIC} and G^{BIC} implied by \alpha_{1:4}^{AIC} and \alpha_{1:4}^{BIC} were also identical and are shown in Figure 2C.
In this matrix, connections that were kept in the optimal model are indicated by the black squares. The diagonal entries represent the dependence of the cells on their own activity history. This matrix corresponds to the true connectivity matrix of the network in Figure 1A. Note that the optimal connectivity matrices do not show a connection between neurons that have only indirect interactions. For instance, neuron 1 has a bidirectional interaction with neuron 3 through neuron 2; in other words, neuron 2 acts as an "interneuron" between neurons 1 and 3. This relation is successfully detected in Figures 2A and 2C without mistakenly concluding that neurons 1 and 3 interact directly. For comparison, we also computed the interaction estimates using auto- and cross-intensity functions (Brillinger, 1976a). The estimates for neuron 1 are shown in Figure 3. These estimates are representative of the other neurons of this network as well. Here, the square root of the instantaneous firing rate of neuron 1 is shown as a function of time after a trigger spike. Assuming that the neurons do not fire simultaneous spikes and that the bivariate point processes that consist of the spike trains of neuron 1 and each of
Figure 2: (A) The ML estimates of the parameter vectors computed using the algorithm, superimposed on the actual interaction functions. The estimates are shown by the black dots. The error bars show the 99.99% confidence intervals of the estimates. The asterisks indicate the parameters that are significantly different from zero (p < 0.0001). The interaction estimates are dimensionless and represent the logarithm of dimensionless numbers that modulate the spontaneous firing rate. (B) The spontaneous firing-rate estimates. (C) G^{AIC} and G^{BIC} were identical and corresponded to the true connectivity matrix. Black squares indicate the presence of a connection.
the four trigger neurons are stationary, the square root of the auto- or cross-intensity function has a nearly normal asymptotic distribution (Brillinger, 1976a). In each graph, the thick curve shows the transformed true interaction function. For instance, in the top left graph, this curve is the function \sqrt{5 \exp(\alpha^0(u))} for 0 \le u \le 120 ms, where \alpha^0(u) is given by equation 3.1 and
Figure 3: Square root of the auto- and cross-intensity functions for neuron 1. The thick curves show the transformed true interaction functions (see text for details). In each graph, the dashed line indicates the square root of the estimated average firing rate. Of the three parallel curves, the middle curve shows the square root of the auto- or cross-intensity function. The upper and lower curves show the 99.99% confidence interval for the estimates. The two raster traces below the curves show the points where the estimates significantly differ from the average firing rate (top raster trace) or the true interactions (bottom raster trace). The estimates differ significantly from the true interactions in 16% to 35% of the 120 ms interval.
5 Hz is the true spontaneous firing rate of neuron 1. The curves that show the true cross-interactions between neuron 1 and neurons 2 and 4 were obtained similarly, using \alpha^-(u) and \alpha^+(u), respectively, in place of \alpha^0(u). For neuron 3, the thick curve shows \sqrt{5}\ \mathrm{s}^{-1/2}, which is the square root of the spontaneous firing rate. In each graph, the dashed line indicates the square root of the estimated average firing rate of neuron 1, which has the same value in all plots. Of the three parallel curves, the middle curve shows the square root of the intensity function estimate. We computed this function using a bin size of 10 ms, such that its value at time \tau represents the square root of the average intensity in the interval (\tau - 5 \times 10^{-3}, \tau + 5 \times 10^{-3}] (Brillinger, 1976a). Therefore, these estimates can be directly compared with the interaction estimates computed using our method (see Figure 2A). The upper and lower curves show the 99.99% confidence interval around the estimates. The two raster traces below the curves show the points where the estimates
significantly differ from the estimated average firing rate (top raster trace) or the true interaction functions (bottom raster trace) (p < 0.0001). It can be seen that the interaction estimates are significantly different from the true interaction functions in certain segments of the 120 ms interval. These segments cover 16% to 35% of the 120 ms interval. This coverage is 40% to 60% using 95% confidence intervals (not shown). In contrast, the true interaction functions are within the confidence intervals of all parameters in Figure 2A. These results show that although the auto- and cross-intensity functions provide a descriptive estimate of the interactions, they could not recover the true interaction functions accurately. To test our algorithm on a larger network with a denser connectivity pattern, we designed a fully interconnected network of eight neurons. All neurons had the same properties as those in the previous network. In this network, each neuron receives only one excitatory connection, which is from the previous neuron in the clockwise direction, and six cross-inhibitory connections from the remaining neurons. This network is shown in Figure 4A, where only the connections made to three neurons are shown for clarity. The spontaneous firing-rate estimates and the best and worst cases of the interaction estimates, based on the mean squared error, are shown in Figures 4B and 4C, respectively. Even in the worst case, the estimates were very close to the true values of the parameters. Figure 5 shows \hat{\alpha}_{1:8}, the full matrix of the interaction estimates. The diagonal plots show the auto-inhibitory interactions. Each neuron is seen to have one excitatory interaction with the proper neuron and inhibitory interactions with the remaining neurons, in agreement with the true network connectivity. All entries in G^{AIC} and G^{BIC} were 1's, indicating that the optimal connectivity matrices corresponded to a fully interconnected network.
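The selection procedure of section 2.4 that yields G^{AIC} and G^{BIC} amounts to sweeping a significance threshold, refitting the surviving parameters, and scoring each configuration. A schematic version follows; the interface, including the refit_loglik callback that stands in for reestimating the surviving parameters, is hypothetical.

```python
import numpy as np

def optimal_connectivity(pvals, refit_loglik, n_params_per_conn, n_obs):
    """Schematic threshold sweep of section 2.4.

    pvals         : dict {connection: p-value} from the likelihood ratio test
    refit_loglik  : callback returning the maximized log likelihood of the
                    model restricted to a given set of kept connections
    Returns the kept-connection sets that minimize the AIC and the BIC."""
    thresholds = sorted(set(pvals.values())) + [1.1]   # sweep distinct p-values
    best_aic, best_bic = (np.inf, None), (np.inf, None)
    for th in thresholds:
        kept = {conn for conn, p in pvals.items() if p < th}  # discard weak links
        ll = refit_loglik(kept)                # reestimate surviving parameters
        k = n_params_per_conn * len(kept)      # number of free parameters
        aic = -2.0 * ll + 2.0 * k
        bic = -2.0 * ll + k * np.log(n_obs)
        if aic < best_aic[0]:
            best_aic = (aic, kept)
        if bic < best_bic[0]:
            best_bic = (bic, kept)
    return best_aic[1], best_bic[1]
```

The returned sets translate directly into binary connectivity matrices by marking each kept (target, trigger) pair.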
It also follows that the optimal parameter vectors \alpha_{1:8}^{AIC} and \alpha_{1:8}^{BIC} were identical to \hat{\alpha}_{1:8}. To try a network of nontrivial size, we constructed a network of 20 neurons using the four-neuron network as a building block, as shown in Figure 6A. This network contains 10 neuron triplets, such as the triplets (4←3→5) and (2←6→5), in which two neurons receive a shared input without having direct interactions. Therefore, this network can be used to test whether the method can differentiate between direct interaction and shared input. The optimal connectivity matrices G^{AIC} and G^{BIC} were identical and corresponded to the true connectivity matrix shown in Figure 6B. In this case, G^{AIC} corresponded to this matrix when the interactions were estimated at W = 5 ms resolution instead of 10 ms. At 10 ms resolution, G^{AIC} had three false-positive connections. This suggested that 12 parameters per connection did not provide a sufficiently accurate representation of the interactions, which caused the model to overfit the data by forcing some parameters to be significantly different from zero even though they were part of connections that did not exist in the original network. Indeed, after doubling
Figure 4: (A) The eight-neuron network. Same properties as in Figure 1. Each neuron receives only one excitatory connection, which is from the previous neuron in the clockwise direction, and six cross-inhibitory connections from the remaining neurons. Only the connections made to three neurons (8, 1, and 2) are shown for clarity. (B) The spontaneous firing-rate estimates. (C) The best and the worst cases of the interaction estimates based on the mean squared error. Same figure properties as in Figure 2.
the number of parameters per connection to 24 (a resolution of 5 ms), G^{AIC} corresponded to the true connectivity matrix. Also, the minimum AIC was smaller at 5 ms resolution, indicating that the true connectivity matrix was indeed the optimal solution across temporal resolutions, as expected. This suggests that in an unknown experimental situation, a range of values for W should be tried, and values that minimize the AIC or the BIC (or both) should be used in the analysis. On the other hand, G^{BIC} corresponded to the true connectivity matrix at both temporal resolutions, and the minimum BIC was lower at 10 ms resolution. Figure 6B shows that G^{AIC} and G^{BIC} do not indicate direct interactions between the neuron pairs that receive shared input without direct interaction, while the fact that they receive a shared input is visible in these matrices. This shows that the method can make this
Figure 5: The full matrix of the interaction estimates. The diagonal plots show the auto-interaction for each neuron. Each row has one excitatory interaction. The other plots show the cross-inhibitory interactions. Same figure properties as in Figure 2.
distinction successfully, provided that the shared input source is accurately modeled in the conditional intensity functions of the neurons. The spontaneous firing-rate estimates and the best and worst cases of the interaction estimates using a 10 ms resolution are shown in Figures 6C and 6D, respectively. As for the other networks, the estimates were highly accurate even in the worst case. The interaction estimates in Figures 2 and 4 through 6 show that the model fit the true interactions with high accuracy. The goodness-of-fit of the conditional intensity functions may be assessed by using Kolmogorov-Smirnov (KS) plots. The KS plots are obtained by rescaling the spike times of a spike train using the conditional intensity function that is fit to that spike train and comparing the distribution of the intervals between the rescaled times (rescaled ISIs) to the ISI distribution of a Poisson process with unit rate. The rescaling is done using the following formula,

\tau(u_{i,k}) = \int_0^{u_{i,k}} \lambda(u \mid H_u)\, du,   (3.2)
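A discrete approximation of equation 3.2 can be computed by cumulative summation of the fitted intensity on a time grid. The sketch below uses our own function names and assumes grid-based integration.

```python
import numpy as np

def rescaled_isis(spike_times, lam, dt):
    """Time rescaling of equation 3.2: tau(u_{i,k}) = int_0^{u_{i,k}} lambda(u|H_u) du,
    approximated on a grid of step dt. Under a well-fit model, the rescaled ISIs
    are i.i.d. exponential with unit rate (unit-rate Poisson ISI distribution)."""
    Lam = np.cumsum(lam) * dt                       # cumulative intensity on the grid
    idx = np.searchsorted(np.arange(len(lam)) * dt, spike_times)
    taus = Lam[np.clip(idx, 0, len(lam) - 1)]
    return np.diff(np.concatenate([[0.0], taus]))   # rescaled interspike intervals
```

A KS plot then compares the empirical distribution of 1 − exp(−rescaled ISIs) against the uniform distribution on [0, 1].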
Figure 6: (A) The 20-neuron network. Same properties as in Figure 1. (B) G^{AIC} and G^{BIC} were identical and corresponded to the true connectivity matrix. Black squares indicate the presence of a connection. (C) The spontaneous firing-rate estimates. (D) The best and the worst cases of the interaction estimates based on the mean squared error. Same figure properties as in Figure 2.
whe