REVIEW
Generalized Deformable Models, Statistical Physics, and Matching Problems
Alan L. Yuille
Division of Applied Science, Harvard University, Cambridge, MA 02138 USA
We describe how to formulate matching and combinatorial problems of vision and neural network theory by generalizing elastic and deformable templates models to include binary matching elements. Techniques from statistical physics, which can be interpreted as computing marginal probability distributions, are then used to analyze these models and are shown to (1) relate them to existing theories and (2) give insight into the relations between, and relative effectivenesses of, existing theories. In particular we exploit the power of statistical techniques to put global constraints on the set of allowable states of the binary matching elements. The binary elements can then be removed analytically before minimization. This is demonstrated to be preferable to existing methods of imposing such constraints by adding bias terms in the energy functions. We give applications to winner-take-all networks, correspondence for stereo and long-range motion, the traveling salesman problem, deformable template matching, learning, content addressable memories, and models of brain development. The biological plausibility of these networks is briefly discussed.

1 Introduction
A number of problems in vision and neural network theory that involve matching features between image frames, detecting features, learning, and solving combinatorial tasks (such as the traveling salesman problem) can be formulated in terms of minimizing cost functions usually involving either binary decision fields (for example, Hopfield and Tank 1985) or continuous fields, for example the elastic net algorithms (Durbin and Willshaw 1987). Elastic net models have been used for problems in vision, speech, and neural networks (Burr 1981a,b; Durbin and Willshaw 1987; Durbin et al. 1989; Kass et al. 1987; Terzopoulos et al. 1987). They are also related to work of Pentland (1987) and Fischler and Elschlager (1973). Deformable templates (Yuille et al. 1989b; Lipson et al. 1989) are simple parameterized shapes that interact with an image to obtain the best fit of their parameters. They have more structure than elastic nets and fewer degrees of freedom (numbers of parameters). Elastic nets can be, loosely, thought of as the limit of deformable templates as the number of parameters goes to infinity.

Neural Computation 2, 1-24 (1990) © 1990 Massachusetts Institute of Technology

In this paper we will generalize the ideas of elastic nets and deformable templates to include binary matching fields. The resulting models, generalized deformable models, will be applied to a range of problems in vision and neural net theory. These models will contain both binary and continuous valued fields. We will describe techniques, typically equivalent to computing marginal probability distributions, that can be used to convert these models into theories involving one type of field only. This will relate our model to existing theories and also give insight into the relationships between, and relative effectivenesses of, existing theories.

These techniques are best described using a statistical framework. Any problem formulated in terms of minimizing an energy function can be given a probabilistic interpretation by use of the Gibbs distribution (see, for example, Parisi 1988). This reformulation has two main advantages: (1) it relates energy functions to probabilistic models (see, for example, Kirkpatrick et al. 1983; Geman and Geman 1984), thereby allowing powerful statistical techniques to be used and giving the theory a good philosophical basis founded on Bayes' theorem (1783), and (2) it connects to ideas and techniques used in statistical physics (Parisi 1988). We will make use of techniques from these two sources extensively in this paper.

The connection to statistical physics has long been realized. The Hopfield models (Hopfield 1984; Hopfield and Tank 1985) and some of the associated vision models (Yuille 1989) can be shown to correspond to finding solutions to the mean field equations of a system at finite temperature. A number of algorithms proposed for these models will then correspond (Marroquin 1987) to deterministic forms of Monte Carlo algorithms.
The free parameter $\lambda$ in the Hopfield models corresponds to the inverse of the temperature and varying it gives rise to a form of deterministic annealing. An important advance, for computer vision, was taken by Geiger and Girosi (1989), who were studying models of image segmentation with an energy function $E(f, l)$ (Geman and Geman 1984) that depended on a field $f$, representing the smoothed image, and a binary field $l$ representing edges in the image. Geiger and Girosi (1989) observed that if the partition function of the system, defined by

$$Z = \sum_{f,\,l} e^{-\beta E(f,\,l)}$$

was known then the mean fields $\bar f$ and $\bar l$ could be calculated and used as minimum variance estimators (Gelb 1974) for the fields (equivalent to the maximum a posteriori estimators in the limit as $\beta \to \infty$). They showed that the contribution to the partition function from the $l$ field could be explicitly evaluated leaving an effective energy $E_{\rm eff}(f)$ depending on the $f$ field only. Moreover, this effective energy was shown to be closely related (and exactly equivalent for some parameter values) to that used by Blake (1983, 1989) in his weak constraint approach to image segmentation. Eliminating the $l$ field in this way can be thought of (Wyatt, private communication) as computing the marginal probability distribution $p_m(f) = \sum_l p(f, l)$, where $p(f, l)$ is defined by the Gibbs distribution $p(f, l) = e^{-\beta E(f,\,l)}/Z$ with $Z$ a normalization constant (the partition function). This yields $p_m(f) = e^{-\beta E_{\rm eff}(f)}/Z_{\rm eff}$, where $Z_{\rm eff}$ is a normalization constant.

It was then realized that using the partition function gave another important advantage since global constraints over the set of possible fields could be enforced during the evaluation of the partition function. This was emphasized by Geiger and Yuille (1989) in their analysis of the winner-take-all problem, discussed in detail in the next section. The constraint that there is only one winner can be imposed either by (1) adding a term in the energy function to bias toward final states with a single winner, or (2) evaluating the partition function for the system only over configurations with a single winner. For this problem method (2) is definitely preferable since it leads directly to a formula for the minimum variance estimators. On the other hand, (1) only leads to an algorithm that might converge to the minimum variance estimators, but that might also get trapped in a local minimum. This suggested that it was preferable to impose global constraints while evaluating the partition function since the alternative led to unnecessary local minima and increased computation. This conclusion was reinforced by applications to the correspondence problem in stereo (Yuille et al. 1989a) and long-range motion (see Section 3).
Both these problems can be posed as matching problems with some prior expectations on the outcome. Further support for the importance of imposing global constraints in this manner came from work on the traveling salesman problem (TSP). It has long been suspected that the greater empirical success (Wilson and Pawley 1988; Peterson 1990) of the elastic net algorithm (Durbin and Willshaw 1987; Durbin et al. 1989) compared to the original Hopfield and Tank (1985) algorithm was due to the way the elastic net imposed its constraints. It was then shown (Simic 1990) that the elastic net can be obtained naturally (using techniques from statistical mechanics) from the Hopfield and Tank model by imposing constraints in this global way. In Section 4 we will provide an alternative proof of the connection between the Hopfield and Tank and elastic net energy functions by showing how they can both be obtained as special cases of a more general energy function based on generalized deformable models with binary matching fields (I was halfway through the proof of this when I received a copy of Simic's excellent preprint). Our proof seems more direct than Simic's and is based on our results for stereo (Yuille et al. 1989b).
In related work Peterson and Soderberg (1989) also imposed global constraints on the Hopfield and Tank model by mapping it into a Potts glass. They also independently discovered the relation between the elastic net and Hopfield and Tank (Peterson, private communication). In benchmark studies described at the Keystone Workshop 1989 (Peterson 1990) both the elastic network algorithm and the Peterson and Soderberg model gave good performance on problems with up to 200 cities (simulations with larger numbers of cities were not attempted). This contrasts favorably with the poor performance of the Hopfield and Tank model for more than 30 cities. It strongly suggests that the global constraints on the tours should be enforced while evaluating the partition function (a "hard" constraint in Simic's terminology) rather than by adding additional terms to bias the energy (a "soft" constraint). Generalized deformable models using binary matching elements are well suited to imposing global constraints effectively. We show in Section 5 that our approach can also be applied to (1) matching using deformable templates, (2) learning, (3) content addressable memories, and (4) models of brain development. The Boltzmann machine (Hinton and Sejnowski 1986) is another powerful stochastic method for learning that can also be applied to optimization in vision (Kienker et al. 1986). The application of mean field theory to Boltzmann machines (Peterson and Anderson 1987; Hinton 1989) has led to speed ups in learning times.

2 Using Mean Field Theory to Solve the Winner Take All
We introduce mean field (MF) theory by using it to solve the problem of winner take all (WTA). This can be posed as follows: given a set $\{T_i\}$ of $N$ inputs to a system, how does one choose the maximum input and suppress the others? For simplicity we assume all of the $T_i$ to be positive. We introduce the binary variables $V_i$ as decision functions: $V_w = 1$, $V_i = 0$, $i \neq w$ selects the winner $w$. We will calculate the partition function $Z$ in two separate ways for comparison. The first uses a technique called the mean field approximation and gives an approximate answer; it is related to previous methods for solving WTA using Hopfield networks. The second method is novel and exact. It involves calculating the partition function for a subset of the possible $V_i$, a subset chosen to ensure that only one $V_i$ is nonzero.
2.1 WTA with the Mean Field Approximation. Define the energy function

$$E_{\rm WTA}[\{V_i\}] = \nu \sum_{i<j} V_i V_j - \sum_i T_i V_i \qquad (2.1)$$

where $\nu$ is a parameter to be specified. The solution of the WTA will have all the $V_i$ to be zero except for the one corresponding to the maximum $T_i$. This constraint is imposed implicitly by the first term on the right-hand side of 2.1 (note that the constraint is encouraged rather than explicitly enforced). It can be seen that the minimum of the energy function corresponds to the correct solution only if $\nu$ is larger than the maximum possible input $T_i$ (otherwise final states with more than one nonzero $V_i$ may be energetically favorable for some inputs).

Now we formulate the problem statistically. The energy function above defines the probability of a solution for $\{V_i\}$ to be $P(\{V_i\}) = (1/Z)\,e^{-\beta E_{\rm WTA}[\{V_i\}]}$, where $\beta$ is the inverse of the temperature parameter. The partition function is

$$Z = \sum_{\{V_i = 0,1\}} e^{-\beta E_{\rm WTA}[\{V_i\}]} \qquad (2.2)$$

Observe that the mean values $\bar V_i$ of the $V_i$ can be found (substituting 2.1 into 2.2) from the identity

$$\bar V_i = \frac{1}{Z}\sum_{\{V_i = 0,1\}} V_i\, e^{-\beta E_{\rm WTA}[\{V_i\}]} = \frac{1}{\beta}\frac{\partial \ln Z}{\partial T_i} \qquad (2.3)$$

We now introduce the mean field approximation by using it to compute the partition function $Z$

$$Z_{\rm approx} = \prod_i \sum_{V_i = 0,1} e^{-\beta V_i \left(\nu \sum_{j \neq i} \bar V_j - T_i\right)} = \prod_i \left\{1 + e^{-\beta\left(\nu \sum_{j \neq i} \bar V_j - T_i\right)}\right\} \qquad (2.4)$$

When calculating the contribution to $Z$ from a specific element $V_i$ the mean field approximation replaces the values of the other elements $V_j$ by their mean values $\bar V_j$. This assumes that only low order correlations between elements are important (Parisi 1988). From $Z_{\rm approx}$ we compute the mean value $\bar V_i = (1/\beta)\,(\partial \ln Z_{\rm approx}/\partial T_i)$ and obtain some consistency conditions, the mean field equations, on the $\bar V_i$

$$\bar V_i = \frac{1}{1 + e^{\beta\left(\nu \sum_{j \neq i} \bar V_j - T_i\right)}} \qquad (2.5)$$

This equation may have several solutions, particularly as $\beta \to \infty$. We can see this by analyzing the case with $N = 2$. In the limit as $\beta \to \infty$ solutions of 2.5 will correspond to minima of the energy function given by 2.1. The energy will take values $-T_1$ and $-T_2$ at the points $(V_1, V_2) = (1, 0)$ and $(V_1, V_2) = (0, 1)$. On the diagonal line $V_1 = V_2 = V$ separating these points the energy is given by $E(V) = \nu V^2 - V(T_1 + T_2)$. Using the condition that $\nu > T_{\max}$ we can see that for possible choices of $T_1, T_2$ (with $T_1 = \nu - \epsilon_1$, $T_2 = \nu - \epsilon_2$ for small $\epsilon_1$ and $\epsilon_2$) the energy on the diagonal is larger than $-T_1$ and $-T_2$, hence the energy function has at least two minima.
There are several ways to solve 2.5. One consists of defining a set of differential equations for which the fixed states are a solution of 2.5. An attractive method (described in Amit 1989), applied to other problems in vision (Marroquin 1987) and shown to be a deterministic approximation to stochastic algorithms (Metropolis et al. 1953) for obtaining thermal equilibrium, is given by equation 2.6. If the initial conditions for the $\bar V_i$ satisfy an ordering constraint for all $i$ (which can be satisfied by initially setting the $\bar V_i$ to be the same), then, by adapting an argument from Yuille and Grzywacz (1989b), the system will converge to the correct solution. However, if the ordering condition is violated at any stage, due to noise fluctuations, then it may give the wrong answer.
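The dependence on initial conditions can be illustrated numerically. The following is a minimal sketch, not the scheme of equation 2.6: it assumes the mean field condition $\bar V_i = 1/(1 + e^{\beta(\nu \sum_{j \neq i} \bar V_j - T_i)})$ and relaxes toward it with a simple Euler step. The function name `relax_mean_field`, the step size, and the input values are illustrative assumptions.

```python
import math

def relax_mean_field(T, nu, beta, v_init, dt=0.1, steps=2000):
    """Euler relaxation dV_i/dt = -V_i + f_i(V), where (assumed form)
    f_i(V) = 1 / (1 + exp(beta * (nu * sum_{j != i} V_j - T_i))).
    Fixed points of this flow satisfy the mean field condition."""
    v = list(v_init)
    for _ in range(steps):
        total = sum(v)
        # Evaluate every target from the same old state (simultaneous update).
        target = [1.0 / (1.0 + math.exp(beta * (nu * (total - v[i]) - T[i])))
                  for i in range(len(v))]
        v = [vi + dt * (ti - vi) for vi, ti in zip(v, target)]
    return v

T = [0.9, 0.8]          # input 0 is the true maximum
nu, beta = 1.0, 50.0    # nu exceeds every input; beta is a low temperature

# Equal initial values respect the ordering, so the larger input wins.
good = relax_mean_field(T, nu, beta, v_init=[0.5, 0.5])
# Starting near the other corner leaves the system at the wrong fixed point.
bad = relax_mean_field(T, nu, beta, v_init=[0.0, 1.0])
print(good, bad)
```

With these hypothetical values the two runs settle at different corners of the unit square, which is the local-minimum behavior the text attributes to the mean field approach.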
2.2 WTA without the Mean Field Approximation. We now impose the constraint that the $V_i$ sum to 1 explicitly during the computation of the partition function. The first term on the right-hand side of 2.1 is now unnecessary and we use an energy function

$$E_{\rm WTA}[\{V_i\}] = -\sum_i T_i V_i \qquad (2.7)$$

We compute $Z$ by summing over all possible (binary) $V_i$ under the constraint that they sum to 1. This gives

$$Z = \sum_{\{V_i = 0,1\}:\ \sum_i V_i = 1} e^{-\beta E_{\rm WTA}[\{V_i\}]} = \sum_i e^{\beta T_i} \qquad (2.8)$$

In this case no approximation is necessary and we obtain

$$\bar V_i = \frac{e^{\beta T_i}}{\sum_j e^{\beta T_j}} \qquad (2.9)$$

Thus, as $\beta \to \infty$ the $V_i$ corresponding to the largest $T_i$ will be switched on and the others will be off. This method gives the correct solution and needs minimal computation. This result can be obtained directly from the Gibbs distribution for the energy function 2.1. The probabilities can be calculated

$$P(V_i = 1,\ V_j = 0\ \ \forall j \neq i) = \frac{e^{\beta T_i}}{Z} \qquad (2.10)$$

where $Z$ is the normalization factor and it follows directly that the means are given by 2.9. This second approach to the WTA is clearly superior to the first. In the remainder of the paper we will extend the approach to more complex problems.
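The closed form obtained from the constrained partition function, $\bar V_i = e^{\beta T_i} / \sum_j e^{\beta T_j}$, is simple to evaluate. The sketch below is only an illustration: the function name `wta_means` and the input values are assumptions, and subtracting the maximum input before exponentiating is a standard numerical safeguard rather than part of the derivation.

```python
import math

def wta_means(T, beta):
    """Means V_i = exp(beta * T_i) / sum_j exp(beta * T_j)."""
    t_max = max(T)
    # Shifting by the largest input leaves the ratio unchanged and avoids overflow.
    weights = [math.exp(beta * (t - t_max)) for t in T]
    z = sum(weights)
    return [w / z for w in weights]

T = [0.3, 0.9, 0.8]     # hypothetical inputs; index 1 is the maximum
for beta in (1.0, 10.0, 100.0):
    print(beta, [round(v, 4) for v in wta_means(T, beta)])
```

As $\beta$ grows the means concentrate on the largest input, with no iteration and no dependence on an initial state.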
3 Long-Range Motion Correspondence and Stereo
Both the vision problems of stereo and long-range motion can be formulated in terms of a correspondence problem between features in a set of images. We will chiefly concentrate on long-range motion since the application of these statistical ideas to stereo and some connections to psychophysical experiments are discussed elsewhere (Yuille et al. 1989a). However, many of the results for long-range motion can be directly adapted to stereo. In his seminal work Ullman (1979) formulated long-range motion as a correspondence problem between features in adjacent image time frames and proposed a minimal mapping theory, which could account for a number of psychophysical experiments. The phenomenon of motion inertia (Ramachandran and Anstis 1983; Anstis and Ramachandran 1987) shows that the past history of the motion is also relevant. However, recent theoretical studies (Grzywacz et al. 1989) and comparisons between theoretical predictions and experiments (Grzywacz 1987, 1990) suggest that the past history can often, though not always, be neglected. We will make this assumption during this section. Ullman's minimal mapping theory (1979) proposes minimizing a cost function

$$E[\{V_{ij}\}] = \sum_{i,j} V_{ij}\, d_{ij} \qquad (3.1)$$
where the $\{V_{ij}\}$s are binary matching elements ($V_{ij} = 1$ if the $i$th feature in the first frame matches the $j$th feature in the second frame, $V_{ij} = 0$ otherwise) and $d_{ij}$ is a measure of the distance between the $i$th point $x_i$ in the first frame and the $j$th point $y_j$ in the second frame ($d_{ij} = |x_i - y_j|$, for example). The cost function is minimized with respect to the $\{V_{ij}\}$s while satisfying the cover principle: we must minimize the total number of matches while ensuring that each feature is matched. If there are an equal number of features in the two frames then the cover principle ensures that each feature has exactly one match.

An alternative theory was proposed by Yuille and Grzywacz (1988, 1989a), the motion coherence theory for long-range motion, which formulated the problem in terms of interpolating a velocity field $v(x)$ between the data points. Supposing there are $N$ features $\{x_i\}$ in the first frame and $N$ features $\{y_j\}$ in the second they suggest minimizing

$$E[\{V_{ij}\}, v(x)] = \sum_{i,j} V_{ij}\,\big(y_j - x_i - v(x_i)\big)^2 + \lambda \sum_{n=0}^{\infty} \frac{\sigma^{2n}}{n!\,2^n} \int |D^n v|^2\, dx \qquad (3.2)$$

with respect to $\{V_{ij}\}$ and $v(x)$, where $D^{2n} v = \nabla^{2n} v$ and $D^{2n+1} v = \nabla(\nabla^{2n} v)$. The second term on the right-hand side of 3.2 is similar to the standard smoothness terms used in vision (Horn 1986; Bertero et al. 1987), but the use of it in conjunction with binary matching fields was novel. Yuille and Grzywacz (1989a) showed that by minimizing $E[\{V_{ij}\}, v(x)]$ with respect to the velocity field we can obtain a linear set of equations for $v(x)$ in terms of the $\{V_{ij}\}$ and the data. Substituting this back into the energy function gives equation 3.3, where we have used the summation convention on the repeated indices $i$ and $j$, and with $d_{ia} = x_i - y_a$. The function $G_{ij} = G(x_i - x_j : \sigma)$ is a Gaussian with standard deviation $\sigma$ (it is the Green function for the smoothness operator). It ensures a smooth velocity field and the range of the smoothing depends on $\sigma$. If we take the limit as $\sigma \to 0$ we see that the motion coherence theory becomes equivalent to the minimal mapping theory with the squared distance norm (this is because $(\lambda \delta_{ij} + G_{ij})^{-1} \propto \delta_{ij}$). For nonzero values of $\sigma$ the motion coherence theory gives a smoother motion than minimal mapping and seems more consistent with experiments on motion capture (Grzywacz et al. 1989).

Minimal mapping and motion coherence are both formulated in terms of cost functions, which must be minimized over a subset of the possible configurations of the binary variables. It is natural to wonder whether we can repeat our success with the winner take all and compute the partition function for these theories. For minimal mapping we seek to compute

$$Z = \sum_{\{V_{ij}\}} e^{-\beta \sum_{i,j} V_{ij}\, d_{ij}} \qquad (3.4)$$
where the sum is taken over all possible matchings that satisfy the cover principle. The difficulty is that there is no natural way to enforce the cover principle and we are reduced to evaluating the right-hand side term by term, which is combinatorially explosive for large numbers of points [a method for imposing constraints of this type by using dummy fields was proposed by Simic (1990) for the traveling salesman problem; however, this will merely transform minimal mapping theory into a version of the motion coherence theory]. If we impose instead the weaker requirement that each feature in the first frame is matched to a unique feature in the second (i.e., for each $i$ there exists $j$ such that $V_{ij} = 1$ and $V_{ik} = 0$ for $k \neq j$) then we will still obtain poor matching since, without the cover principle, minimal mapping theory reduces to "each feature for itself" and is unlikely to give good coherent results (for example, all features in the first frame might be matched to a single feature in the second frame).
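Both points in this paragraph can be seen on a toy example. The sketch below is an illustration under stated assumptions, not Ullman's algorithm: it uses hypothetical one-dimensional features, equal numbers of features in the two frames (so the cover principle reduces to one-to-one matching), and exhaustive enumeration, which grows as $N!$.

```python
from itertools import permutations

def cost(xs, ys, match):
    """Minimal mapping cost sum_i |x_i - y_match[i]|."""
    return sum(abs(xs[i] - ys[j]) for i, j in enumerate(match))

def minimal_mapping(xs, ys):
    """Exhaustive search over one-to-one matchings (N! candidates)."""
    return min(permutations(range(len(ys))), key=lambda m: cost(xs, ys, m))

def nearest_only(xs, ys):
    """Weaker requirement: each first-frame feature picks its own nearest match."""
    return [min(range(len(ys)), key=lambda j: abs(x - ys[j])) for x in xs]

xs = [0.0, 1.05, 2.0]   # hypothetical first-frame features
ys = [1.0, 1.2, 5.0]    # hypothetical second-frame features

best_perm = minimal_mapping(xs, ys)
nearest = nearest_only(xs, ys)
print(best_perm, nearest)
```

In this example the nearest-only rule assigns two first-frame features to the same second-frame feature and leaves one second-frame feature unmatched, while the one-to-one search has to examine every permutation.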
The situation is rather different for the motion coherence theory. Although it is equally hard to impose the cover principle the requirement that each feature in the first frame has a unique match will still tend to give a coherent motion because of the overall smoothness requirement on the velocity field (for nonzero $\sigma$). We now compute the partition function using this requirement.

$$Z = \sum_{\{V_{ij}\},\, v} e^{-\beta\left\{\sum_{i,j} V_{ij}\,(y_j - x_i - v(x_i))^2 + E_s[v]\right\}} \qquad (3.5)$$

where $E_s[v]$ is the smoothness term (the second term on the right-hand side of 3.2) and the "sum" over $v(x)$ is really an integral over all possible velocity fields. Computing the sum over all binary elements satisfying the unique match requirement from frame 1 to frame 2 gives

$$Z = \sum_{v} e^{-\beta E_s[v]} \prod_i \Big\{ \sum_j e^{-\beta\,(y_j - x_i - v(x_i))^2} \Big\} \qquad (3.6)$$

This can be rewritten as

$$Z = \sum_{v} e^{-\beta E_{\rm eff}[v]} \qquad (3.7)$$

where the effective energy is

$$E_{\rm eff}[v] = -\frac{1}{\beta} \sum_i \log \Big\{ \sum_j e^{-\beta\,(y_j - x_i - v(x_i))^2} \Big\} + E_s[v] \qquad (3.8)$$

It is straightforward to check that the marginal distribution of $v$ is given by

$$p(v) = \frac{e^{-\beta E_{\rm eff}[v]}}{Z_{\rm eff}} \qquad (3.9)$$

where $Z_{\rm eff}$ is a normalization constant. Thus minimization of $E_{\rm eff}[v]$ with respect to $v(x)$ is equivalent to taking the maximum a posteriori estimate of $v$ given the marginal probability distribution. Observe that as $\beta \to \infty$ the energy function $E_{\rm eff}[v]$ diverges unless each feature $x_i$ in the first frame is matched to at least one feature $y_j$ in the second frame and is assigned a velocity field $v(x_i) \approx (y_j - x_i)$ (if this was not the case then we would get a contribution to the energy of $-\log 0 = +\infty$ from the $i$th feature in the first frame). Thus, the constraint we imposed during our summation (going from 3.5 to 3.8) is expressed directly in the resulting effective energy.
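The matching part of the effective energy can be evaluated stably with a log-sum-exp shift. The sketch below is a minimal illustration that assumes the data term has the form $-(1/\beta)\sum_i \log \sum_j e^{-\beta (y_j - x_i - v(x_i))^2}$, uses hypothetical one-dimensional features, and omits the smoothness term $E_s[v]$ entirely; the name `matching_energy` is an assumption.

```python
import math

def matching_energy(xs, ys, vs, beta):
    """Data term of the effective energy for 1-D features:
    -(1/beta) * sum_i log sum_j exp(-beta * (y_j - x_i - v_i)**2)."""
    total = 0.0
    for x, v in zip(xs, vs):
        sq = [(y - x - v) ** 2 for y in ys]
        m = min(sq)  # shift so the largest exponent is exactly zero
        s = sum(math.exp(-beta * (d - m)) for d in sq)
        total += m - math.log(s) / beta
    return total

xs = [0.0, 1.0]          # hypothetical first-frame features
ys = [0.5, 1.4]          # hypothetical second-frame features
vs = [0.5, 0.5]          # candidate velocities sampled at the x_i

for beta in (1.0, 10.0, 1000.0):
    print(beta, matching_energy(xs, ys, vs, beta))
```

For large $\beta$ each feature contributes the squared distance to its best match, so the value approaches the sum of per-feature minima; for small $\beta$ every candidate match contributes and the term is smoother in $v$.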
This suggests a method to force unique correspondence between the two frames. Why not make the effective energy symmetric by adding a term

$$-\frac{1}{\beta} \sum_j \log \Big\{ \sum_i e^{-\beta\,(y_j - x_i - v(x_i))^2} \Big\} \qquad (3.10)$$

By a similar argument to the one above minimizing this cost function will ensure that every point in the second frame is matched to a point in the first frame. Minimizing the energy $E_{\rm eff}[v(x)]$ plus the term 3.10 will therefore ensure that all points are matched (as $\beta \to \infty$). We have not explicitly ensured that the matching is symmetric (i.e., if $x_i$ in the first frame is matched to $y_j$ in the second frame then $y_j$ is matched to $x_i$); however, we argue that this will effectively be imposed by the smoothness term (nonsymmetric matches will give rise to noncoherent velocity fields). We can obtain the additional term 3.10 by modifying the cost function of the motion coherence theory to be

$$E[\{V_{ij}\}, \{U_{ij}\}, v(x)] = \sum_{i,j} V_{ij}\,\big(y_j - x_i - v(x_i)\big)^2 + \sum_{i,j} U_{ij}\,\big(y_j - x_i - v(x_i)\big)^2 + \lambda \sum_{n=0}^{\infty} \frac{\sigma^{2n}}{n!\,2^n} \int |D^n v|^2\, dx \qquad (3.11)$$
where the $\{V_{ij}\}$ and $\{U_{ij}\}$ are binary matching elements from the first to the second frame and from the second to the first, respectively. As we sum over the possible configurations of the $\{V_{ij}\}$ and $\{U_{ij}\}$ we restrict the configurations to having unique matches for all points in the first frame and the second frame, respectively. This gives us an effective energy for the motion coherence theory, which is somewhat analogous to the elastic net algorithm for the traveling salesman problem (Durbin and Willshaw 1987; Durbin et al. 1989), and which can be minimized by similar deterministic annealing techniques (steepest descent with $\beta$ gradually increased). Current simulations demonstrate the effectiveness of this approach and, in particular, the advantages of matching from both frames.

Notice in the above discussion the play off between the set of configurations we sum over and the a priori constraints (the smoothness term $E_s[v]$). We wish to impose as many constraints as possible during the computation of the partition function, but not at the expense of having to explicitly compute all possible terms (which would be a form of exhaustive search). Instead we impose as many constraints on the configurations as we can that make the resulting effective energy simple and rely on the a priori terms to impose the remaining constraints by biases in the energy. This play off between the set of configurations we sum over and the energy biases caused by the a priori terms occurs in many applications of this approach. Notice that minimal mapping theory does not contain any a priori terms and hence cannot make use of this play off. We should point out, however, that there are ways to minimize the minimal mapping energy using an analogue algorithm based on a formulation in terms of linear programming (Ullman 1979).

We can also obtain an effective energy for the velocity field if line processes (Geman and Geman 1984) are used to break the smoothness constraint. The line process field and the matching elements can be integrated out directly. This calculation has been performed for stereo (Yuille et al. 1989a). Other examples are discussed in Clark and Yuille (1990).

4 The Traveling Salesman Problem

The traveling salesman problem (TSP) consists of finding the shortest possible tour through a set of cities $\{x_i\}$, $i = 1, \ldots, N$, visiting each city once only. We shall refer to tours passing through each city exactly once as admissible tours and everything else as inadmissible tours. We propose to formulate the TSP as a matching problem with a set of hypothetical cities $\{y_j\}$, $j = 1, \ldots, N$, which have a prior distribution on them biasing toward minimum length. We can write this in the form

$$E[\{V_{ij}\}, \{y_j\}] = \sum_{i,j} V_{ij}\, |x_i - y_j|^2 + \nu \sum_j |y_j - y_{j+1}|^2 \qquad (4.1)$$
where the $V_{ij}$ are binary matching elements as before. $V_{ij} = 1$ if $x_i$ is matched to $y_j$ and is 0 otherwise. The second term on the right-hand side of 4.1 minimizes the square of the distances; other choices will be discussed later. The idea is to define a Gibbs distribution

$$P[\{V_{ij}\}, \{y_j\}] = \frac{e^{-\beta E[\{V_{ij}\}, \{y_j\}]}}{Z} \qquad (4.2)$$

for the states of the system. Our "solutions" to the TSP will correspond to the means of the fields $\{V_{ij}\}$ and $\{y_j\}$ as $\beta \to \infty$. We must, however, impose the constraint that the matrix $V_{ij}$ has exactly one "1" in each row and each column to guarantee an admissible tour. Observe that if the $\{V_{ij}\}$s are fixed the probability distributions for the $\{y_j\}$s are products of Gaussians.

The relation between Hopfield and Tank (1985) and the elastic net algorithm (Durbin and Willshaw 1987) follows directly from 4.1. To obtain Hopfield and Tank we eliminate the $\{y_j\}$ field from 4.1 to obtain an energy depending on the $\{V_{ij}\}$ only. The elastic net is obtained by averaging out the $\{V_{ij}\}$s. The relation between the two algorithms has been previously shown by Simic (1990) using different, though closely related, techniques from statistical mechanics. The elastic net algorithm seems to perform better than the Hopfield and Tank algorithm on large numbers of cities. The Hopfield and Tank algorithm has problems for more than 30 cities (Wilson and Pawley 1988), while the elastic net still yields good solutions for at least 200 cities (see benchmark studies at the Keystone Workshop 1989 presented in Peterson 1990).

4.1 Obtaining the Elastic Net Algorithm. We start by writing the partition function

$$Z = \sum_{\{V_{ij}\},\, \{y_j\}} e^{-\beta E[\{V_{ij}\}, \{y_j\}]} \qquad (4.3)$$
where the sum is taken over all the possible states of the $\{V_{ij}\}$s and the $\{y_j\}$s (strictly speaking the $\{y_j\}$s are integrated over, not summed over). We now try evaluating $Z$ over all possible states of the $\{V_{ij}\}$. As noted previously, we are guaranteed a unique tour if and only if the matrix $V_{ij}$ contains exactly one "1" in each row and each column. Enforcing this constraint as we compute the partition function leads, however, to a very messy expression for the partition function and involves a prohibitive amount of computation. Instead we choose to impose a weaker constraint: that each city $x_i$ is matched to exactly one hypothetical city $y_j$ (i.e., for each $i$, $V_{ij} = 1$ for exactly one $j$). Intuitively this will generate an admissible tour since only states with each $x_i$ matched to one $y_j$ will be probable. In fact one can prove that this will happen for large $\beta$ (Durbin et al. 1989), given certain conditions on the constant $\nu$. Thus, our computation of the partition function will involve some inadmissible tours, but these will be energetically unfavorable. Writing the partition function as

$$Z = \sum_{\{V_{ij}\},\, \{y_j\}} e^{-\beta \sum_{i,j} V_{ij} |x_i - y_j|^2}\; e^{-\beta \nu \sum_j |y_j - y_{j+1}|^2} \qquad (4.4)$$

we perform the sum over the $\{V_{ij}\}$s using the constraint that for each $i$ there exists exactly one $j$ such that $V_{ij} = 1$. This gives

$$Z = \sum_{\{y_j\}} e^{-\beta \nu \sum_j |y_j - y_{j+1}|^2} \prod_i \Big\{ \sum_j e^{-\beta |x_i - y_j|^2} \Big\} \qquad (4.5)$$

This can be written as

$$Z = \sum_{\{y_j\}} e^{-\beta E_{\rm eff}[\{y_j\}]} \qquad (4.6)$$

where

$$E_{\rm eff}[\{y_j\}] = -\frac{1}{\beta} \sum_i \log \Big\{ \sum_j e^{-\beta |x_i - y_j|^2} \Big\} + \nu \sum_j |y_j - y_{j+1}|^2 \qquad (4.7)$$

is exactly the Durbin-Willshaw-Yuille energy function for the elastic net algorithm (Durbin and Willshaw 1987). Finding the global minimum of $E_{\rm eff}[\{y_j\}]$ with respect to the $\{y_j\}$ in the limit as $\beta \to \infty$ will give the shortest possible tour (assuming the square norm of distances between cities). Durbin and Willshaw (1987) perform a version of deterministic annealing, minimizing $E_{\rm eff}[\{y_j\}]$ with respect to the $\{y_j\}$ by steepest descent for small $\beta$ and then gradually increasing $\beta$ (which corresponds to $1/K^2$ in their notation). Computer simulations show that this gives good solutions to the TSP for at least 200 cities (see Keystone Workshop simulations reported in Peterson 1990). This algorithm was analyzed in Durbin et al. (1989) who showed that there was a minimum value of $\beta$ (maximum value of $K$) below which nothing interesting happened and which could therefore be used as a starting value (it corresponds to a phase transition in statistical mechanics terminology).

The present analysis gives a natural probabilistic interpretation of the elastic net algorithm in terms of the probability distribution 4.2 (see also Simic 1990). It corresponds to the best match of a structure $\{y_j\}$ with a priori constraints (minimal squared length) to the set of cities $\{x_i\}$. This is closely related to the probabilistic interpretation in Durbin et al. (1989) where the $\{x_i\}$ were interpreted as corrupted measurements of the $\{y_j\}$.

There is nothing sacred in using a squared norm for either term in 4.3. The analysis would have been essentially similar if we had, for example, simply replaced it by terms like $V_{ij}\,|x_i - y_j|$ and $|y_j - y_{j+1}|$. This would give an alternative probability distribution. There do, however, seem to be severe implementation problems that arise when the modulus is used instead of the square norm (Durbin, personal communication). Until these difficulties are overcome the square norm seems preferable.

4.2 Obtaining Hopfield and Tank. The Hopfield and Tank algorithm can be found by eliminating the $\{y_j\}$s from the energy 4.1. This could be done by integrating them out of the partition function (observe that this would be straightforward since it is equivalent to integrating products of Gaussians). Instead we propose, following Yuille and Grzywacz (1989a), to eliminate them by directly minimizing the energy with respect to the $\{y_j\}$s, solving the linear equations for the $\{y_j\}$s and substituting back into the energy to obtain a new energy dependent only on the $\{V_{ij}\}$s. (The two methods should give identical results for quadratic energy functions.) This process will give us an energy function very similar to Hopfield and Tank, provided we make a crucial approximation.
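The elimination step can be sketched numerically for a fixed matching. The code below is a minimal illustration under several assumptions: the energy is taken to be $\sum_{i,j} V_{ij}|x_i - y_j|^2 + \nu \sum_j |y_j - y_{j+1}|^2$ on a closed tour, the cities are hypothetical one-dimensional points, $V$ is a permutation matrix encoded by the list `perm` (city `perm[j]` is matched to $y_j$), and the linear stationarity conditions are solved by Gauss-Seidel sweeps purely to keep the example self-contained. The names `solve_tour_points` and `tour_energy` are not from the paper.

```python
def solve_tour_points(x, perm, nu, sweeps=500):
    """Minimize the quadratic energy over y for a fixed matching by solving
    (1 + 2*nu) * y_j - nu * (y_{j-1} + y_{j+1}) = x[perm[j]] on a closed tour."""
    n = len(perm)
    b = [x[perm[j]] for j in range(n)]   # sum_i V_ij x_i for a permutation matrix
    y = list(b)                          # start at the matched cities
    for _ in range(sweeps):
        for j in range(n):               # Gauss-Seidel; system is diagonally dominant
            y[j] = (b[j] + nu * (y[j - 1] + y[(j + 1) % n])) / (1.0 + 2.0 * nu)
    return y

def tour_energy(x, perm, y, nu):
    """Matching term plus nu times the squared length of the closed tour."""
    n = len(perm)
    match = sum((x[perm[j]] - y[j]) ** 2 for j in range(n))
    length = sum((y[j] - y[(j + 1) % n]) ** 2 for j in range(n))
    return match + nu * length

x = [0.0, 2.0, 1.0, 3.0]   # hypothetical 1-D cities
perm = [0, 2, 1, 3]        # visiting order encoded by the matching
nu = 0.5

y = solve_tour_points(x, perm, nu)
print(y, tour_energy(x, perm, y, nu))
```

Substituting the solved $y$ back gives an energy that depends on the matching alone, which is the quantity the rest of this subsection approximates for small $\nu$.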
Minimizing the energy 4.1 with respect to the {yj}s gives a set of equations
for each j . We use the constraint that there is a unique match to set xi &yj = yj. This gives (4.9)
(4.10) where the Aij = 26, obtain
-
S2,,+l -
yz = C{&zk+ vAZk)-l k
C WQ
We can now invert the matrix to (4.11)
1
We now make the important approximation that if v is small then
{L+ ~ A , k } - lM
{& - v A , ~ }
(4.12)
We can see from the form of the matrix A,, that this corresponds to assuming only nearest neighbor interactions on the tour (the full expansion in 4.11 would involve higher powers of A,, and hence would introduce interactions over the entire tour). Assuming the approximation 4.12 we can now use 4.11 to substitute for yz into the energy function 4.1. After some tedious algebra we obtain
(4.13) where djk = Ix, - XkI is the distance between the jth and kth cities. Equation 4.13 is the basis for the Hopfield and Tank (1985) energy function (although they use d 3 k instead of d$). To impose the constraints on admissible tours they incorporate additional terms B C ,Cj,k, V,,V,k+ B C , C,,k, f k ~ , V ~ , + C V,, { ~- ,N, } 2 into the energy function. These terms will make inadmissible tours energetically unfavorable, if the constants B and C are chosen wisely, but at the cost of introducing many local minima into the energy function. We see two drawbacks to this scheme that decrease its effectiveness by comparison with the elastic net algorithm. First, when we make the approximation 4.12 we keep only nearest neighbor interactions between the cities and second, the constraint on admissible tours is enforced by a
bias in the energy function ("soft," Simic 1990) rather than explicitly in the calculation of the partition function ("hard," Simic 1990). Both these factors contribute to increase the number of local minima of the energy function, probably accounting for the diminished effectiveness of this scheme for more than 30 cities. Hopfield and Tank attempt to minimize $E[\{V_{ij}\}]$ by an algorithm that gives them solutions to the mean field equations for the $\{V_{ij}\}$. Again, as for the winner-take-all, there are many possible solutions to the mean field equations (roughly corresponding, in the limit as $\beta \to \infty$, to the number of local minima of the energy function).

There is a simple intuitive way to see the connection between the matching energy function 4.1 and that of Hopfield and Tank in the small $\nu$ limit (Abbott, private communication). For small $\nu$, minimizing the energy will encourage each $x_i$ to be matched to one $y_j$ ($y_j \approx \sum_i V_{ij} x_i$). Substituting this in the second term gives us the Hopfield and Tank energy 4.13 with the squared distance. A similar argument suggests that if the second term is the absolute distance, rather than the distance squared, we would expect to obtain the exact Hopfield and Tank energy. This argument might be made rigorous by integrating out the $\{y_j\}$ from the energy 4.1 using the absolute distance instead of the squared distance.

5 Deformable Templates, Learning, Content Addressable Memories, and Development
We now briefly sketch other applications of these ideas to deformable templates, learning, content addressable memories, and theories of development. Computer simulations of these ideas are described elsewhere.

5.1 Deformable Templates and Matching. A deformable template (Yuille et al. 1989b) is a model described by a set of parameters so that variations in the parameter space correspond to different instances of the model. In this section we will reformulate deformable templates in terms of matching elements. Suppose we have a deformable template of features $\{y_j(a)\}$, with $j = 1, \ldots, N$, depending on a set of parameters $a$, and we want to match it to a set of features $\{x_i\}$ with $i = 1, \ldots, M$. We can define a cost function
(5.1)
where $A_{ij}$ is a compatibility matrix ($A_{ij} = 1$ if the features labeled by $i$ and $j$ are totally compatible and $A_{ij} = 0$ if they are totally incompatible), the $\{V_{ij}\}$ are binary matching elements, $\lambda$ is a constant, and $E_p(a)$ imposes prior constraints on the parameters $a$. There is nothing sacred about
using the square norm, $n = 2$, as a measure of distance in 5.1; for some applications other norms are preferable. We can now impose constraints on the possible matches by summing over the configurations of $\{V_{ij}\}$. We require that for each $j$ there exists a unique $i$ such that $V_{ij} = 1$. Calculating the sum over these configurations gives an effective energy
(5.2)
where $n$ will typically be 2. Minimizing $E_{\mathrm{eff}}[a]$ with respect to $a$ corresponds to deforming the template in parameter space to find the best match. We can extend this from features to matching curves, or regions, by defining
(5.3)
where $\phi(x)$ is a compatibility field, which could represent the edge strength, between the properties of the template and those of the image. In the limit as $n \to \infty$ and $\lambda \to 0$ this becomes the form

$$E[y(a)] = \int \phi[y(a)]\, dy + E_p(a) \tag{5.4}$$
used by Yuille et al. (1989b). A special case comes from rigidly matching two-dimensional sets of features. Here the parameters correspond to the rotation and translation of one set of the features. In other words the energy is

$$E[\{V_{ij}\}, T, a] = \sum_{i,j} A_{ij} V_{ij}\, (x_i - T y_j - a)^2 + E_p(T, a) \tag{5.5}$$
where $T$ denotes the rotation matrix and $a$ the translation. If $N = M$, $A_{ij} = 1$, and $E_p = 0$ this reduces to the well-known least-squares matching problem. We could integrate out the $\{V_{ij}\}$ again and obtain an effective energy for this problem in terms of $T$ and $a$; however, it is not obvious how to minimize this, and there seem to be other techniques based on moments that are preferable.

5.2 Learning. We will briefly sketch how some of these ideas may be applied to learning and to content addressable memories. In a paper that has attracted some attention Poggio and Girosi (1989) have argued that learning should be considered as interpolation using least-squares minimization with a smoothing function. They then note that the resulting function can be written as a linear combination of basis functions that, for suitable choices of smoothness operator, can fall
off with distance and can even be chosen to be Gaussians (Yuille and Grzywacz 1989a) (see Section 3). This gives a direct link to learning theories using radial basis functions (Powell 1987). We argue, however, that smoothness is often not desirable for learning. If the function to be learned needs to make decisions, as is usually the case, then smoothness is inappropriate. In this case the types of models we have been considering would seem preferable. Suppose we have a set of input-output pairs $x_i \to y_i$. Poggio and Girosi (1989) obtain a function that minimizes
(5.6)
where $L$ is a smoothness measure. Instead we would suggest minimizing

$$E[\{V_i\}, y] = \sum_i V_i\, \{(x - x_i)^2 + (y - y_i)^2\} \tag{5.7}$$
with respect to $\{V_i\}$ and $y$. Computing the effective energy by summing over the $\{V_i\}$s with the constraint that $V_i = 1$ for only one $i$, we obtain
(5.8)
In the limit as $\beta \to \infty$ this corresponds to finding the nearest neighbor $x_i$ to $x$ and assigning the value of $y_i$ to it. As we reduce $\beta$ the range of interaction increases and the $x_i$ further away from $x$ also influence the decision. We can also prune the number of input-output pairs in a similar manner. We pick a hypothetical set of input-output pairs $w_j \to z_j$ and minimize
The $\{V_{ij}\}$ are summed over with the assumption that for each $i$ there is only one $j$ such that $V_{ij} = 1$. This forces the hypothetical input-output pairs to match the true input-outputs as closely as possible. We obtain
(5.10)
In related work Durbin and Rumelhart (personal communication) have applied the elastic net idea (without binary decision units) to solve clustering problems (given a set of data points $\{x_i\}$, find an elastic net with a smaller number of points that best fits it) but have not extended it to learning. There may also be a connection to a recent function interpolation algorithm of Omohundro (personal communication).
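The learning rule built on energy 5.7 can be illustrated with a small sketch. It assumes that summing over the $\{V_i\}$ gives an effective energy of the form $E_{\mathrm{eff}}[y] = -\frac{1}{\beta} \log \sum_i \exp(-\beta \{(x - x_i)^2 + (y - y_i)^2\})$, which is a reconstruction from 5.7 rather than a formula reproduced in this text. Setting the derivative with respect to $y$ to zero then gives $y = \sum_i w_i y_i$ with softmax weights $w_i$, which the code iterates to a fixed point. The function name `soft_nearest_neighbor`, the scalar inputs, and the choice of starting value are illustrative assumptions.

```python
import math


def soft_nearest_neighbor(x, xs, ys, beta, iterations=200):
    """Fixed-point iteration for a stationary point of the assumed effective energy."""

    def weights(y):
        # Cost of matching the query to each stored pair; the output term is
        # left out when y is None, which is how the starting value is chosen.
        costs = [(x - xi) ** 2 + (0.0 if y is None else (y - yi) ** 2)
                 for xi, yi in zip(xs, ys)]
        lowest = min(costs)  # subtract the minimum for numerical stability
        w = [math.exp(-beta * (c - lowest)) for c in costs]
        total = sum(w)
        return [wi / total for wi in w]

    # Start from the input-only weighted average (a design choice of this sketch).
    y = sum(wi * yi for wi, yi in zip(weights(None), ys))
    for _ in range(iterations):
        y = sum(wi * yi for wi, yi in zip(weights(y), ys))
    return y
```

For large $\beta$ the weights concentrate on the stored pair whose input is closest to the query, so the returned value approaches that pair's output; for small $\beta$ more distant pairs contribute.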
5.3 Content Addressable Memories. The basic idea of a content addressable memory is to store a set of vectors $\{y^\mu\}$ so that an input vector $y$ gets mapped to the closest $y^\mu$. An easy way to do this is once again to define matching elements $V_\mu$ and minimize
(5.11)
This is very similar to the winner-take-all network discussed in Section 2. Summing over all configurations with $\sum_\mu V_\mu = 1$ and calculating the mean fields gives
(5.12)
It is straightforward to define a network having these properties. Observe that convergence is guaranteed to a true memory, unlike most content addressable memories (which have local minima). Our memory requires approximately $(N + 1)M$ connections and $N$ "neurons" to store $N$ memories consisting of real valued vectors with $M$ components. By contrast the Hopfield storage recipe requires $N^2$ connections and $N$ "neurons" to store up to $0.15N$ binary valued vectors with $N$ components (Hopfield 1982; Amit 1989). The capacity of the Hopfield network can be improved to a maximum of $2N$ memories by using the optimal storage ansatz (Cover 1965; Gardner 1988). Redundancy could be built into our proposed network by using groups of neurons to represent each memory. The destruction of individual neurons would minimally affect the effectiveness of the system.

5.4 Developmental Models. In an interesting recent paper Durbin and Mitchison (1990) propose that the spatial structure of the cortical maps in the primary visual cortex (Blasdel and Salama 1986) arises from a dimension reducing mapping $f$ from an $n$-dimensional parameter space $R^n$ (they choose $n = 4$ corresponding to retinotopic spatial position, orientation, and ocular dominance) to the two-dimensional surface of the cortex. Their model assumes that the mapping $f: R^n \to R^2$ should be as smooth as possible, thereby reducing the total amount of "wiring" in the system. This is imposed by requiring that the map minimizes the functional
(5.13)
where the sum is taken over neighboring points $x$ and $y$ in $R^n$ and $0 < p < 1$ (this range of values of $p$ is chosen to give good agreement between the resulting distribution of the parameters on the surface of the cortex).
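A wiring cost of this kind can be evaluated directly for a map defined on a lattice. The sketch below assumes the functional has the form $\sum |f(x) - f(y)|^p$ over nearest-neighbor lattice pairs, as the description above suggests; the function name `wiring_cost`, the lattice representation, and the example maps are illustrative assumptions, not taken from Durbin and Mitchison.

```python
import math
from itertools import product


def wiring_cost(f, shape, p):
    """Sum of |f(x) - f(y)|**p over nearest-neighbor pairs of a rectangular lattice.

    f maps a lattice point (a tuple of integers) to a tuple of coordinates;
    shape gives the number of lattice points along each axis.
    """
    total = 0.0
    for point in product(*(range(s) for s in shape)):
        for axis in range(len(shape)):
            if point[axis] + 1 < shape[axis]:
                neighbor = point[:axis] + (point[axis] + 1,) + point[axis + 1:]
                total += math.dist(f(point), f(neighbor)) ** p
    return total


def row_major(point):
    """Example map from a 3 x 3 lattice to a line, reading the lattice row by row."""
    i, j = point
    return (3.0 * i + j,)
```

With $0 < p < 1$ the cost grows sublinearly in the distance between the images of neighbors, so a few long jumps are penalized less than many medium ones. Comparing `wiring_cost` for different candidate maps, such as `row_major`, shows how the exponent shapes the preferred layout.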
Durbin and Mitchison propose an elastic network algorithm analogous to the elastic net for the TSP (Durbin and Willshaw 1987). It is not clear, however, that their elastic net minimizes the energy function 5.13. We now use statistical techniques to relate the energy function 5.13 directly to an elastic net algorithm. This is similar, but not identical, to that used by Durbin and Mitchison. Like them we will illustrate the result on a mapping from a square lattice in $R^2$ to a lattice in $R$ (it is straightforward to generalize the method to higher dimensions). The square in $R^2$ consists of a set of points $(x_i, y_j)$ for $i = 1, \ldots, N$ and $j = 1, \ldots, N$. The lattice in $R$ has points $z_a$ for $a = 1, \ldots, M$. We once again define binary matching elements $V_{ija}$ such that $V_{ija} = 1$ if the point $(x_i, y_j)$ corresponds to $z_a$. We can now define a cost function
(5.14)
where $E_p[f(x, y)]$ imposes smoothness requirements on the mapping $f(x, y)$ and $\lambda$ is a constant. A typical form would be

$$E_p[f(x, y)] = \sum_{i,j} \{ |f(x_{i+1}, y_j) - f(x_i, y_j)|^p + |f(x_i, y_{j+1}) - f(x_i, y_j)|^p \} \tag{5.15}$$

It is straightforward to see that as $n \to \infty$ (and $\lambda \to 0$) the energy function $E[V_{ija}, f(x, y)]$ approaches that of Durbin and Mitchison (1990) (the function $f$ will map lattice points into lattice points and will satisfy a similar smoothness measure). Once again we obtain an effective energy function for $f(x, y)$ that will give rise to an elastic net algorithm (by steepest descent). The situation is exactly analogous to the long-range motion theory described in Section 3. We should probably impose the uniqueness matching condition in both directions (see Section 3), although it might be sufficient to require that each point in the lattice in $R$ receives at least one input from the lattice in $R^2$. Performing similar calculations to Section 3 gives

$$\cdots + |f(x_i, y_{j+1}) - f(x_i, y_j)|^p \tag{5.16}$$
The algorithm proceeds by minimizing $E_{\mathrm{eff}}[f(x, y)]$ with respect to its values $f(x_i, y_j)$ on the lattice [clearly a continuum limit can be attained
by imposing a differential smoothing constraint on $f(x, y)$ and minimizing with respect to $f(x, y)$, as for the velocity field in Section 3]. The parameter $\beta$ can be varied to allow a coarse-to-fine approach. We have not simulated this algorithm so we cannot compare its performance to that of Durbin and Mitchison (1990). But the above analysis has shown the theoretical relation between their optimality criterion 5.13 and their mechanism (elastic nets) for satisfying it.

6 Conclusion
We have described a general method for modeling many matching problems in vision and neural network theory in terms of elastic networks and deformable templates with binary matching fields. Our models are typically defined in terms of an energy function that contains both continuous valued fields and binary fields. A probabilistic framework can be developed using the Gibbs distribution, which leads to an interpretation in terms of Bayes' theorem. Eliminating the binary field by partially computing the partition function, or by computing the marginal probability distributions of the continuous field, reduces our models to the standard elastic network form. Alternatively, eliminating the continuous variables often reduces our models to previous theories defined in terms of the binary fields only. Eliminating the binary fields is usually preferable since it allows us to impose global constraints on these fields. By contrast, when we eliminate the continuous variables, global constraints must be imposed by energy biases, which leads to unnecessary local minima in the energy function, thereby making such theories less effective (unless global constraints are imposed by the methods used by Peterson and Soderberg 1989). It is good to put as many constraints as possible into the computation of the partition function, but too many constraints may make this too complicated to evaluate (as for the TSP). However, good choices of the prior constraints on the continuous fields enable us to impose some constraints by energy biases. It is interesting to consider biological mechanisms that could implement these algorithms and, in particular, allow for learning and content addressable memories. Preliminary work suggests that biologically plausible implementations using thresholds are possible, provided that the thresholds vary during learning or memory storage.
This suggests an alternative to the standard theory in which learning and memory storage are achieved by changes in synaptic strength. Interestingly, a biologically plausible mechanism for altering thresholds for classical conditioning has recently been proposed by Tesauro (1988). Finally it would be interesting to compare the approach described in this paper to the work of Brockett (1988, 1990), which also uses analog systems for solving combinatorial optimization problems. His approach
involves embedding the discrete group of permutations into a Lie group. The solution can then be thought of as the optimal permutation of the initial data, and can be found by steepest descent in the Lie group space. This technique has been successfully applied to list sorting and other combinatorial problems.
Acknowledgments
I would like to thank the Brown, Harvard, and M.I.T. Center for Intelligent Control Systems for a United States Army Research Office grant number DAAL03-86-C-0171 and an exceptionally stimulating working environment. I would also like to thank Roger Brockett for his support and D. Abbott, R. Brockett, J. Clark, R. Durbin, D. Geiger, N. Grzywacz, D. Mumford, S. Ullman, and J. Wyatt for helpful conversations.
References
Anstis, S. M., and Ramachandran, V. S. 1987. Visual inertia in apparent motion. Vision Res. 27, 755-764.
Amit, D. J. 1989. Modeling Brain Function. Cambridge University Press, Cambridge, England.
Bayes, T. 1783. An essay towards solving a problem in the doctrine of chances. Phil. Trans. Roy. Soc. 53, 370-418.
Bertero, M., Poggio, T., and Torre, V. 1987. Regularization of ill-posed problems. A.I. Memo 924. MIT A.I. Lab., Cambridge, MA.
Blake, A. 1983. The least disturbance principle and weak constraints. Pattern Recognition Lett. 1, 393-399.
Blake, A. 1989. Comparison of the efficiency of deterministic and stochastic algorithms for visual reconstruction. PAMI 11(1), 2-12.
Blasdel, G. G., and Salama, G. 1986. Voltage-sensitive dyes reveal a modular organization in monkey striate cortex. Nature (London) 321, 579-585.
Brockett, R. W. 1988. Dynamical systems which sort lists, diagonalize matrices and solve linear programming problems. Proceedings of the 1988 IEEE Conference on Decision and Control. IEEE Computer Society Press, Washington, D.C.
Brockett, R. W. 1990. Least squares matching problems. J. Linear Algebra Appl. In press.
Burr, D. J. 1981a. A dynamic model for image registration. Comput. Graphics Image Process. 15, 102-112.
Burr, D. J. 1981b. Elastic matching of line drawings. IEEE Trans. Pattern Anal. Machine Intelligence PAMI 3(6), 708-713.
Clark, J. J., and Yuille, A. L. 1990. Data Fusion for Sensory Information Processing Systems. Kluwer Academic Press.
Cover, T. M. 1965. Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Trans. Electron. Comput. EC-14, 326.
Durbin, R., and Mitchison, G. 1990. A dimension reduction framework for cortical maps. Nature (London). In press.
Durbin, R., and Willshaw, D. 1987. An analog approach to the travelling salesman problem using an elastic net method. Nature (London) 326, 689-691.
Durbin, R., Szeliski, R., and Yuille, A. L. 1989. An analysis of the elastic net approach to the travelling salesman problem. Neural Comp. 1, 348-358.
Fischler, M. A., and Elschlager, R. A. 1973. The representation and matching of pictorial structures. IEEE Trans. Comput. 22, 1.
Gardner, E. 1988. The space of interactions in neural network models. J. Phys. A: Math. Gen. 21, 257-270.
Geiger, D., and Girosi, F. 1989. Parallel and deterministic algorithms from MRFs: Integration and surface reconstruction. Artificial Intelligence Laboratory Memo 1224. Cambridge, M.I.T.
Geiger, D., and Yuille, A. 1989. A Common Framework for Image Segmentation. Harvard Robotics Laboratory Tech. Rep. No. 89-7.
Gelb, A. 1974. Applied Optimal Estimation. MIT Press, Cambridge, MA.
Geman, S., and Geman, D. 1984. Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. IEEE Trans. PAMI 6, 721-741.
Grzywacz, N. M. 1987. Interactions between minimal mapping and inertia in long-range apparent motion. Invest. Ophthalmol. Vision 28, 300.
Grzywacz, N. M. 1990. The effects of past motion information on long-range apparent motion. In preparation.
Grzywacz, N. M., Smith, J. A., and Yuille, A. L. 1989. A computational theory for the perception of inertial motion. Proceedings IEEE Workshop on Visual Motion. Irvine.
Hinton, G. E. 1989. Deterministic Boltzmann learning performs steepest descent in weight-space. Neural Comp. 1, 143-150.
Hinton, G. E., and Sejnowski, T. J. 1986. Learning and relearning in Boltzmann machines. In Parallel Distributed Processing, Vol. 1, D. E. Rumelhart and J. L. McClelland, eds. MIT Press, Cambridge, MA.
Hopfield, J. J. 1982. Neural networks and physical systems with emergent collective computational abilities. Proc. Natl. Acad. Sci. U.S.A. 79, 2554-2558.
Hopfield, J. J. 1984. Neurons with graded response have collective computational properties like those of a two-state neuron. Proc. Natl. Acad. Sci. U.S.A. 81, 3088-3092.
Hopfield, J. J., and Tank, D. W. 1985. Neural computation of decisions in optimization problems. Biol. Cybernet. 52, 141-152.
Horn, B. K. P. 1986. Robot Vision. MIT Press, Cambridge, MA.
Kass, M., Witkin, A., and Terzopoulos, D. 1987. Snakes: Active contour models. In Proceedings of the First International Conference on Computer Vision, London. IEEE Computer Society Press, Washington, D.C.
Kienker, P. K., Sejnowski, T. J., Hinton, G. E., and Schumacher, L. E. 1986. Separating figure from ground with a parallel network. Perception 15, 197-216.
Kirkpatrick, S., Gelatt, C. D., Jr., and Vecchi, M. P. 1983. Optimization by simulated annealing. Science 220, 671-680.
Lipson, P., Yuille, A. L., O'Keefe, D., Cavanaugh, J., Taaffe, J., and Rosenthal, D. 1989. Deformable Templates for Feature Extraction from Medical Images. Harvard Robotics Laboratory Tech. Rep. 89-14.
Marr, D. 1982. Vision. Freeman, San Francisco.
Marroquin, J. 1987. Deterministic Bayesian estimation of Markovian random fields with applications to computational vision. In Proceedings of the First International Conference on Computer Vision, London. IEEE Computer Society Press, Washington, D.C.
Metropolis, N., Rosenbluth, A., Rosenbluth, M., Teller, A., and Teller, E. 1953. Equation of state calculations by fast computing machines. J. Phys. Chem. 21, 1087-1091.
Parisi, G. 1988. Statistical Field Theory. Addison-Wesley, Reading, MA.
Pentland, A. 1987. Recognition by parts. In Proceedings of the First International Conference on Computer Vision, London. IEEE Computer Society Press, Washington, D.C.
Peterson, C. 1990. Parallel distributed approaches to combinatorial optimization problems - benchmark studies on TSP. LU TP 90-2. Manuscript submitted for publication.
Peterson, C., and Anderson, J. R. 1987. A mean field theory learning algorithm for neural networks. Complex Syst. 1, 995-1019.
Peterson, C., and Soderberg, B. 1989. A new method for mapping optimization problems onto neural networks. Int. J. Neural Syst. 1(1), 3-22.
Poggio, T., and Girosi, F. 1989. A theory of networks for approximation and learning. A.I. Memo 1140. M.I.T.
Powell, M. J. D. 1987. Radial basis functions for multivariate interpolation. In Algorithms for Approximation, J. C. Mason and M. G. Cox, eds. Clarendon Press, Oxford.
Ramachandran, V. S., and Anstis, S. M. 1983. Extrapolation of motion path in human visual perception. Vision Res. 23, 83-85.
Simic, P. 1990. Statistical mechanics as the underlying theory of "elastic" and "neural" optimization. NETWORK: Comp. Neural Syst. 1(1), 89-103.
Terzopoulos, D., Witkin, A., and Kass, M. 1987. Symmetry-seeking models for 3D object recognition. In Proceedings of the First International Conference on Computer Vision, London. IEEE Computer Society Press, Washington, D.C.
Tesauro, G. 1988. A plausible neural circuit for classical conditioning without synaptic plasticity. Proc. Natl. Acad. Sci. U.S.A. 85, 2830-2833.
Ullman, S. 1979. The Interpretation of Visual Motion. MIT Press, Cambridge, MA.
Wilson, G. V., and Pawley, G. S. 1988. On the stability of the travelling salesman problem algorithm of Hopfield and Tank. Biol. Cybernet. 58, 63-70.
Yuille, A. L. 1989. Energy functions for early vision and analog networks. Biol. Cybernet. 61, 115-123.
Yuille, A. L., and Grzywacz, N. M. 1988. A computational theory for the perception of coherent visual motion. Nature (London) 333, 71-74.
Yuille, A. L., and Grzywacz, N. M. 1989a. A mathematical analysis of the motion coherence theory. Intl. J. Comput. Vision 3, 155-175.
Yuille, A. L., and Grzywacz, N. M. 1989b. A winner-take-all mechanism based on presynaptic inhibition feedback. Neural Comp. 1, 334-347.
Yuille, A. L., Geiger, D., and Bülthoff, H. H. 1989a. Stereo Integration, Mean Field and Psychophysics. Harvard Robotics Laboratory Tech. Rep. 89-10.
Yuille, A. L., Cohen, D. S., and Hallinan, P. W. 1989b. Feature extraction from faces using deformable templates. Proceedings of Computer Vision and Pattern Recognition, San Diego. IEEE Computer Society Press, Washington, D.C.
Received 8 January 1990; accepted 1 February 1990.
Communicated by Nabil H. Farhat
An Optoelectronic Architecture for Multilayer Learning in a Single Photorefractive Crystal
Carsten Peterson*, Stephen Redfield, James D. Keeler, Eric Hartman
Microelectronics and Computer Technology Corporation, 3500 West Balcones Center Drive, Austin, TX 78759-6509 USA
We propose a simple architecture for implementing supervised neural network models optically with photorefractive technology. The architecture is very versatile: a wide range of supervised learning algorithms can be implemented including mean-field-theory, backpropagation, and Kanerva-style networks. Our architecture is based on a single crystal with spatial multiplexing rather than the more commonly used angular multiplexing. It handles hidden units and places no restrictions on connectivity. Associated with spatial multiplexing are certain physical phenomena, rescattering and beam depletion, which tend to degrade the matrix multiplications. Detailed simulations including beam absorption and grating decay show that the supervised learning algorithms (slightly modified) compensate for these degradations.

Most network models are based on "neurons," $V_i$, interconnected with synaptic strengths $T_{ij}$, and a local updating rule:
$$V_i = g\Big(\sum_j T_{ij} V_j\Big) \tag{1.1}$$

where $g(x)$ is a nonlinear gain function. The majority of neural network investigations are performed with software simulations, either with serial or parallel architectures. Many applications would benefit tremendously from custom-made hardware that would facilitate real-time operation. Neural network algorithms require a large number of connections (the matrix elements $T_{ij}$ of equation 1.1). Optics offers the advantage of making these connections with light beams. Several architectural proposals for optical implementations now exist (Psaltis and Farhat 1985; Soffer et al. 1986; Psaltis et al. 1982). Most
*Present address: Department of Theoretical Physics, University of Lund, Solvegatan 14A, S-22362 Lund, Sweden.
Neural Computation 2, 25-34 (1990)
© 1990 Massachusetts Institute of Technology
deal with Hebbian learning (no hidden units) using either spatial light modulators (SLMs) or photorefractive crystals. The latter technology, in which the $T_{ij}$-elements are represented by holographic gratings, is particularly well suited for neural network implementations. The gratings decay naturally, and this can be exploited as a beneficial adaptive process. In Psaltis et al. (1982), a multilayer architecture of several photorefractive crystals was designed to implement the backpropagation algorithm with hidden units. In this letter, we present a single crystal architecture that is versatile enough to host a class of supervised learning algorithms, all of which handle hidden units. In contrast to other approaches we use spatial multiplexing (Anderson 1987; Peterson and Redfield 1988) rather than angular multiplexing of the interconnects. This spatial multiplexing implies rescattering and beam depletion effects for large grating strengths and large numbers of neurons. We demonstrate, at the simulation level, how the supervised learning models we consider implicitly take these effects into account by making appropriate adjustments.

Photorefractive materials can be used as dynamic storage media for holographic recordings. A recording takes place as follows: With the object (1) and reference (2) beam amplitudes defined as
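$$E_1 = A_1 e^{-i\mathbf{k}_1 \cdot \mathbf{r}}, \qquad E_2 = A_2 e^{-i\mathbf{k}_2 \cdot \mathbf{r}} \tag{1.2}$$

the intensity pattern of the two-wave mixing takes the form¹

$$I = I_1 + I_2 + 2\sqrt{I_1 I_2}\, \cos[(\mathbf{k}_1 - \mathbf{k}_2) \cdot \mathbf{r}] \tag{1.3}$$

where intensities $I_i = |E_i|^2$ have been introduced. The refractive index is locally changed in the photorefractive material proportional to this periodic intensity pattern. The so-called grating efficiency $\eta$ is to a good approximation proportional to the incoming beam intensities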
Consider a crystal where grating strengths $\eta_{ij}$ have been created with the interference of equation 1.4. Let a set of light beams $E_i = A_i e^{-i\mathbf{k}_i \cdot \mathbf{r}}$ impinge on these gratings. The outgoing electric fields can be written as

$$E_j^{\mathrm{out}} = A_j e^{-i\mathbf{k}_j \cdot \mathbf{r}} = \sum_i \eta_{ij}^{1/2} e^{i\mathbf{k}_{ij} \cdot \mathbf{r}} A_i e^{-i\mathbf{k}_i \cdot \mathbf{r}} = \sum_i \eta_{ij}^{1/2} A_i e^{-i\mathbf{k}_j \cdot \mathbf{r}} \tag{1.6}$$

where $\mathbf{k}_{ij} = \mathbf{k}_i - \mathbf{k}_j$. That is, a matrix multiplication of amplitudes

¹Neglecting a constant phase shift due to the relative phases of the beams.
$$A_j = \sum_i \eta_{ij}^{1/2} A_i \tag{1.7}$$

is performed by the photorefractive medium. Thus, identifying the amplitudes $A_i$ with the neuronic values $V_i$, and $\eta_{ij}^{1/2}$ with the connection strengths $T_{ij}$, the matrix multiplication of equation 1.1 can be implemented. Correspondingly, equation 1.5 implements a Hebb-like learning rule. The reconstruction, or readout, process is partially destructive. The efficiency decays exponentially with the readout duration, for a given read energy density. In the past this grating decay has been a problem because the use of the neural network would gradually fade the recordings. A technique has recently been discovered for controlling this destruction rate by choosing appropriate applied fields and polarizations of the object and reference beams³ (Redfield and Hesselink 1988). Equation 1.7 can be implemented either with angular (Psaltis et al. 1982) or spatial multiplexing (Anderson 1987; Peterson and Redfield 1988) techniques. In Redfield and Hesselink (1988) it was observed that at most 10-20 gratings can be stored at the same localized region with reasonable recall properties using angular multiplexing. For this reason we have chosen the spatial multiplexing approach, which corresponds to direct imaging (see Fig. 4). With direct imaging, the intensities from the incoming plane of pixels become depleted and rescattered when passing through the crystal, causing the effective connection strengths to differ from the actual matrix elements $T_{ij}$ (see Fig. 1). For relatively small systems and grating ($\eta^{1/2}$) sizes these effects are negligible; in this case the identification of $T_{ij}$ with $\eta_{ij}^{1/2}$ is approximately valid. However, these effects are likely to pose a problem for realistic applications with large numbers of neurons.
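Beam depletion on its own can be illustrated with a toy model. The sketch below follows a single beam through a sequence of grating sites, assuming that each site diffracts a fraction $\eta$ of the arriving intensity and transmits the remaining fraction $(1 - \eta)$; rescattering is ignored. The function name `depleted_outputs` and the single-beam geometry are simplifying assumptions of this sketch, not the simulation used by the authors.

```python
def depleted_outputs(input_intensity, efficiencies):
    """Intensity diffracted at each successive grating site along one beam.

    Returns the list of diffracted intensities and the intensity that is
    still left in the beam after the last site.
    """
    remaining = input_intensity
    outputs = []
    for eta in efficiencies:
        outputs.append(eta * remaining)  # fraction eta is diffracted out
        remaining *= 1.0 - eta           # fraction (1 - eta) travels on
    return outputs, remaining
```

Without depletion every site with the same $\eta$ would deliver the same output. In this toy model the output at site $k$ is reduced by the factor $\prod_{m<k}(1 - \eta_m)$, so gratings deep in the crystal act as weaker connections than their nominal strength suggests.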
We estimate the rescattering and beam depletion effects on $I_j^{\mathrm{out}} = A_j^2$ (equation 1.7) by explicit simulation of the reflection [coefficient $\eta_{ij}$] and transmission [coefficient $(1 - \eta_{ij})$] of the intensity arriving at each grating.⁴ In Figure 1 we show the emergent light given random $\eta_{ij}^{1/2}$ values on the interval [0, 0.1]. The data represent an average over different random input patterns of intensity strengths in the interval [0, 1]. The major effects are in the "end" of the crystal. Clearly, if the $\eta_{ij}^{1/2}$ were set according to a pure (unsupervised) Hebbian rule, a large network would produce incorrect answers due to the depletion and rescattering effects. We demonstrate below how to overcome this problem with supervised learning algorithms together with a temperature gradient procedure for the output amplifiers. In our architecture, the input and output neurons are planes of $n^2$ pixels and the connection matrix is a volume that can map at most $n^3$
connections. If we want full connectivity we are thus short one dimension (Psaltis et al. 1982). The volume can serve only $n^{3/2}$ neurons with full connectivity so we need an $n^{3/2} \to n^{3/2}$ mapping between the SLMs. There are many ways to accomplish this. We have chosen to use multiple pixels per neuron. The input plane is organized as follows: each of the $n$ rows contains neurons replicated $\sqrt{n}$ times. In the output plane, each row contains $\sqrt{n}$ replicas of the sequence $i, i + 1, \ldots, i + \sqrt{n} - 1$, where $i = 1$ for the first row, $i = 1 + \sqrt{n}$ for the second row, $i = 1 + 2\sqrt{n}$ for the third row, etc. By deliberately omitting selected elements, architectures with limited (e.g., layered) connectivity can be obtained.

Figure 1: Emergent light ($I_i^{\mathrm{out}}$) as a function of distance (number of grating sites) into the crystal ($i$) for random $T_{ij}$ values and input intensities (see text). [Plot axes: emergent light, 0.20 to 0.55, versus distance into crystal, 0 to 500.]

³This technique yields a recording half-life of O(10 hrs) for continuous readout.
⁴It is sufficient for our purposes to investigate this effect on a slice of the volume.

We begin by describing how to implement the mean-field-theory (MFT) learning algorithm (Peterson and Hartman 1988). We then deal with feedforward algorithms. The MFT algorithm proceeds in two phases followed by a weight update:
1. Clamped phase. The visible units are clamped to the patterns to be learned. The remaining units then settle (after repeated updating) according to

$$V_i = g\Big(\frac{1}{T} \sum_j T_{ij} V_j\Big) \tag{1.8}$$

where the "temperature" $T$ sets the strength of the gain of the gain function $g(x) = \frac{1}{2}[1 + \tanh(x)]$.

2. Free phase. The input units alone are clamped and the remaining units settle according to

$$V_i' = g\Big(\frac{1}{T} \sum_j T_{ij} V_j'\Big) \tag{1.9}$$
In each of these phases, the units settle with a decreasing sequence of temperatures, $T_0 > T_1 > \ldots > T_{\mathrm{final}}$. This process is called simulated annealing (Rumelhart and McClelland 1986). Equations 1.8 and 1.9 are normally updated asynchronously. We have checked that convergence is also reached with synchronous updating, which is more natural for an optical implementation. After the two phases, updating (or learning) takes place with the rule
$$\Delta T_{ij} = \beta\, (V_i V_j - V_i' V_j') \tag{1.10}$$
where $\beta$ is the learning rate. As it stands, equation 1.10 requires storing the solutions to equations 1.8 and 1.9 and subtracting, neither of which is natural in an optical environment. There are no fundamental obstacles, however, to doing intermediate updating. That is, we update with
$$\Delta T_{ij} = \beta\, V_i V_j \tag{1.11}$$
after the clamped phase and
$$\Delta T_{ij} = -\beta\, V_i' V_j' \tag{1.12}$$
after the free phase. We have checked performance with this modification and again find very little degradation. The grating strengths $\eta_{ij}^{1/2}$ must necessarily be positive numbers less than 1. However, most neural network algorithms require that $T_{ij}$ can take both positive and negative values, constituting a problem for both optical and electronic implementations. The most straightforward of several possible solutions (Peterson et al. 1989) to this problem is to have two sets of positive gratings, one for positive enforcement ($T_{ij}^+$) and one for negative ($T_{ij}^-$). The negative sign is then enforced electronically with a subtraction and equation 1.1 reads

$$V_i = g\Big(\sum_j T_{ij}^+ V_j - \sum_j T_{ij}^- V_j\Big) \tag{1.13}$$
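The two-phase procedure with intermediate updating and two positive weight sets can be sketched in software. The following is a minimal illustration in plain Python, assuming synchronous updating, no self-connections, and unclamped units initialized to 0.5. The function names (`gain`, `settle`, `hebb_increment`, `learn_pattern`) and these choices are assumptions of the sketch, and the physical bound that grating strengths stay below 1 is not enforced.

```python
import math


def gain(x):
    """Gain function g(x) = (1 + tanh x) / 2, with values between 0 and 1."""
    return 0.5 * (1.0 + math.tanh(x))


def settle(t_plus, t_minus, start, clamped, temperatures, sweeps=20):
    """Synchronously update the unclamped units along a decreasing temperature schedule."""
    v = list(start)
    n = len(v)
    for temp in temperatures:
        for _ in range(sweeps):
            new_v = list(v)
            for i in range(n):
                if i in clamped:
                    continue
                plus = sum(t_plus[i][j] * v[j] for j in range(n))
                minus = sum(t_minus[i][j] * v[j] for j in range(n))
                # The subtraction stands in for the electronic subtraction step.
                new_v[i] = gain((plus - minus) / temp)
            v = new_v
    return v


def hebb_increment(weights, v, rate):
    """Add rate * V_i * V_j to every off-diagonal weight (never negative)."""
    n = len(v)
    for i in range(n):
        for j in range(n):
            if i != j:  # assumption of this sketch: no self-connections
                weights[i][j] += rate * v[i] * v[j]


def learn_pattern(t_plus, t_minus, pattern, input_units, n_hidden, rate, temperatures):
    """One clamped phase, then one free phase, each followed by its own update."""
    n_visible = len(pattern)
    # Clamped phase: all visible units are fixed; only the positive set grows.
    start = list(pattern) + [0.5] * n_hidden
    v = settle(t_plus, t_minus, start, set(range(n_visible)), temperatures)
    hebb_increment(t_plus, v, rate)
    # Free phase: only the input units are fixed; only the negative set grows.
    start = [pattern[i] if i in input_units else 0.5 for i in range(n_visible)]
    start += [0.5] * n_hidden
    w = settle(t_plus, t_minus, start, set(input_units), temperatures)
    hebb_increment(t_minus, w, rate)
    return v, w
```

Because the unit values lie between 0 and 1, both increments are nonnegative, so each weight set only ever grows in this sketch. The effective weight $T_{ij}^+ - T_{ij}^-$ changes by $\beta V_i V_j$ after the clamped phase and by $-\beta V_i' V_j'$ after the free phase.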
In the modified MFT learning algorithm, the adjustment of equation 1.11 is always positive while the adjustment of equation 1.12 is always negative. So the clamped phase need only affect positive weights and the free phase need only affect negative weights. In Figure 2, generic read and write cycles for MFT are shown (only a slice of the volume is
Carsten Peterson et al.
[Figure 2 panel labels: READ CYCLE (PRODUCTION); WRITE CYCLE (LEARNING). Legend: electronic subtraction; sigmoid amplifier; p1 = phase 1; p2 = phase 2.]
Figure 2: (a) An MFT read (production) cycle. (b) An MFT write (learning) cycle.
shown). As can be seen, each connection strength is represented in the crystal by two gratings, $T_{ij}^{+}$ and $T_{ij}^{-}$. Thus $n$ neurons require $2n^2$ connections (for full connectivity). In the read (production) cycle, the reference beam is iteratively processed through the crystal until it has settled according to equation 1.8 (or equation 1.9). The write (learning) cycle (Fig. 2b) is slightly more complicated and differs between the clamped and free phases. In the clamped phase, the beam again settles according to equation 1.8. It is then replicated as two
beams, “reference” and “object”. The two beams impinge simultaneously on the crystal with the object beam hitting only the $T_{ij}^{+}$ columns. The interconnect weights are then adjusted where the beams cross. The free phase proceeds in the same way except that the object beam hits only the $T_{ij}^{-}$ columns. We now briefly discuss the implementation of feedforward models, again using a single crystal with spatial multiplexing. Optical implementations of the backpropagation (BP) algorithm have been investigated elsewhere (Psaltis et al. 1982), but these investigations have focused mainly on angular multiplexing with a multi-crystal architecture. We restrict the discussion to three-layer networks (one hidden layer) with input-to-hidden and hidden-to-output connections. For such networks, symmetric feedback output-to-hidden connections are required for BP (as they are in MFT). The neurons are arranged on the SLMs in layers. We denote the input-to-hidden weights as $T_{jk}$, hidden-to-output as $W_{ij}$, and output-to-hidden as $W_{ji} = W_{ij}$. We have developed a procedure for implementing BP in the crystal such that the updates to the $W_{ij}$ weights are exact, and the updates to the $T_{jk}$ weights are correct to first order in the learning rate $\beta$. Hence, for learning rates small enough compared with the magnitude of $W_{ij}$ and $g'(h_j)$, this procedure is faithful to the exact BP algorithm. While space does not allow a detailed description (see Peterson et al. 1989), the procedure entails a two-stage backward pass analogous to the MFT modification of equations 1.11 and 1.12: for each pattern, the set of “positive” (“negative”) gratings are written with the weight change component due to the positive (negative) term in the error expression. Other learning algorithms for feedforward networks, such as the Kanerva associative memory (Kanerva 1986) and the Marr and Albus models of the cerebellum (Keeler 1988), can be implemented in a very similar (and even simpler) manner.
We have conducted simulations of the modified MFT and BP learning algorithms on the spatially multiplexed architecture. In addition to rescattering, beam depletion, and double columns of weights, the simulations contained the following ingredients:
Temperature gradient. Beam depletion (see Fig. 1 and absorption, below) is the one effect in the crystal for which the MFT and BP algorithms were not able to automatically compensate. We found that we could counterbalance the asymmetry of the emergent light by varying the gain (or inverse temperature) across the crystal. The gain increases with depth into the crystal. Without this technique, none of our simulations were successful.
Absorption. The crystal absorbs light, both in read and write phases, exponentially with depth into the crystal.
Grating decay. During continuous exposure to illumination, the crystal gratings decay exponentially in time.
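The settling step used in both MFT phases, with synchronous updating, a decreasing temperature sequence, and a gain that may vary from unit to unit, can be sketched as below. This is a minimal sketch and not the simulation code: it assumes the usual mean-field update form $V_i = g(\sum_j T_{ij} V_j / T)$, and the function name `settle`, the `gains` parameter (a stand-in for the temperature gradient), and the number of sweeps per temperature are all hypothetical. Rescattering, absorption, and grating decay are not modeled.

```python
import math


def g(z):
    """Gain function g(z) = 1/2 [1 + tanh(z)]."""
    return 0.5 * (1.0 + math.tanh(z))


def settle(weights, units, clamped, temperatures, gains=None, sweeps=5):
    """Synchronously update the unclamped units while annealing.

    weights[i][j] is the connection from unit j to unit i, clamped is a set
    of unit indices held fixed, temperatures is a decreasing sequence, and
    gains[i] is an optional per-unit gain (all ones if omitted).
    """
    n = len(units)
    gains = gains or [1.0] * n
    v = list(units)
    for temp in temperatures:
        for _ in range(sweeps):
            # The comprehension reads only the old state, so the update
            # is synchronous.
            v = [
                v[i] if i in clamped
                else g(gains[i] * sum(weights[i][j] * v[j] for j in range(n)) / temp)
                for i in range(n)
            ]
    return v


# Illustrative three-unit network: unit 0 is clamped and drives unit 1;
# unit 2 receives no input.
weights = [[0.0, 2.0, 0.0],
           [2.0, 0.0, 0.0],
           [0.0, 0.0, 0.0]]
state = settle(weights, [1.0, 0.5, 0.5], clamped={0}, temperatures=[4.0, 2.0, 1.0])
```

In this toy case the clamped unit keeps its value, the unit with no input settles at `g(0) = 0.5`, and unit 1 ends at `g(2.0)` once the final temperature of 1 is reached.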
Simulations were performed for three different problems: random mappings of random (binary) vectors, the exclusive-or (XOR) or parity problem, and the 6 x 6 mirror symmetry problem (Peterson and Hartman 1988). Both MFT and BP successfully learned all three problems in simulations. In Figure 3 we show the results for the mirror symmetry problem using 12 hidden units and 36 training patterns. As can be seen from this figure, the supervised learning algorithms have the expected property of adjusting the weights such that the various physical effects are taken care of. The system configuration has two principal optical paths, a reference path (1) and an object path (2) (see Fig. 4). Each path has a spatial filter, a beam splitter, an SLM, and an imaging lens system. The object path ends with a CCD array. The photorefractive crystal is SBN and an argon ion laser is used as a coherent light source. Thresholding ($\frac{1}{2}(1 + \tanh(z))$)
Figure 3: Learning performance of MFT and BP on the 6 x 6 mirror symmetry problem (horizontal axis is number of training epochs).
[Plot axes: vertical ticks from 70 to 105; horizontal ticks from 0 to 100 epochs.]
Figure 4: System configuration.
[Diagram labels: CCD, SLM, neuron outputs, neuron inputs, interconnect volume, laser, path a2.]
and loading of the SLMs take place in E electronically. Both the mean-field-theory and backpropagation learning algorithm implementations have distinctive read (a1 plus a2) and write (a2 plus b) phases that use this generic architecture.
References
Anderson, D. 1987. Adaptive interconnects for optical neuromorphs: Demonstrations of a photorefractive projection operator. In Proceedings of the IEEE First International Conference on Neural Networks III, 577.
Kanerva, P. 1986. Sparse Distributed Memory. MIT Press/Bradford Books, Cambridge, MA.
Keeler, J. D. 1988. Comparison between Kanerva’s SDM and Hopfield-type neural networks. Cognitive Science 12, 299.
Peterson, C., and Hartman, E. 1988. Explorations of the mean field theory learning algorithm. Neural Networks 2, 475-494.
Peterson, C., and Redfield, S. 1988. A novel optical neural network learning algorithm. In Proceedings of the ICO Topical Meeting on Optical Computing, J. W. Goodman, P. Chavel, and G. Roblin, eds. SPIE, Bellingham, WA, 485-496.
Peterson, C., Redfield, S., Keeler, J. D., and Hartman, E. 1989. Optoelectronic implementation of multilayer neural networks in a single photorefractive crystal. MCC Tech. Rep. No. ACT-ST-146-89 (to appear in Optical Engineering).
Psaltis, D., and Farhat, N. 1985. Optical information processing based on associative memory model of neural nets with thresholding and feedback. Optics Letters 10, 98.
Psaltis, D., Wagner, K., and Brady, D. 1987. Learning in optical neurocomputers. Proceedings of the IEEE First International Conference on Neural Networks III, 549.
Wagner, K., and Psaltis, D. 1987. Multilayer optical learning networks. Applied Optics 3, 5061.
Redfield, S., and Hesselink, B. 1988. Enhanced nondestructive holographic readout in SBN. Optics Letters 13, 880.
Rumelhart, D. E., and McClelland, J. L., eds. 1986. Parallel Distributed Processing: Explorations of the Microstructure of Cognition. Vol. 1: Foundations. MIT Press, Cambridge, MA.
Soffer, B. H., Dunning, G. J., Owechko, Y., and Marom, E. 1986. Associative holographic memory with feedback using phase conjugate mirrors. Optics Letters 11, 118.
Received 18 August 1989; accepted 21 December 1989.
Communicated by Joshua Alspector and Gail Carpenter
VLSI Implementation of Neural Classifiers
Arun Rao, Mark R. Walker, Lawrence T. Clark, L. A. Akers, R. O. Grondin
Center for Solid State Electronics Research, Arizona State University, Tempe, AZ 85287-6206 USA
The embedding of neural networks in real-time systems performing classification and clustering tasks requires that models be implemented in hardware. A flexible, pipelined associative memory capable of operating in real-time is proposed as a hardware substrate for the emulation of neural fixed-radius clustering and binary classification schemes. This paper points out several important considerations in the development of hardware implementations. As a specific example, it is shown how the ART1 paradigm can be functionally emulated by the limited resolution pipelined architecture, in the absence of full parallelism.
1 Introduction
The problem of artificial pattern recognition is generally broken down into two sub-tasks, those of abstraction and classification. The abstraction task preprocesses the raw data from the sensors into a form suitable for the classifier (Lippmann 1987), which takes the converted (and perhaps compressed) data and makes a decision as to the nature of the input pattern. The classification task is related to the concept of clustering, which refers to the grouping of patterns into "clusters," which describe the statistical characteristics of the input. The resurgence of interest in neural-network-based approaches to the classification problem has generated several paradigms based (to a greater or lesser degree) on the human nervous system. All share the properties of being highly parallel and having a multitude of simple computational elements. In addition, some are also fault-tolerant. The parallel nature of neural network classifiers makes them potentially capable of high-speed performance. However, most such classifiers exist only as computer simulations and are consequently only as parallel as the machines they run on. Neural network schemes can hence be put to efficient practical use only if good hardware implementation schemes are developed.
Neural Computation 2, 35-43 (1990)
© 1990 Massachusetts Institute of Technology
2 Neural Network Classifiers
Similarities in the computing architectures of neurally inspired clustering and classification schemes suggest that a single, flexible VLSI approach would suffice for the real-time implementation of many different algorithms. In most paradigms, the input is applied in the form of a finite-length vector. The inner product of the weights and the input is formed, generating a vector of activations representing the scores of best match calculations or projections onto orthogonal basis vectors. Subsequent processing using thresholding and/or lateral inhibition may be employed to establish input membership in classes that are specified a priori by the user, or statistical clusters and orthogonal features detected by self-organizing adaptation algorithms. Feedback may be utilized to optimize existing class detectors or to add new ones. (See Lippmann 1987 for a comprehensive review.) Common neural net classifiers that operate on binary input vectors include the Hopfield net, the Hamming net, and ART1. Other classifiers are theoretically capable of accepting continuous input, but actual implementations usually represent continuous quantities in binary form, due to the ease with which Boolean logic may be applied for the calculation of most Euclidean distance metrics. ART1 differs substantially, since it provides mechanisms for developing class detectors in a self-organizing manner. If class samples are assumed normally distributed with equal variance, fixed-radius clustering of binary vectors may approach the accuracy of parametric Bayesian classifiers if a means of iteratively adjusting the initial location of cluster centers is provided (Kohonen 1988). The normal distribution assumption may be overly restrictive. Classifier algorithms employing a finite number of clusters of fixed radius will be suboptimal for non-Gaussian sample classes.
Multilayer perceptrons employing hyperplanes to form arbitrarily complex decision regions on the multidimensional input space are better suited for such situations (Huang and Lippmann 1988). The next section describes a hardware implementation of a binary classification and clustering scheme which is functionally equivalent to any binary neural-network-based scheme.
3 A Pipelined Associative Memory Classifier
A pipelined associative memory that functions as a general minimum-distance binary pattern classifier has been designed and constructed in prototype form (Clark and Grondin 1989). This function is similar to that performed by neural network classifiers. The prototype device implements clustering based on the Hamming distance between input patterns. The basic architecture, however, could support any distance metric
that is computable from a comparison between each stored exemplar and an input bit-pattern. The nucleus of the associative memory is a pipeline of identical processing elements (PEs), each of which performs the following functions:
1. Comparison of the present input to the stored exemplar.
2. Calculation of the distance based on this comparison.
3. Gating the best matching address and associated distance onward.
Each input vector travels downward through the pipeline with its associated best matching address and distance metric. The output of the pipeline is the PE address whose exemplar most closely matches the input vector and the associated distance metric. In the event of identical Hamming distances the most recent address is preserved. This, in combination with the nonaddressed writing scheme used, means that inputs are clustered with the most recently added cluster center. Writing is a nonaddressed operation. Each PE has an associated “used” bit, which is set at the time the location is written. Figure 1 is a block diagram of a single PE as implemented in the prototype (Clark 1987), and illustrates the essential components and data flow of the architecture. The write operation, which may be interspersed with compare operations without interrupting the pipeline flow, writes the first uninitialized location encountered. The operation (here a “write”), which flows down the pipeline with the data, is then toggled to a “compare.” In this manner the error condition that the pipeline is out of uninitialized locations is flagged naturally by a “write” signal exiting the bottom of the pipe. Finally, a write generates a perfect match, so that the PE chosen is indicated by the best matching location as in a compare operation. Input data are only compared with initialized locations by gating the compare operation with the “used” bit. The pipelined CAM bears little resemblance to conventional CAM architectures, where in the worst case, calculation of the best matching location could require 2N cycles for N bits. Some recent implementations of such devices are described by Wade and Sodini (1987) and Jones (1988). Blair (1987), however, describes an independently produced architecture that does bear significant resemblance to the pipelined CAM design.
Here, the basic storage element is also a circular shift-register. Additional similarities are that uninitialized locations are masked from participating in comparisons and writing is nonaddressed. The classifier consists of the pipelined content-addressable memory with appropriate external logic to provide for comparison of the distance metric output with some threshold, and a means for feeding the output back to the input.¹ If this threshold is exceeded, the input vector does
¹This threshold parameter performs a function identical to that of the “vigilance parameter” used by the ART1 network described by Carpenter and Grossberg (1987a).
Figure 1: Processing element block diagram.
[Diagram labels: this PE address, best match address, serial data, 9 bit shift, control in.]
not adequately match any of the stored exemplars and must be stored as the nucleus of a new cluster. The input vector (which constitutes one of the outputs of the pipeline along with the distance metric and the best matching PE address) is fed back. The input stream is interrupted for one cycle, while the fed-back vector is written to the first unused PE in the pipeline. This effectively constitutes an unsupervised "learning" process
that proceeds on a real-time basis during operation. It should be noted that this learning process is functionally identical to that performed by the unsupervised binary neural classifiers mentioned in Section 2 when they encode an input pattern in their weights. Figure 2 illustrates the overall classifier structure.
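The behavior described in this section (best match by Hamming distance, ties resolved in favor of the most recent address, nonaddressed writes to the first unused PE, and a threshold that triggers storage of a new cluster center) can be modeled in software roughly as follows. This is a sequential sketch of the function only, not of the pipelined hardware; bit patterns are held as Python integers, and the names `PipelineCAM`, `compare`, `write`, and `classify` are hypothetical.

```python
def hamming(a, b):
    """Number of bit positions in which two integer-coded patterns differ."""
    return bin(a ^ b).count("1")


class PipelineCAM:
    """Software model of a chain of PEs, each with an exemplar and a used bit."""

    def __init__(self, num_pes):
        self.exemplar = [0] * num_pes
        self.used = [False] * num_pes

    def compare(self, pattern):
        """Return (best address, best distance); later PEs win ties."""
        best_addr, best_dist = None, None
        for addr in range(len(self.used)):
            if not self.used[addr]:
                continue  # compare is gated by the used bit
            dist = hamming(pattern, self.exemplar[addr])
            if best_dist is None or dist <= best_dist:
                best_addr, best_dist = addr, dist
        return best_addr, best_dist

    def write(self, pattern):
        """Nonaddressed write to the first unused PE; None if the pipe is full."""
        for addr in range(len(self.used)):
            if not self.used[addr]:
                self.exemplar[addr] = pattern
                self.used[addr] = True
                return addr
        return None  # corresponds to a write exiting the bottom of the pipe


def classify(cam, pattern, threshold):
    """Return (address, distance, created) for one input vector."""
    addr, dist = cam.compare(pattern)
    if addr is None or dist > threshold:
        new_addr = cam.write(pattern)  # feed the input back as a new center
        if new_addr is None:
            raise RuntimeError("no unused processing elements left")
        return new_addr, 0, True
    return addr, dist, False


cam = PipelineCAM(num_pes=4)
first = classify(cam, 0b000000000, threshold=1)   # stored as a new center
second = classify(cam, 0b000000001, threshold=1)  # within radius of PE 0
third = classify(cam, 0b111111111, threshold=1)   # too far, new center
```

Using `<=` in `compare` is what makes the later (more recently written) PE win when two exemplars are equally distant from the input.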
Figure 2: Classifier structure.
[Diagram labels: input data vectors, input select MUX, operation, best match CAM pipeline, quality metric out, input data feedback, output vector, control.]
Classification is not performed in parallel as it would be in a true parallel realization of an ANN. However, the pipelining achieves a sort of “temporal” parallelism by performing comparison operations on many input vectors simultaneously. This operation has no real anthropomorphic equivalent, but results in extremely high throughput at the expense of an output latency period that is proportional to the length of the pipeline. Architectural enhancements and overhead processing necessary to implement specific neural classifiers are facilitated by the modular nature of the device. The removal of infrequently referenced cluster centers, for example, is accomplished by the addition of a bank of registers to the control section of Figure 2. Each register would be incremented each time its associated cluster center was referenced, and would be decremented occasionally. If a register reaches zero, its associated PE would have its “used” bit toggled off. Thus relatively unused cluster centers can be eliminated, freeing space for those more often referenced (which are hopefully more indicative of the present data). This effectively emulates the function performed by the weight decay and enhancement mechanisms of ART1. In addition, varying cluster radii would be supported by the addition of a register to each PE to control the replacement of the previous best matching address. Replacement would occur only in the case that the input was within the cluster radius indicated.
3.1 The Prototype. The prototype device was constructed using the MOSIS 3 μm design rule CMOS process. Due to testing limitations, static logic was used throughout. The prototype device consists of 16 PEs handling 9-bit input vectors. Chips can be cascaded to yield longer pipelines. Longer pipelines result in higher throughput, but are accompanied by an increase in the latency period. A photomicrograph of the prototype chip is shown in Figure 3.
The datapath logic of the device was found to be the limiting factor in the performance, and was tested up to 35 MHz. The corresponding pipeline bit throughput is 18.5 Mbits/sec. The latency was thus approximately 500 nsec/PE.
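The usage registers proposed earlier in this section for removing infrequently referenced cluster centers can be illustrated with a small software model. It is a sketch under assumptions: the class name `UsageRegisters`, the initial count given to a newly written PE, and the policy of decrementing every used register in one `decay` call are hypothetical choices, since the text says only that registers are decremented occasionally.

```python
class UsageRegisters:
    """One counter per PE; a counter reaching zero clears that PE's used bit."""

    def __init__(self, num_pes, initial=1):
        self.count = [0] * num_pes
        self.used = [False] * num_pes
        self.initial = initial  # assumed starting value for a new center

    def on_write(self, addr):
        """A new cluster center was written to this PE."""
        self.used[addr] = True
        self.count[addr] = self.initial

    def on_reference(self, addr):
        """This PE was reported as the best match for an input."""
        self.count[addr] += 1

    def decay(self):
        """Occasional decrement; free any PE whose register reaches zero."""
        freed = []
        for addr in range(len(self.count)):
            if not self.used[addr]:
                continue
            self.count[addr] -= 1
            if self.count[addr] <= 0:
                self.count[addr] = 0
                self.used[addr] = False  # PE becomes available for writing
                freed.append(addr)
        return freed


regs = UsageRegisters(num_pes=2)
regs.on_write(0)
regs.on_write(1)
regs.on_reference(0)
regs.on_reference(0)
freed_first = regs.decay()  # PE 1 was never referenced, so it is freed
```

A freed PE would then be picked up by the next nonaddressed write, which is how space is recycled for more frequently referenced centers.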
4 The Effect of Limited Parameter Resolution on Classifier Emulation
Details regarding the implementation of specific neural models using the pipelined CAM may be found in Rao et al. (1989). In this section we seek to analyze the effects resulting from the representation of continuous network quantities with discrete binary vectors. Specifically considered is the ART1 model. Adaptive Resonance Theory (ART) is a neural network-based clustering method developed by Carpenter and Grossberg (1987a,b). Its inspiration is neurobiological and its component parts are
Figure 3: Prototype chip.
intended to model a variety of hierarchical inference levels in the human brain. Neural networks based upon ART are capable of the following:
1. "Recognizing" patterns close to previously stored patterns according to some criterion.
2. Storing patterns which are not close to already stored patterns.
An analysis performed by Rao et al. (1989) shows that the number of bits of resolution required of the bottom-up weights is given by
This is a consequence of the inverse coding rule. For most practical applications, it is impossible to achieve this resolution level in hardware. This, combined with the fact that ART1 requires full connectivity between input and classification layers, makes a direct implementation of ART1 networks impossible to achieve with current technology. The most obvious method of getting around this problem is to sacrifice parallelism to facilitate implementation. If this is done, and if in addition the inverse coding rule is eliminated by using an unambiguous distance metric (Rao et al. 1989), the ART1 network reduces to a linear search best-match algorithm functionally and structurally equivalent to the associative memory described in the previous section. The pipelined associative memory thus represents an attractive state-of-the-art method of implementing ART1-like functionality in silicon until such time that technology allows the higher degree of parallelism required for direct implementation.
5 Conclusion
The pipelined associative memory described can emulate the functions of several neural network-based classifiers. It does not incorporate as much parallelism as neural models but compensates by using an efficient, well-understood pipelining technique that allows matching operations on several input vectors simultaneously. Neural classifiers are potentially capable of high speed because of inherent parallelism. However, the problems of high interconnectivity and (especially in the case of ART1) of high weight resolution preclude the possibility of direct implementation for nontrivial applications. In the absence of high connectivity, neural classifiers reduce to simple linear search classification mechanisms very similar to the associative memory chip described.
Acknowledgments
L. T. Clark was supported by the Office of Naval Research under Award N00014-85-K-0387.
References
Blair, G. M. 1987. A content addressable memory with a fault-tolerance mechanism. IEEE J. Solid-State Circuits SC-22, 614-616.
Carpenter, G. A., and Grossberg, S. 1987a. A massively parallel architecture for a self-organizing neural pattern recognition machine. Comput. Vision, Graphics Image Process. 37, 54-115.
Carpenter, G. A., and Grossberg, S. 1987b. ART2: Self-organization of stable category recognition codes for analog input patterns. Appl. Opt.: Special Issue on Neural Networks.
Clark, L. T. 1987. A novel VLSI architecture for cognitive applications. Masters Thesis, Arizona State University.
Clark, L. T., and Grondin, R. O. 1989. A pipelined associative memory implemented in VLSI. IEEE J. Solid-State Circuits 24, 28-34.
Huang, W. Y., and Lippmann, R. P. 1988. Neural net and traditional classifiers. Neural Information Processing Systems, pp. 387-396. American Institute of Physics, New York.
Jones, S. 1988. Design, selection and implementation of a content-addressable memory for a VLSI CMOS chip architecture. IEEE Proc. 135, 165-172.
Kohonen, T. 1988. Learning vector quantization. Abstracts of the First Annual INNS Meeting, p. 303. Pergamon Press, New York.
Lippmann, R. P. 1987. An introduction to computing with neural nets. IEEE ASSP Mag. 4-22.
Rao, A., Walker, M. R., Clark, L. T., and Akers, L. A. 1989. Integrated circuit emulation of ART1 networks. Proc. First IEEE Conf. Artificial Neural Networks, 37-41. Institution of Electrical Engineers, London.
Wade, J., and Sodini, C. 1987. Dynamic cross-coupled bit-line content addressable memory cell for high-density arrays. IEEE J. Solid-State Circuits SC-22, 119-121.
Received 31 March 1989; accepted 21 December 1989.
Communicated by Richard A. Andersen
Coherent Compound Motion: Corners and Nonrigid Configurations
Steven W. Zucker*, Lee Iverson, Robert A. Hummel†
Computer Vision and Robotics Laboratory, McGill Research Centre for Intelligent Machines, McGill University, Montréal, Québec H3A 2A7 Canada
Consider two wire gratings, superimposed and moving across each other. Under certain conditions the two gratings will cohere into a single, compound pattern, which will appear to be moving in another direction. Such coherent motion patterns have been studied for sinusoidal component gratings, and give rise to percepts of rigid, planar motions. In this paper we show how to construct coherent motion displays that give rise to nonuniform, nonrigid, and nonplanar percepts. Most significantly, they also can define percepts with corners. Since these patterns are more consistent with the structure of natural scenes than rigid sinusoidal gratings, they stand as interesting stimuli for both computational and physiological studies. To illustrate, our display with sharp corners (tangent discontinuities or singularities) separating regions of coherent motion suggests that smoothing does not cross tangent discontinuities, a point that argues against existing (regularization) algorithms for computing motion. This leads us to consider how singularities can be confronted directly within optical flow computations, and we conclude with two hypotheses: (1) that singularities are represented within the motion system as multiple directions at the same retinotopic location; and (2) for component gratings to cohere, they must be at the same depth from the viewer. Both hypotheses have implications for the neural computation of coherent motion.
1 Introduction
Imagine waves opening onto a beach. Although the dominant physical direction is inward, the visual impression is of strong lateral movement. This impression derives from the interaction between the crests of waves
*Fellow, Canadian Institute for Advanced Research.
†Courant Institute of Mathematical Sciences and Center for Neural Science, New York University, New York, NY, and Vrije University, Amsterdam, The Netherlands.
Neural Computation 2, 44-57 (1990)
© 1990 Massachusetts Institute of Technology
adjacent in time, and is an instance of a much more general phenomenon: whenever partially overlapping (or occluding) objects move with respect to one another, the point where their bounding contours intersect creates a singularity (Zucker and Casson 1985). Under certain conditions this singularity represents a point where the two motions can cohere into a compound percept, and therefore carries information about possible occlusion and relative movement. Another example is the motion of the point of contact between the blades of a closing scissors; the singular point moves toward the tip as the scissors are closed. The scissors example illustrates a key point about coherent motion: hold the scissors in one position and observe that it is possible to leave the singular point in two different ways, by traveling in one direction onto one blade, or in another direction onto the other blade. Differentially this corresponds to taking a limit, and intuitively leads to thinking of representing the singular point as a point at which the contour has two tangents. Such is precisely the representation we have suggested for tangent discontinuities in early vision (Zucker et al. 1989), and one of our goals in this paper is to show how it can be extended to coherent motion computations. The previous discussion was focused on two one-dimensional contours coming together, and we now extend the notions of singular points and coherent motion to two-dimensional (texture) patterns. In particular, if a ”screen” of parallel diagonal lines is superimposed onto a pattern of lines at a different orientation, then a full array of intersections (or singular points) can be created. The proviso, of course, is that the two patterns be at about the same depth; otherwise they could appear as two semitransparent sheets. 
Adelson and Movshon (1982) extended such constructions into motion, and, using sinusoidal gratings, showed that coherent compound motion can arise if one pattern is moved relative to the other. To illustrate, suppose one grid is slanted to the left of the vertical, the other to the right, and that they are moving horizontally in opposite directions. The compound motions of each singular point will then cohere into the percept of a rigid texture moving vertically. Thus the compound pattern can be analyzed in terms of its component parts. But compound motion arises in more natural situations as well, and gives rise to coherent motion that is neither rigid nor uniform. Again to illustrate, superimposed patterns often arise in two different ways in densely arbored forest habitats (e.g., Fig. 2 in Allman et al. 1985). First, consider an object (say a predator) with oriented surface markings lurking in the trees; the predator’s surface markings interact with the local orientation of the foliage to create a locus of singular points. A slight movement on either part would create compound motion at these points, which would then cohere into the predator’s image. Thus, singular points and coherent motion are useful for separating figure from ground. More complex examples arise in this same way, e.g., between nearby trees
undergoing flexible or different motions, and suggests that natural coherent motion should not be limited to that arising from rigid, planar objects; nonrigid and singular configurations should arise as well. Second, different layers of forest will interact to create textures of coherent motion under both local motion and an observer's movement; distinguishing these coherent motion displays from (planar) single textures (e.g., a wallpaper pattern) can depend on depth as well. Thus, nonrigid pattern deformations, discontinuities, and depth matter, and our first contribution in this paper is to introduce a new class of visual stimuli for exhibiting them. The stimuli build on the planar, rigid ones previously studied by Adelson and Movshon (1982), but significantly enlarge the possibilities for psychophysical, physiological, and computational studies. In particular, the perceptual salience of "corners" within them implies that algorithms for the neural computation of coherent motion require significant modification from those currently available. We propose that a multiple tangent representation, known to be sufficient to represent tangent discontinuities in curves, can be extended to handle them, and show how such ideas are consistent with the physiology of visual area MT. Finally, the interaction of motion coherence and depth is briefly considered.
2 Nonuniform Coherent Motion Displays
Compound motion displays are created from two patterns, denoted $P_1$ and $P_2$ where, for the Adelson and Movshon displays, the $P_i$ were sinusoidal gratings oriented at $\theta_1$ and $\theta_2$, respectively. Observe initially that patterns of parallel curves work as well as the sinusoidal gratings; the components can be thought of as square waves (alternating black and white stripes) oriented at different angles. Parallel random dot Moire patterns ("Glass patterns") work as well (Zucker and Iverson 1987), and we now show that patterns that are not uniformly linear also work. It is this new variation (in the orthogonal direction) that introduces nonuniformities into the coherent motion display. We consider two nonuniform patterns, one based on a sinusoidal variation, and the other triangular. As we show, these illustrate the variety of nonrigid and singular patterns that can arise naturally.
2.1 Sinusoidal Variation. The first nonuniform compound motion pattern is made by replacing one of the constant patterns with a variable one, say a grating composed of parallel sinusoids rather than lines (Fig. 1). Note that this is different from the Adelson and Movshon display, because now the sinusoidal variation is in position and not in intensity. Sliding the patterns across one another, the result is a nonconstant motion field
Figure 1: Illustration of the construction of smooth but nonuniform coherent motion patterns. The first component pattern (left) consists of a field of displaced sinusoidal curves, oriented at a positive angle (with respect to the vertical), while the second component consists of displaced parallel lines (right) oriented at a negative angle. The two patterns move across each other in opposite directions, e.g., pattern (left) is moved to the left, while pattern (right) is moved to the right. Other smooth functions could be substituted for either of these.
(Fig. 2), for which three coherent interpretations are possible (in addition to the noncoherent, transparent one):
1. Two-dimensional sliding swaths, or a flat display in which the compound motion pattern appears to be a flat, but nonrigid rubber sheet that is deforming into a series of alternating wide strips, or swaths, each of which moves up and down at what appears to be a constant rate with "elastic" interfaces between the strips. The swath either moves rapidly or slowly, depending on the orientation of the sinusoid, and the interfaces between the swaths appear to stretch in a manner resembling viscous flow. The situation here is the optical flow analog of placing edges between the "bright" and the "dark" swaths on a sinusoidal intensity grating.

2. Three-dimensional compound grating, in which the display appears to be a sinusoidally shaped staircase surface in depth on which a cross-hatched pattern has been painted. The staircase appears rigid, and the cross-hatched pattern moves uniformly back and forth across it. Or, to visualize it, imagine a rubber sheet on which two bar gratings have been painted to form a cross-hatched grating. Now, let a sinusoidally shaped set of rollers be brought in from behind, and let the rubber sheet be stretched over the rollers. The apparent motion corresponds to the sheet moving back and forth over the rollers.

3. Three-dimensional individual patterns, in which the display appears as in (2), but only with the sinusoidal component painted onto the staircase surface. The second, linear grating appears separate, as if it were projected from a different angle. To illustrate with an intuitive example, imagine a sinusoidal hill, with trees casting long, straight shadows diagonally across it. The sinusoid then appears to be rigidly attached to the hill, while the "shadows" appear to move across it.
Figure 2: Calculation of the flow fields for the patterns in Figure 1: (upper left) the normal velocity to the sinusoidal pattern; (upper right) the normal velocity to the line pattern; (bottom) the compound velocity.
Figure 3: Illustration of the construction of nonuniform coherent motion patterns with discontinuities. The first component pattern (left) consists of a field of displaced triangular curves, oriented at a positive angle, while the second component again consists of displaced parallel lines (right) oriented at a negative angle. The two patterns move across each other as before. Again, other functions involving discontinuities could be substituted for these.
2.2 Triangular Variation. Replacing the sinusoidal grating with a triangular one illustrates the emergence of percepts with discontinuities, or sharp corners. The same three percepts are possible, under the same display conditions, except the smooth patterns in depth now have abrupt changes, and the swaths in (1) have clean segmentation boundaries between them (see Figs. 3 and 4). Such discontinuity boundaries are particularly salient, and differ qualitatively from patterns with high curvatures in them (e.g., high-frequency sinusoids). The subjective impression is as if the sinusoidal patterns give rise to an elastic percept, in which the imaged object stretches and compresses according to curvature, while the triangular patterns give rise to sharp discontinuities.
2.3 Perceptual Instability. To determine which of these three possible percepts are actually seen, we implemented the above displays on a Symbolics 3650 Lisp Machine. Patterns were viewed on the console as black dots on a bright white background, with the sinusoid (or triangular wave) constructed as in Figure 1. The patterns were viewed informally by more than 10 subjects, either graduate students or visitors to the laboratory, and all reported a spontaneous shift from one percept to another. Percepts (1) and (2) seemed to be more common than (3), but individual variation was significant. The spontaneous shifts from one percept to another were qualitatively unchanged for variations in amplitude, frequency, and line width (or dot size, if the displays were made with Glass patterns) of about an order of magnitude. The shifting from one percept to another was not unlike the shifts experienced with Necker cube displays. Many subjects reported that eye movements could contribute to shifts between percepts, and also that tracking an element of one component display usually leads to a percept of two transparent sheets.
Figure 4: Calculation of the flow fields for the patterns in Figure 3: (upper left) the normal velocity to the triangular pattern; (upper right) the normal velocity to the line pattern; (bottom) the compound velocity. Velocity vectors at the singular points of the triangle component are shown with small open circles, indicating that two directions are associated with each such point. In the bottom figure, the open circles indicate what we refer to as the singular points of coherent motion, or those positions at which two compound motion vectors are attached.
3 Local Analysis of Moving Intersections
Given the existence of patterns that exhibit nonuniform compound motion, we now show how the characterization of rigid compound motion can be extended to include them. To begin, observe that one may think of compound motion displays either as raw patterns that interact, or as patterns of moving "intersections" that arise from these interactions. Concentrate now on the intersections, and imagine a pattern consisting of gratings of arbitrarily high frequency, so that the individual undulations shrink to lines. Each intersection is then defined by two lines, and the distribution of intersections is dense over the image. (Of course, in realistic situations only a discrete approximation to such dense distributions of intersections would obtain.) Now, concentrate on a typical intersection, whose motion we shall calculate. (Observe that this holds for each point in the compound image.) The equations for the lines meeting at a typical intersection $\mathbf{x} = (x, y)$ can be written

$$\mathbf{n}_1 \cdot \mathbf{x} = c_1 + v_1 t$$
$$\mathbf{n}_2 \cdot \mathbf{x} = c_2 + v_2 t$$

where $\mathbf{n}_i$, $i = 1, 2$ are the normals to the lines in the first and the second patterns, respectively, $c_i$ are their intercepts, and $v_i$ are their (normal) velocities. Observe that the simultaneous solution of these equations is equivalent to the Adelson and Movshon (1982) "intersection of constraints" algorithm (their Fig. 1). In matrix form we have

$$\begin{pmatrix} n_{11} & n_{12} \\ n_{21} & n_{22} \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} c_1 \\ c_2 \end{pmatrix} + t \begin{pmatrix} v_1 \\ v_2 \end{pmatrix}$$

We can rewrite this equation as

$$N \mathbf{x}(t) = \mathbf{c} + t \mathbf{v}$$

Differentiating both sides with respect to $t$, we obtain

$$N \dot{\mathbf{x}}(t) = \mathbf{v}$$

where $\mathbf{v} = (v_1, v_2)^T$. Thus, the velocity of the intersection of two moving lines can be obtained as the solution to a matrix equation, and is as follows (from Cramer's rule):

$$\dot{x}(t) = \frac{v_1 n_{22} - v_2 n_{12}}{\Delta}, \qquad \dot{y}(t) = \frac{n_{11} v_2 - n_{21} v_1}{\Delta}$$

where $\Delta$ is the matrix determinant, $\Delta = n_{11} n_{22} - n_{12} n_{21}$, and $\dot{\mathbf{x}}(t) = [\dot{x}(t), \dot{y}(t)]$.
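The computation above can be sketched numerically. The following is a minimal illustration, not the authors' implementation; the function name `intersection_velocity` and the example values are assumptions introduced here. It applies Cramer's rule to the matrix equation for the intersection velocity and rejects the degenerate case of parallel lines, for which the determinant vanishes.

```python
import numpy as np

def intersection_velocity(n1, n2, v1, v2):
    """Velocity of the intersection of two moving lines.

    n1, n2: normals (length-2 sequences) of the two lines.
    v1, v2: scalar normal velocities, as in n_i . x = c_i + v_i * t.
    (Function name and argument order are illustrative assumptions.)
    """
    N = np.array([n1, n2], dtype=float)            # rows are the two normals
    delta = N[0, 0] * N[1, 1] - N[0, 1] * N[1, 0]  # determinant of N
    if abs(delta) < 1e-12:
        raise ValueError("lines are parallel; intersection velocity undefined")
    # Cramer's rule applied to N @ x_dot = (v1, v2)^T
    x_dot = (v1 * N[1, 1] - v2 * N[0, 1]) / delta
    y_dot = (N[0, 0] * v2 - N[1, 0] * v1) / delta
    return np.array([x_dot, y_dot])

# Example: two line sets at +45 and -45 degrees, each with unit normal speed.
s = 1.0 / np.sqrt(2.0)
velocity = intersection_velocity([s, s], [s, -s], 1.0, 1.0)
```

For this example the two normal constraints combine into a purely horizontal intersection velocity, which agrees with a direct linear solve of the same system.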
Several special cases deserve comment. Let $(n_{11}, n_{12}) = \mathbf{n}_1$ and $(n_{21}, n_{22}) = \mathbf{n}_2$. Now, suppose that $\mathbf{n}_1$ and $\mathbf{n}_2$ are perpendicular, so that the two lines meet at a right angle. Once again, assume that the normal velocities of the two lines are $v_1$ and $v_2$, respectively. Then the velocity of the intersection is readily seen to be the vector sum of the velocities of the two lines:

$$\dot{\mathbf{x}}(t) = v_1 \mathbf{n}_1 + v_2 \mathbf{n}_2$$
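The perpendicular case follows in one step from the matrix equation. The short derivation below is a sketch that assumes the normals have unit length, which the text does not state explicitly; with that assumption the matrix of normals is orthogonal and its inverse is its transpose.

```latex
% Assumption: |n_1| = |n_2| = 1 and n_1 . n_2 = 0, so N N^T = I.
\begin{aligned}
N \dot{\mathbf{x}}(t) &= \mathbf{v}, \qquad
N = \begin{pmatrix} \mathbf{n}_1^{T} \\ \mathbf{n}_2^{T} \end{pmatrix},
\qquad N N^{T} = I \\
\dot{\mathbf{x}}(t) &= N^{-1} \mathbf{v} = N^{T} \mathbf{v}
  = \begin{pmatrix} \mathbf{n}_1 & \mathbf{n}_2 \end{pmatrix}
    \begin{pmatrix} v_1 \\ v_2 \end{pmatrix}
  = v_1 \mathbf{n}_1 + v_2 \mathbf{n}_2
\end{aligned}
```

If the normals are not of unit length, each term would pick up a factor of the inverse squared length of the corresponding normal.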
The simplest case involving a distribution of intersections is two sets of parallel lines, each set having orientations given by $\mathbf{n}_1$ and $\mathbf{n}_2$, and moving with a uniform (normal) velocity of $v_1$ and $v_2$, respectively. Then all of the intersections will have the same velocity, given by the solution $\dot{\mathbf{x}}$ to the matrix equation. As Adelson and Movshon found, the overall percept in this situation is a uniform motion of $\dot{\mathbf{x}}$. More generally, as we showed, there may be many lines and edges, oriented and moving as a function of their position. Thus, there will be many intersections moving according to the above matrix equation. If the line elements (with normals $\mathbf{n}_1$ and $\mathbf{n}_2$) are associated with objects that are themselves moving with velocities $\mathbf{v}_1$ and $\mathbf{v}_2$, then the normal velocities of the lines are obtained from the projections $v_1 = \mathbf{v}_1 \cdot \mathbf{n}_1$ and $v_2 = \mathbf{v}_2 \cdot \mathbf{n}_2$. The velocity of the intersection point then satisfies the matrix equation $N \dot{\mathbf{x}} = (v_1, v_2)^T$ at each instant $t$. Thus reliable estimates of the velocity and normal (or tangent) at each position are integral to compound motion.

4 Implications for Neural Computation

It is widely held that theories of motion need not treat discontinuities directly, and that "segmentation" is a problem separate from motion. This view has led to a rash of "regularized" algorithms with three key features: (1) smoothing is uniform and unconditional; (2) single, unique values are demanded as the solution at each point; and (3) discontinuities, if addressed at all, are relegated to an adjoint process (Bulthoff et al. 1989; Wang et al. 1989; Yuille and Grzywacz 1988). We believe that all three of these features need modification, and submit that the current demonstrations are evidence against them; if regularization-style algorithms were applied to the sine- and the triangle-wave coherent motion patterns, the smoothing would obscure the differences between them.
To properly treat these examples, algorithms must be found in which discontinuities are naturally localized and smoothing does not cross over them. Furthermore, we question whether single values should be required at each position, and submit that representations permitting multiple values at a position should be sought. Such multiple-valued representations are natural for transparency, and, as we show below, are natural for representing discontinuities (in orientation and optical flow) as well.
Before beginning, however, we must stress that there is not yet sufficient information to state precisely how the computation of compound motion is carried out physiologically, or what the precise constraints are for coherence. The analysis in the previous section represents an idealized mathematical competence, and its relationship to biology remains to be determined. Nevertheless, several observations are suggestive. First, it indicates that one need not try to implement the graphical version of the Adelson and Movshon (1982) "intersection of constraints" algorithm literally, but, now that the mathematical requirements are given, any number of different implementations become viable formally. Biologically it is likely that the computation invokes several stages, and the evidence is that initial measurements of optical flow are provided by cells whose receptive fields resemble space-time filters, tuned for possible directions of (normal) motion (Movshon et al. 1985). Abstractly the filters can be thought of as being implemented by (e.g.) Gabor functions, truncated to local regions of space-time. Such filters provide a degree of smoothing, which is useful in removing image quantization and related effects, but which also blurs across distinctions about which filter (or filters) is (are) signaling the actual motion at each point. In fact, because of their broad tuning, many are usually firing to some extent. An additional selection process is thus required to refine these confused signals, and it is in this selection process that the inappropriate regularization has been postulated to take place. To illustrate, a selection procedure for compound motion was proposed by Heeger (1988) from the observation that a translating compound grating occupies a tilted plane in the frequency domain.
(This comes from the fact that each translating sinusoidal grating occupies a pair of points along a line in spatial-frequency space; the plane is defined by two lines, one from each component grating.) After transforming the Gabor filters' responses into energy terms, Heeger's selection process reduces to fitting a plane. However, the fitting cannot be done pointwise; rather, an average is taken over a neighborhood, effectively smoothing nearby values together. This is permissible in some cases, e.g., for the planar, rigid patterns that Adelson and Movshon studied. But it will fail for the examples in this paper, rounding off the corners within the triangle waves. It cannot handle transparency either, because a single value is enforced at each point (only one plane can be fit). Other variations in this same spirit, based on Tikhonov regularization or other ad hoc (e.g., "winner-take-all") ideas, differ in the averaging that they employ, but still impose smoothness and single-valuedness on the solution (Bulthoff et al. 1989; Wang et al. 1989; Yuille and Grzywacz 1988). They cannot work in general. A different variation on the selection procedure relaxes the requirement that only a single value be assigned to each position, incorporates a highly nonlinear type of smoothing, and is designed to confront discontinuities directly. It is best introduced by analogy with orientation selection
(Zucker and Iverson 1987). Consider a static triangle wave. Zucker et al. (1988, 1989) propose that the goal of orientation selection is a coarse description of the local differential structure, through curvature, at each position. It is achieved by postulating an iterative, possibly intercolumnar, process to refine the initial orientation estimates (analogous to the initial motion measurements) by maximizing a functional that captures how the local differential estimates fit together. This is done by partitioning all possible curves into a finite number of equivalence classes, and then evaluating support for each of them independently. An important consequence of this algorithm is that, if more than one equivalence class is supported by the image data at a single point, then both enter the final representation at that point. This is precisely what happens at a tangent discontinuity, with the supported equivalence classes containing the curves leading into the discontinuity (example: a static version of the scissors example in the Introduction). Mathematically this corresponds to the Zariski tangent space (Shafarevich 1977); and physiologically the multiple values at a single point could be implemented by multiple neurons (with different preferred orientations) firing within a single orientation hypercolumn. Now, observe that this is precisely the structure obtained for the coherent motion patterns in the Introduction to this paper: singular points are defined by two orientations, each of which could give rise to a compound motion direction. Hence we propose that multiple motion direction vectors are associated with the points of discontinuity, that is, with the corners of the triangle waves, and that it is these singularities that give rise to the corners in coherent motion. Stating the point more formally, the singular points (or corners) of coherent motion derive directly from the singular points (or corners) of component motion, and for exactly the same reason.
These points are illustrated in Figure 4 (bottom) by the open circles. But for such a scheme to be tractable physiologically, we require a neural architecture capable of supporting multiple values at a single retinotopic position. The evidence supports this, since (1) compound motion may well be computed within visual area MT (Movshon et al. 1985; Rodman and Albright 1988), and (2) there is a columnar organization (around direction of motion) in MT to support multiple values (i.e., there could be multiple neurons firing within a direction-of-motion hypercolumn) (Albright et al. 1984). Before such a scheme could be viable, however, a more subtle requirement needs to be stressed. The tuning characteristic for a direction-selective neuron is typically broad, suggesting that multiple neurons would typically be firing within a hypercolumn. Therefore, exactly as in orientation selection, some nonlocal processing would be necessary to focus the firing activity, and to constrain multiple firings to singularities. In orientation selection we proposed that these nonlocal interactions be implemented as intercolumnar interactions (Zucker et al. 1989); and, again by analogy, we now suggest that these nonlocal interactions exist for
compound motion as well. That they further provide the basis for interpolation (Zucker and Iverson 1987) and for defining regions of coherence also seems likely. The triangle-wave example deserves special attention, since it provides a bridge between the orientation selection and optical flow computations. In particular, for nonsingular points on the triangle wave, there is a single orientation and a single direction-of-motion vector. Thus the compound motion computation can run normally. However, at the singular points of the triangle wave there are two orientations (call them $\mathbf{n}_a$ and $\mathbf{n}_b$); each of these defines a compound motion with the diagonal component (denoted simply $\mathbf{n}$). Thus, in mathematical terms, there are three possible ways to formulate the matrix equation, with $(\mathbf{n}_a, \mathbf{n})$, $(\mathbf{n}_b, \mathbf{n})$, and $(\mathbf{n}_a, \mathbf{n}_b)$. The solutions to the first two problems define the two compound motion vectors that define the corner, while the third combination simply gives the translation of the triangle wave at the singular point. In summary, we have:

Conjecture 1. Singularities are represented in visual area MT analogously to the way they are represented in V1; that is, via the activity of multiple neurons representing different direction-of-motion vectors at about the same (retinotopic) location.

We thus have that coherent pattern motion involves multiple data concerning orientation and direction at a single retinotopic location, but there is still a remaining question of depth. That depth likely plays a role was argued in the Introduction, but formally enters as follows. Recall that the tilted plane for rigid compound motion (e.g., in Heeger's algorithm) resulted from the combination of component gratings. But a necessary condition for physical components to belong to the same physical object is that they be at the same depth, otherwise a figure/ground or transparency configuration should obtain. MT neurons are known to be sensitive to depth, and Allman et al.
(1985) have speculated that interactions between depth and motion exist. We now refine this speculation to the conjecture that

Conjecture 2. The subpopulation of MT neurons that responds to compound motion agrees with the subpopulation that is sensitive to zero (or to equivalent) disparity.

There is some indirect evidence in support of this conjecture, in that Movshon et al. (1985) (see also Rodman and Albright 1988) have reported that only a subpopulation of MT neurons responds to compound pattern motion, and Maunsell and Van Essen (1983) have reported that a subpopulation of MT neurons is tuned to zero disparity. Perhaps these are the same subpopulations. Otherwise more complex computations relating depth and coherent motion will be required. As a final point, observe that all of the analysis of compound motion was done in terms of optical flow, or the projection of the (3-D) velocities
onto the image plane, yet two of the three possible percepts reported by subjects were three-dimensional. If there is another stage at which these 3-D percepts are synthesized, they could certainly take advantage of the notion that discontinuities are represented as multiple values at a point; each value then serves as the boundary condition for one of the surfaces meeting at the corner.
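The corner case discussed in this section, in which the two orientations meeting at a triangle-wave corner are each paired with the line grating and then with each other, can be sketched numerically. This is an illustration only: the slopes, pattern velocities, and the names `compound_velocity`, `unit_normal`, `n_up`, `n_down`, and `n_line` are hypothetical choices made here, not values from the paper.

```python
import numpy as np

def compound_velocity(n_a, n_b, u_a, u_b):
    """Solve n_a . x_dot = n_a . u_a and n_b . x_dot = n_b . u_b for x_dot.

    n_a, n_b: normals of the two line elements meeting at a point.
    u_a, u_b: velocities of the patterns that carry those line elements.
    """
    N = np.array([n_a, n_b], dtype=float)
    v = np.array([np.dot(n_a, u_a), np.dot(n_b, u_b)])  # normal velocities
    return np.linalg.solve(N, v)

def unit_normal(slope):
    """Unit normal to a line element with the given slope dy/dx."""
    n = np.array([-slope, 1.0])
    return n / np.linalg.norm(n)

# Hypothetical geometry: triangle-wave segments of slope +1 and -1 meet at a
# corner; the line grating has slope -3. The patterns slide in opposite
# horizontal directions.
n_up, n_down = unit_normal(1.0), unit_normal(-1.0)
n_line = unit_normal(-3.0)
u_tri = np.array([-1.0, 0.0])   # triangle pattern moves left
u_line = np.array([1.0, 0.0])   # line pattern moves right

corner_a = compound_velocity(n_up, n_line, u_tri, u_line)
corner_b = compound_velocity(n_down, n_line, u_tri, u_line)
translation = compound_velocity(n_up, n_down, u_tri, u_tri)
```

Under these assumed values the first two pairings give two distinct compound motion vectors attached to the same point, while the third pairing recovers the translation of the triangle pattern itself.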
Acknowledgments

This research was sponsored by NSERC Grant A4470 and by AFOSR Grant 89-0260. We thank Allan Dobbins for comments.
References

Adelson, E. H., and Movshon, J. A. 1982. Phenomenal coherence of moving visual patterns. Nature (London) 300, 523-525.
Albright, T. L., Desimone, R., and Gross, C. 1984. Columnar organization of directionally selective cells in visual area MT of the macaque. J. Neurophysiol. 51, 16-31.
Allman, J., Miezin, F., and McGuinness, E. 1985. Direction- and velocity-specific responses from beyond the classical receptive field in the middle temporal area (MT). Perception 14, 85-105.
Bulthoff, H., Little, J., and Poggio, T. 1989. A parallel algorithm for real time computation of optical flow. Nature (London) 337, 549-553.
Heeger, D. 1988. Optical flow from spatio-temporal filters. Int. J. Comput. Vision 1, 279-302.
Maunsell, J. H. R., and Van Essen, D. 1983. Functional properties of neurons in middle temporal visual area of macaque monkey. II. Binocular interactions and sensitivity to binocular disparity. J. Neurophysiol. 49, 1148-1167.
Movshon, J. A., Adelson, E. H., Gizzi, M. S., and Newsome, W. T. 1985. The analysis of moving visual patterns. In Study Group on Pattern Recognition Mechanisms, C. Chagas, R. Gattass, and C. Gross, eds. Pontificia Academia Scientiarum, Vatican City.
Rodman, H., and Albright, T. 1988. Single-unit analysis of pattern-motion selective properties in the middle temporal visual area (MT). Preprint, Dept. of Psychology, Princeton University, Princeton, NJ.
Shafarevich, I. R. 1977. Basic Algebraic Geometry. Springer-Verlag, New York.
Wang, H. T., Mathur, B., and Koch, C. 1989. Computing optical flow in the primate visual system. Neural Comp. 1, 92-103.
Yuille, A., and Grzywacz, N. 1988. The motion coherence theory. Proc. Second Int. Conf. Comput. Vision, IEEE Catalog No. 88CH2664-1, Tarpon Springs, FL, pp. 344-353.
Zucker, S. W., and Casson, Y. 1985. Sensitivity to change in early optical flow. Invest. Ophthalmol. Visual Sci. (Suppl.) 26(3), 57.
Zucker, S. W., Dobbins, A., and Iverson, L. 1989. Two stages of curve detection suggest two styles of visual computation. Neural Comp. 1, 68-81.
Zucker, S. W., and Iverson, L. 1987. From orientation selection to optical flow. Comput. Vision, Graphics, Image Process. 37, 196-220.
Zucker, S. W., David, C., Dobbins, A., and Iverson, L. 1988. The organization of curve detection: Coarse tangent fields and fine spline coverings. Proc. Second Int. Conf. Comput. Vision, IEEE Catalog No. 88CH2664-1, Tarpon Springs, FL, pp. 568-577.
Received 10 July 1989; accepted 8 January 1990.
Communicated by Gordon M. Shepherd
A Complementarity Mechanism for Enhanced Pattern Processing
James L. Adams*
Neuroscience Program, 73-346 Brain Research Institute, University of California, Los Angeles, CA 90024 USA
The parallel ON- and OFF-center signals flowing from retina to brain suggest the operation of a complementarity mechanism. This paper shows what such a mechanism can do in higher-level visual processing. In the proposed mechanism, inhibition and excitation, both feedforward, coequally compete within each hierarchical level to discriminate patterns. A computer model tests complementarity in the context of an adaptive, self-regulating system. Three other mechanisms (gain control, cooperativity, and adaptive error control) are included in the model but are described only briefly. Results from simulations show that complementarity markedly improves both speed and accuracy in pattern learning and recognition. This mechanism may serve not only vision but other types of processing in the brain as well.

1 Introduction
We know that the human genome contains a total of about 100,000 genes. Only a fraction of that total relates to vision. So few genes cannot explicitly encode the trillions of connections of the billions of neurons comprising the anatomical structures of vision. Hence, general mechanisms must guide its development and operation, but they remain largely unknown. The objective of the research reported in this paper was to deduce and test mechanisms that support stability and accuracy in visual pattern processing. The research combined analysis of prior experimental discoveries with synthesis of mechanisms based on considerations of the requirements of a balanced system. Several mechanisms were proposed. One of these, complementarity, is presented in this paper.

2 Foundations and Theory
Beginning with traditional aspects of vision models, I assume that patterns are processed hierarchically. Although this has not been conclusively demonstrated (Van Essen 1985), most experimental evidence shows selectivity for simple features at early stages of visual processing and for complex features or patterns at later stages.¹ I also assume that signals representing lower order visual features converge to yield higher order features (Hubel and Wiesel 1977). Inasmuch as the retina is a specialized extension of the brain itself, we might expect to find mechanisms common to both. Consequently, I partially base the complementarity mechanism of this paper on certain well-established findings from the retina. Retinal photoreceptors respond in proportion to the logarithm of the intensity of the light striking them (Werblin 1973). Since the difference between two logarithmic responses is mathematically identical to the logarithm of the ratio of the two stimulus intensities, relative intensities can be measured by simple summation of excitatory and inhibitory signals. The ratios of intensities reflected from different portions of an object are characteristic of the object and remain fairly constant for different ambient light intensities.² Use of this property simplifies pattern learning and recognition.

*Current address: Department of Neurobiology, Barrow Neurological Institute, 350 W. Thomas Road, Phoenix, AZ 85013-4496 USA.
Neural Computation 2, 58-70 (1990)
© 1990 Massachusetts Institute of Technology

Prediction 1. Constant ratios and overlapping feedforward. A given feature at a particular location within the visual field is accurately represented under varying stimulus intensities if (1) the neurons responding to that feature and those responding to the complement of that feature fire at the same relative rates for different levels of illumination³ and (2) the ratios of responses from different parts of an object are obtained via overlapping inhibitory and excitatory feedforward projections.

Photoreceptor cells depolarize in darkness and hyperpolarize in light, generating the strongest graded potentials in darkness (Hagins 1979).
Even so, a majority of the tonic responses to illumination transmitted from the retinal ganglion cells to the lateral geniculate nucleus represent two complementary conditions: (1) excitation by a small spot of light of a particular color and inhibition by a surround of a contrasting color or (2) inhibition by a center spot and excitation by a contrasting surround (De Monasterio and Gouras 1975). Such ON-center and OFF-center signals vary from cell to cell, most reporting contrasts in color and others reporting contrasts in light intensity. ON- and OFF-center activities are maintained separately at least as far as the visual cortex (Schiller et al. 1986).
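The argument that logarithmic responses turn an excitatory-minus-inhibitory sum into a measure of relative intensity can be checked with a few lines of arithmetic. The sketch below is illustrative only; the reflectance and illumination values and the function names `log_response` and `contrast_signal` are assumptions introduced here.

```python
import math

def log_response(intensity):
    """Idealized photoreceptor response, proportional to log intensity."""
    return math.log(intensity)

def contrast_signal(intensity_a, intensity_b):
    """Excitation driven by region a minus inhibition driven by region b."""
    return log_response(intensity_a) - log_response(intensity_b)

# Two surface patches with hypothetical reflectances 0.2 and 0.05.
# Reflected light is reflectance times illumination (see footnote 2).
reflect_a, reflect_b = 0.2, 0.05
dim = contrast_signal(reflect_a * 10.0, reflect_b * 10.0)
bright = contrast_signal(reflect_a * 1000.0, reflect_b * 1000.0)
```

Both signals equal the logarithm of the reflectance ratio, so under this idealization the difference signal is unchanged by a hundredfold change in ambient illumination.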
¹ The term "feature" is used in this paper to refer to a configuration of stimuli or signals for which one or more pattern processing neurons optimally respond.
² This follows from a law of physics that the light reflected from a surface is linearly proportional to the intensity of the incident light.
³ Some examples of low-level complementary features are dark versus bright (opposites in intensity), vertical versus horizontal lines (geometrically perpendicular), and red versus green or yellow versus blue (complementary colors).
Prediction 2. Constant sums. If a cortical neuron that responds to a stimulus at a given position within a sensory receptive field is most active for a particular stimulus, there is another neuron nearby⁴ that is least active for that same stimulus. Similarly, a stimulus provoking a high rate of discharge in the second neuron results in a low rate of discharge in the first. The sum of their activities remains approximately constant for a wide range of stimuli.

Corollary Prediction 2a. Constant sums for complementary features. With a fixed level of illumination, as a feature at a given location is transformed into the complement of that feature (e.g., rotation of a vertical edge to a horizontal position), the sum of the firing rates of the cortical neurons optimal for the original feature and those optimal for its complement remains constant. The relative levels of activity of the two shift inversely with the change in character of the stimulus, one group becoming stronger and the other becoming weaker.
Prediction 3. System balance. In order that complementary features compete equally, each feature triggers equal amounts of parallel excitatory and inhibitory feedforward activity.⁵ While some neurons are excited in response to a particular feature, others are inhibited, and vice versa. This also contributes to the overall balance of the system.⁶ In addition to an overall dynamic balance of excitatory and inhibitory activity, consider all connections, both active and inactive, per neuron. In a network with the total strengths of excitatory and inhibitory connections balanced only globally, or even locally, we could expect to find some neurons with mostly excitatory connections and others with mostly inhibitory connections. However, that is contrary both to what has been observed in studies of connections in the cerebral cortex (e.g., see Lund 1981) and to what can be functionally predicted.
Prediction 4. Neuronal connection balance. On each neuron, the feedforward input connections are balanced such that the total (active and inactive) excitatory input strength equals the total inhibitory input strength.⁷

⁴ Within the local region of cortex serving the same sensory modality and the same position in the receptive field.
⁵ A continuous balance is more easily achieved in a system of parallel feedforward excitation and inhibition than in one of temporally alternating feedforward excitation and lateral inhibition.
⁶ Ordinarily, the firing of an individual neuron has little effect on neighboring neurons not connected to it. The shift of ions across the membrane of the firing cell is shunted away from the neighboring cells by the extracellular fluid. Hypothetically, though, a problem might arise if a high percentage of neurons within a neighborhood received only excitation and some of them fired in synchrony. The shifts in potential could be sufficient to induce spurious firings of neighboring neurons before the shunting was completed. Such a problem would not occur, however, in a system in which approximately equal amounts of excitation and inhibition arrived at any given instant within every small region of neural circuitry. The parallel feedforward described in this prediction could provide the required balance.

Although the strongest experimental evidence for a complementarity mechanism has been found in the lowest processing levels of vision, there is no a priori reason to believe it will not be found in higher processing levels as well.

Prediction 5. Complementarity in successive processing levels. In each hierarchical level of pattern processing, new complementarities are formed.

One cannot expect the brain to achieve perfect complementarity at all times. Nevertheless, one can predict that self-adjusting feedback mechanisms continually restore the nervous system toward such equilibrium. Dynamically, this process can be compared with homeostasis. The agents for the competition between complementary signals in the retina are the bipolar, horizontal, and amacrine cells (Dowling 1979). In the cerebral cortex, the agents for the competition may be excitatory and inhibitory stellate interneurons, although much less is known about the functional roles of neurons in the cortex than in the retina. Stellate neurons occur throughout all layers of the cortex but are especially numerous in input layer IV (Lorente de Nó 1949). Whether or not they are the exact agents for competing complementary signals in the cortex, they are well-placed for providing parallel excitatory and inhibitory feedforward signals from the specific input projections to the output pyramidal neurons.

3 Rules for Implementing Complementarity
Mechanistically, complementarity can be achieved throughout the processing hierarchy by implementing the following rules:
1. In a processing hierarchy, let the output projections from one level of processing excite both excitatory and inhibitory interneurons that form plastic connections onto the feature-selective neurons of the next higher level.
2. On those neurons that receive plastic connections: adjust both active and inactive connections. If the net activation is excitatory, simultaneously reinforce the activated excitatory input connections and the nonactivated inhibitory input connections.⁹ Similarly, if the net activation is inhibitory, simultaneously reinforce the activated inhibitory input connections and the nonactivated excitatory input connections.
3. Maintain a fixed total strength of input plastic connections onto any given neuron. For each neuron, require half of that total strength be excitatory and half inhibitory. The requirement for fixed total input connection strengths is met by making some connections weaker as others are made stronger. Those made weaker are the ones not satisfying the criteria for reinforcement; thus, they lose strength passively.
⁷ The firing of such a neuron depends upon the instantaneous predominance of either excitation or inhibition among its active connections.
⁸ Connections whose strengths change with increasing specificity of responses.
⁹ A connection is "activated" if it has received a recent input spike and certain undefined responses have not yet decayed. Nonactivated connections respond to some feature(s) other than the one currently activating the other input connections.
4 Network for Testing the Theory
The complementarity mechanism was tested with a three-layer network of 1488 simulated neurons (see Fig. 1). This particular network accepts a 12 x 12 input array of 144 binary values. The neuron types important to the processing by the network and their prototypical connectivity are shown in Figure 2.
Each simulated neuron has a membrane area on which the positive (excitatory) and negative (inhibitory) input signals are integrated. The neurons are connected with strengths (weights) that vary with the reinforcement history of each neuron. The connection strength is one of the determinants of the amount of charge delivered with each received input spike. Another determinant is the gain, used to make the network self-regulating in its modular firing rate. That is, the gain is automatically adjusted up or down as necessary to move each module toward a certain number of output spikes generated for a given number of input spikes received. The gain tends toward lower values as a module becomes more specific in its responses.
The regulatory feedforward neurons (RFF in Fig. 2) modulate the changes in connection strengths whenever an error occurs. (An error is defined to be the firing of an output neuron in response to a combination of input signals other than that to which it has become "specialized.") This modulation shifts the weights in a manner that reduces the future likelihood of erroneous firing. The error detection and correction occur without an external teacher. That, however, is not the subject of this paper.¹⁰
A neuron fires when its membrane potential (charge per unit area) exceeds a predetermined threshold level. Upon firing, the neuron's membrane potential is reset to the resting level (each cycle of the simulation represents about 10 msec), and a new period of integration begins for that neuron.
Based on an assumption of reset action by basket cells, a modular, winner-take-all scheme resets the potentials of all neurons within a module whenever a single output neuron in that module fires.
¹⁰ For further information on the error control mechanism, see Adams (1989).
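The firing and reset behavior described above can be illustrated with a minimal sketch. It is not the original APL2 code: the class name `Module`, the threshold and resting values, and the choice of the most depolarized neuron as the one that fires are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

RESTING = 0.0    # assumed resting potential (arbitrary units)
THRESHOLD = 1.0  # assumed firing threshold (arbitrary units)


@dataclass
class Module:
    """Hypothetical module of output neurons sharing a winner-take-all reset."""
    potentials: list = field(default_factory=list)

    def step(self, charges):
        """Integrate one cycle of net input charge per neuron.

        Returns the index of the neuron that fired, or None if no neuron
        exceeded the threshold during this cycle.
        """
        # Integrate excitatory (positive) and inhibitory (negative) charge.
        for i, q in enumerate(charges):
            self.potentials[i] += q
        # ASSUMPTION: if several neurons exceed threshold, the most
        # depolarized one is treated as the neuron that fires.
        winner = max(range(len(self.potentials)),
                     key=lambda i: self.potentials[i])
        if self.potentials[winner] <= THRESHOLD:
            return None
        # Winner-take-all: one firing resets every neuron in the module.
        self.potentials = [RESTING] * len(self.potentials)
        return winner


module = Module(potentials=[0.0, 0.0, 0.0])
first = module.step([0.4, 0.2, 0.1])   # nothing reaches threshold yet
second = module.step([0.7, 0.2, 0.1])  # neuron 0 crosses threshold and fires
```

After the second call every potential in the module is back at the resting level, which is the module-wide reset the text attributes to basket cells.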
Signals are transmitted from one neuron to succeeding neurons via axons. Referring to Figure 2, each input axon to a module makes fixed-strength excitatory connections to one inhibitory stellate neuron and one excitatory stellate. It also makes a fixed-strength inhibitory connection to one RFF neuron, omitting an inhibitory interneuron for simplicity. Every stellate neuron within a module connects to every primary pyramidal (P) neuron within that module. These connections are initialized with random strengths. The strength of each can evolve within a range of zero up to a maximum of about one-quarter the total supportable input synaptic strength of the postsynaptic cell. In practice, the actual strengths that develop fall well between the two extremes. It is the balancing of these stellate to pyramidal connections that employs the complementarity mechanism.
Figure 1: Network layout. The input array (IA) accepts the binary input and generates axon spikes that project directly without fan-out or fan-in to the simulated neural elements of processing level 1 (PL1). PL1 is divided into 16 modules representing hypercolumns as defined by Hubel and Wiesel (1977). The output from PL1 converges onto a single module in processing level 2 (PL2). Each dashed line represents 9 axons (144 axons project from IA to PL1 and another 144 from PL1 to PL2). Each module of PL1 generates local feature identifications and the single module of PL2 generates pattern identifications.
Figure 2: Basic neuronal layout of levels PL1 and PL2. Legend: pyramidal neurons (primary and secondary); excitatory interneurons; inhibitory interneuron; regulatory interneuron. Only prototypical connections are shown. The IS and ES neurons connect to all primary pyramidal (P) neurons within a module. Likewise, each RFF neuron sends modulatory connections to all of the same P neurons. All AFB neurons connect to all RFF neurons within the same module, but some of these connections may have zero strength.
Each primary pyramidal cell connects to one secondary pyramidal (S) for relay of output to the next level of processing and to one adaptive feedback (AFB) neuron. An AFB neuron makes simple adaptive connections to all of the RFF neurons within the module. The strength of each AFB to RFF connection, starting from zero, slowly "adapts" to the recent level of inhibition of the RFF cell. These adaptive feedback and regulatory feedforward cells are the major components of the adaptive error control mechanism. If the amount of excitation of an RFF cell exceeds the amount of its inhibition by a threshold amount, the RFF cell generates a regulatory error correction signal that projects to all primary pyramidal neurons within the module.
Another mechanism used in the simulations is cooperativity. This mechanism restricts the primary pyramidal neurons to respond to the mutual action of multiple input connections. It also causes the initially random input connection strengths onto each neuron to approach values that reflect how much they represent particular information. Thus, in a variation of the Hebb (1949) neurophysiological postulate, frequent synaptic participation in the firing of a neuron in this model leads to an equalization of the strengths of the mutually active presynaptic connections. Although the strongest of these connections often get weaker as a result, the ensemble of reinforced mutually active connections grows stronger. This also applies to inhibitory connections onto neurons whose responses are complementary to those of the neurons receiving reinforced excitatory connections. The overall effect is to increase the accuracy of representing information.
The mechanisms of gain control, adaptive error control, and cooperativity contributed to the performance of the network simulations reported here. But for the current paper, those three mechanisms must remain in the background. For further discussion see Adams (1989). Both the control simulations and the experimental simulations in which complementarity was deactivated equally employed the other mechanisms.
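The reinforcement rules of Section 3, which the experimental simulations deactivate, can be sketched roughly as follows. This is an illustrative reading, not the author's implementation: the function name, the additive step size, and the use of rescaling to hold each half of the total fixed are assumptions (the text says only that non-reinforced connections lose strength passively). All weights are assumed positive.

```python
def complementarity_update(w_exc, w_inh, act_exc, act_inh,
                           step=0.05, total=1.0):
    """One hypothetical update of a neuron's plastic input connections.

    w_exc, w_inh     : lists of excitatory / inhibitory connection strengths
    act_exc, act_inh : lists of booleans, True where the connection is activated
    step             : ASSUMED additive reinforcement increment
    total            : fixed total input strength (half excitatory, half inhibitory)
    """
    net = (sum(w for w, a in zip(w_exc, act_exc) if a)
           - sum(w for w, a in zip(w_inh, act_inh) if a))
    if net > 0:
        # Rule 2, excitatory case: reinforce activated excitatory and
        # nonactivated inhibitory connections.
        w_exc = [w + step if a else w for w, a in zip(w_exc, act_exc)]
        w_inh = [w if a else w + step for w, a in zip(w_inh, act_inh)]
    elif net < 0:
        # Rule 2, inhibitory case: reinforce activated inhibitory and
        # nonactivated excitatory connections.
        w_inh = [w + step if a else w for w, a in zip(w_inh, act_inh)]
        w_exc = [w if a else w + step for w, a in zip(w_exc, act_exc)]
    # Rule 3: keep the totals fixed, half excitatory and half inhibitory.
    # ASSUMPTION: rescaling stands in for the passive loss of strength.
    half = total / 2.0
    s_exc, s_inh = sum(w_exc), sum(w_inh)
    w_exc = [w * half / s_exc for w in w_exc]
    w_inh = [w * half / s_inh for w in w_inh]
    return w_exc, w_inh


new_exc, new_inh = complementarity_update(
    w_exc=[0.3, 0.2], w_inh=[0.25, 0.25],
    act_exc=[True, False], act_inh=[False, True])
```

In this example the net activation is excitatory, so the activated excitatory connection and the nonactivated inhibitory connection gain strength while the other two lose it, and each half of the input still sums to 0.5.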
5 Testing Procedures
The simulation software was written in APL2, a high-level language developed by IBM and based on the theoretical work of K. E. Iverson, J. A. Brown, and T. More. This language was chosen for its power and flexibility in manipulating vectors and multidimensional arrays. Although APL2 supports speed in the development of code, it does not have a compiler. Consequently, the resulting code will not run as fast as optimized and compiled FORTRAN code, for example. The simulations were run on an IBM 3090-200VF. APL2 automatically invokes the fast vector hardware on this machine, but only a small amount of my code was able to take advantage of it due to the short vectors of parameters associated with each simulated neuron.
The network was trained with eight variations in position and/or orientation of parallel lines (Fig. 3). This resulted in the development of feature selectors in each module for lines of each of the eight variations. The network was then tested with the eight different patterns of Figure 4, some geometric and others random combinations of short line segments. During this testing, the second processing level (PL2 in Fig. 1) developed responses representing the different patterns.¹¹ Each pattern was presented until every module had fired at least 30 times since the last error (defined in the previous section). The performance of the network was measured in terms of (1) how well each pattern was distinguished from all other patterns, (2) how many simulation cycles were required to reach the stated performance criterion, and (3) the number of errors produced in the process.
Three control and three experimental simulations were run. The control simulations each began with a different set of random strength synaptic connections. The experimental simulations began with the same initial sets of connection strengths as the controls but were run with the complementarity mechanism deactivated.
¹¹ The use of line segments is not important to the model. What is significant is that feature-selective neurons in PL1 evolve to represent combinations of input activity in IA and pattern-selective neurons in PL2 evolve to represent combinations of feature-selective activity in PL1.
Figure 3: The training patterns (patterns 1 to 8). Each pattern is presented to level IA (input array) as a 12 x 12 binary array, where a 1 represents an input spike (shown here as an asterisk) and a 0 represents a nonfiring position (shown here as a blank).
6 Results
Both with and without complementarity, the overall network design incorporating gain control, cooperativity, and adaptive error control was able to achieve accurate discrimination of the patterns.¹² However, the number of simulation cycles required to attain equivalent performance without complementarity was double what it was with complementarity.¹³ The most dramatic evidence of the benefit of complementarity was the five times greater number of errors made by the network in the process of learning the patterns when complementarity was not used. This is illustrated in Figures 5 and 6.
¹² The formula used to measure discrimination is given in Adams (1989); it is omitted here since discrimination was not a problem.
Figure 4: The test patterns (patterns 9 to 16). These patterns are presented in the same manner as the training patterns, except that they are not presented until after training has been completed for level PL1 (processing level 1).
Figure 5: Errors with and without complementarity. Both sequences began with identical values of the random-valued synaptic strengths. The errors are shown in the order of pattern presentation, which was the same for both cases. Any pair of columns represents a single pattern. Thus, as much as possible, any difference between the two cases is due only to the presence or absence of the complementarity mechanism. (Plot legend: complementarity used versus disabled, random initial state; horizontal axis: first eight training sessions.)
7 Conclusions
Computer simulations show that a complementarity mechanism of synaptic reinforcement markedly improves visual pattern learning. The network learns faster and responds more accurately when each feature- or pattern-classifying neuron has an equal opportunity to be excited to greatest activity by the feature or pattern it represents and to be inhibited to least activity by a complementary feature or pattern. Based on the experimental evidence for complementarity in early visual processing and the theoretical advantages demonstrated for its use in higher level processing, I predict that future studies of the brain will reveal its presence throughout the hierarchy of visual pattern processing. Moreover, even though this research was founded on the visual system, the complementarity mechanism would seem to apply equally well to other sensory, and even motor, hierarchies in the brain.
¹³ Simulation cycle: the computer processing that occurs for each simulated instant in time. In these simulations, a cycle representing an interval of about 10 msec of real time used nearly 1 sec of the computer's Central Processing Unit (CPU) time. With complementarity, 35 min of CPU time simulating about 24 sec of real time was required for the system to learn all eight training patterns. Without complementarity 73 min of CPU time was required to reach equivalent performance. In terms of simulation cycles, the values were 2390 cycles with complementarity versus about 5000 cycles without it.
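The timing figures quoted in footnote 13 can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers given in the paper; the variable names are illustrative.

```python
CYCLE_MS = 10              # one simulation cycle represents about 10 msec

cycles_with = 2390         # cycles needed with complementarity
cycles_without = 5000      # approximate cycles needed without it
cpu_min_with = 35          # CPU minutes with complementarity
cpu_min_without = 73       # CPU minutes without complementarity

# Simulated real time with complementarity, in seconds (paper: "about 24 sec").
sim_seconds_with = cycles_with * CYCLE_MS / 1000

# CPU seconds per cycle (paper: "nearly 1 sec").
cpu_per_cycle_with = cpu_min_with * 60 / cycles_with
cpu_per_cycle_without = cpu_min_without * 60 / cycles_without

# Ratio of cycles without versus with complementarity (paper: "double").
cycle_ratio = cycles_without / cycles_with

print(sim_seconds_with, cpu_per_cycle_with, cpu_per_cycle_without, cycle_ratio)
```

The figures are mutually consistent: both runs cost a little under 0.9 CPU seconds per cycle, and the run without complementarity needs roughly 2.1 times as many cycles.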
Figure 6: Summary plot of relative errors with and without complementarity. Results from the control and experimental runs for three different sets of random initial values of synaptic strength. Pairs of error counts are used as the coordinates in this 2D plot. That is, the total errors in learning a given pattern with complementarity active is used for the horizontal position of a plotted point and the total errors in learning that same pattern with complementarity disabled is used for the vertical position. Each such point is connected by a line. The solid, dotted, and dashed lines show the results for three different sets of initial synaptic strengths. The solid line corresponds to the data in Figure 5. The 1:1 line shows where the lines would fall if complementarity had no effect. (Horizontal axis: errors with complementarity, 100 to 500.)
Acknowledgments
This research was partially supported by NIH Grants 5 T32 MH1534506, 07. Computing resources were provided by the IBM Corporation under its Research Support Program. Faculty sponsor was J. D. Schlag, University of California, Los Angeles.
James L. Adams
References
Adams, J. L. 1989. The Principles of "Complementarity," "Cooperativity," and "Adaptive Error Control" in Pattern Learning and Recognition: A Physiological Neural Network Model Tested by Computer Simulation. Ph.D. dissertation, University of California, Los Angeles. On file at University Microfilms Inc., Ann Arbor, Michigan.
De Monasterio, F. M. and Gouras, P. 1975. Functional properties of ganglion cells of the rhesus monkey retina. J. Physiol. 251, 167-195.
Dowling, J. E. 1979. Information processing by local circuits: The vertebrate retina as a model system. In The Neurosciences Fourth Study Program, F. O. Schmitt and F. G. Worden, eds. MIT Press, Cambridge, MA.
Hagins, W. A. 1979. Excitation in vertebrate photoreceptors. In The Neurosciences Fourth Study Program, F. O. Schmitt and F. G. Worden, eds. MIT Press, Cambridge, MA.
Hebb, D. O. 1949. The Organization of Behavior. Wiley, New York.
Hubel, D. H. and Wiesel, T. N. 1977. Functional architecture of macaque monkey visual cortex. Proc. R. Soc. London, B 198, 1-59.
Lorente de Nó, R. 1949. The structure of the cerebral cortex. In Physiology of the Nervous System, 3rd ed., J. F. Fulton, ed. Oxford University Press, New York.
Lund, J. S. 1981. Intrinsic organization of the primate visual cortex, area 17, as seen in Golgi preparations. In The Organization of the Cerebral Cortex, F. O. Schmitt, F. G. Worden, G. Adelman, and S. G. Dennis, eds. MIT Press, Cambridge, MA.
Schiller, P. H., Sandell, J. H., and Maunsell, J. H. R. 1986. Functions of the ON and OFF channels of the visual system. Nature (London) 322, 824-825.
Van Essen, D. C. 1985. Functional organization of primate visual cortex. In Cerebral Cortex, Vol. 3, A. Peters and E. G. Jones, eds. Plenum, New York.
Werblin, F. S. 1973. The control of sensitivity in the retina. Sci. Am. January, 70-79.
Received 11 July 1989; accepted 30 November 1989.
Communicated by Graeme Mitchison
Hebb-Type Dynamics is Sufficient to Account for the Inverse Magnification Rule in Cortical Somatotopy
Kamil A. Grajski*
Michael M. Merzenich
Coleman Memorial Laboratories, University of California, San Francisco, CA 94143 USA
The inverse magnification rule in cortical somatotopy is the experimentally derived inverse relationship between cortical magnification (area of somatotopic map representing a unit area of skin surface) and receptive field size (area of restricted skin surface driving a cortical neuron). We show by computer simulation of a simple, multilayer model that Hebb-type synaptic modification subject to competitive constraints is sufficient to account for the inverse magnification rule.
1 Introduction
The static properties of topographic organization in the somatosensory system are well-characterized experimentally. Two somatosensory map-derived variables are of particular interest: (1) receptive field size and (2) cortical magnification. Receptive field size for a somatosensory neuron is defined as the restricted skin surface (measured in mm²) which maximally drives (measured by pulse probability) the unit. Cortical magnification is the cortical region (measured in mm²) over which neurons are driven by mechanical stimulation of a unit area of skin. Somatotopic maps are characterized by an inverse relationship between cortical magnification and receptive field size (Sur et al. 1980).
Recent experimental studies of reorganization of the hand representation in adult owl monkey cortex bear directly on the inverse magnification rule. First, cortical magnification is increased (receptive field sizes decreased) for the one or more digit tips stimulated over a several week period in monkeys undergoing training on a behavioral task (Jenkins et al. 1989). Second, Jenkins and Merzenich (1987) reduced cortical magnification (increased receptive field sizes) for restricted hand surfaces by means of focal cortical lesions.
*Present address: Ford Aerospace, Advanced Development Department/MSX-22, San Jose, CA 95161-9041 USA.
Neural Computation 2, 71-84 (1990)
© 1990 Massachusetts Institute of Technology
Previous neural models have focused on self-organized topographic maps (Willshaw and von der Malsburg 1976; Takeuchi and Amari 1979; among others). The model networks typically consist of a prewired multilayer network in which self-organizing dynamics refines the initially random or roughly topographic map. The capacity of self-organization typically takes the form of an activity-dependent modification for synaptic strengths, e.g., a Hebb-type synapse, and is coupled with mechanisms for restricting activity to local map regions, that is, lateral inhibition. Finally, input to the system contains correlational structure at distinct spatiotemporal scales, depending on the desired complexity of model unit response properties. Are these mechanisms sufficient to account for the new experimental data? In this report we focus on the empirically derived inverse relationship between receptive field size and cortical magnification. We present a simple, multilayer model that incorporates Hebb-type synaptic modification to simulate the long-term somatotopic consequences of the behavioral and cortical lesion experiments referenced above.
2 The Model
The network consists of three hierarchically organized two-dimensional layers (see Fig. 1). A skin (S) layer consists of a 15 x 15 array of nodes. Each node projects to the topographically equivalent 5 x 5 neighborhood of excitatory (E) cells located in the subcortical (SC) layer. We define the skin layer to consist of three 15 x 5 digits (Left, Center, Right). A standard-sized stimulus (3 x 3) is used for all inputs. Each of the 15 x 15 SC nodes contains an E and an inhibitory (I) cell. The E cell acts as a relay cell, projecting to a 7 x 7 neighborhood of the 15 x 15 cortical (C) layer. In addition, each relay cell makes collateral excitatory connections to a 3 x 3 local neighborhood of inhibitory cells; each inhibitory cell makes simple hyperpolarizing connections with a local neighborhood (5 x 5) of SC E cells. Each of the 15 x 15 C nodes also contains an E and I cell. The E cell is the exclusive target of ascending connections and the exclusive origin of descending connections: the descending connections project to a 5 x 5 neighborhood of E and I cells in the topographically equivalent SC zone. Local C connections include local E to E cell connections (5 x 5) and E to I cell connections (5 x 5). The C I cells make simple hyperpolarizing connections intracortically to E cells in a 7 x 7 neighborhood. No axonal or spatial delay is modelled; activity appears instantaneously at all points of projection. (For all connections, the density is constant as a function of distance from the node. A correction term is calculated for each type of connection in each layer to establish planar boundary conditions.) The total number of cells is 900; the total number of synapses is 57,600.
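The cell and synapse totals quoted above can be checked against the stated neighborhood sizes. The tally below is only one reading that happens to reproduce both figures. It assumes that skin nodes are inputs rather than cells, that boundary truncation is ignored, and that the cortical E to E neighborhood excludes the cell itself (24 rather than 25 targets); none of these assumptions is stated in the text. It also computes the number of distinct 3 x 3 stimulus positions on the 15 x 15 skin layer, which matches the 169 patches mentioned later in the paper.

```python
GRID = 15                  # each layer is a 15 x 15 array of nodes
NODES = GRID * GRID        # 225 nodes per layer

# SC and C nodes each hold one E and one I cell.
# ASSUMPTION: skin nodes are treated as inputs, not counted as cells.
cells = 2 * NODES + 2 * NODES

# Fan-out per source node, read off the neighborhood sizes in the text.
fan_out = {
    "S -> SC E": 5 * 5,
    "SC E -> C E": 7 * 7,
    "SC E -> SC I": 3 * 3,
    "SC I -> SC E": 5 * 5,
    "C E -> SC E and I": 2 * 5 * 5,
    "C E -> C E": 5 * 5 - 1,   # ASSUMPTION: no self-connection
    "C E -> C I": 5 * 5,
    "C I -> C E": 7 * 7,
}
# ASSUMPTION: truncation of neighborhoods at the layer edges is ignored.
synapses = NODES * sum(fan_out.values())

# Number of distinct positions for a 3 x 3 stimulus on the skin layer.
STIM = 3
patches = (GRID - STIM + 1) ** 2

print(cells, synapses, patches)  # 900 57600 169
```

Counting a full 25-target E to E neighborhood instead would give 57,825 synapses, so the exclusion of self-connections is the assumption doing the work here.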
Figure 1: Organization of the network model. (A) Layer organization. On the left is shown the divergence of projections from a single skin site to the subcortex and its subsequent projection to the cortex: Skin (S) to Subcortex (SC), 5 x 5; SC to Cortex (C), 7 x 7. S is "partitioned" into three 15 x 5 "digits": Left, Center, and Right. The standard S stimulus used in all simulations is shown lying on digit Left. On the right is shown the spread of projection from C to SC, 5 x 5. (B) Local circuits. Each node in the SC and C layers contains an excitatory (E) and inhibitory cell (I). In C, each E cell forms excitatory connections with a 5 x 5 patch of I cells; each I cell forms inhibitory connections with a 7 x 7 patch of E cells. In SC, these connections are 3 x 3 and 5 x 5, respectively. In addition, in C only, E cells form excitatory connections with a 5 by 5 patch of E cells. The spatial relationship of E and I cell projections for the central node is shown at left. (Panel titles: A, projection pathways; B, local circuits. Legend: excitation; inhibition; C = cortex, SC = subcortex, S = skin.)
Systematic variation of these connectivity patterns is discussed elsewhere (Grajski and Merzenich, in preparation). The model neuron is the same for all E and I cells: an RC-time constant membrane that is depolarized and (additively) hyperpolarized by linearly weighted connections:
$u_i^{X,Y}$ - membrane potential for unit $i$ of type $Y$ on layer $X$; $v_i^{X,Y}$ - firing rate for unit $i$ of type $Y$ on layer $X$; $\delta_i$ - skin units are OFF (= 0) or ON (= 1); $\tau_m$ - membrane time constant (with respect to unit time); $w_{ij}^{\mathrm{post}(x,y)\,\mathrm{pre}(X,Y)}$ - connection to unit $i$ of postsynaptic type $y$ on postsynaptic layer $x$ from units of presynaptic type $Y$ on presynaptic layer $X$. Each summation term is normalized by the number of incoming connections (corrected for planar boundary conditions) contributing to the term. (Numerical integration is with the Adams-Bashforth predictor method bootstrapped with the fourth-order RK method; with the step size used (0.2), a corrector step is unnecessary.)
Each unit converts membrane potential to a continuous-valued output value $v_i$ via a sigmoidal function representing an average firing rate ($\beta = 4.0$):
Synaptic strength is modified in three ways: (1) activity-dependent change, (2) passive decay, and (3) normalization. Activity-dependent and passive decay terms are as follows:
$w_{ij}$ - connection from cell $j$ to cell $i$; $\tau_{syn} = 0.01\tau_m = 0.005$ - time constant for passive synaptic decay; $\alpha = 0.05$ - the maximum activity-dependent step change; $v_j$, $v_i$ - pre- and postsynaptic output values, respectively. Further modification occurs by a multiplicative normalization performed over the incoming connections for each cell. The normalization is such that the summed total strength of incoming connections is $R$:
$N_i$ - number of incoming connections for cell $i$; $w_{ij}$ - connection from cell $j$ to cell $i$; $R = 2.0$ - the total resource available to cell $i$ for redistribution over its incoming connections.
Network organization is measured using procedures that mimic experimental techniques. Figure 2 shows the temporal response to an applied standard stimulus. Figure 3 shows the stereotypical spatiotemporal response pattern to the same input stimulus. Our model differs from those cited above in that inhibition is determined by network dynamics, not by applying a predefined lateral-inhibition function.
Cortical magnification is measured by "mapping" the network, for example, noting which 3 x 3 skin patch most strongly drives each cortical E cell. The number of cortical nodes driven maximally by the same skin site is the cortical magnification for that skin site. Receptive field size for a C (SC) layer E cell is estimated by stimulating all possible 3 x 3 skin patches (169) and noting the peak response. Receptive field size is defined as the number of 3 x 3 skin patches which drive the unit at ≥ 50% of its peak response.
Figure 2: Temporal response of the self-organizing network to a pulsed input at the skin layer site shown at lower right. Temporal dynamics are a result of net architecture and self-organization; no lateral inhibitory function is explicitly applied. (Legend: cortical E cell, cortical I cell, thalamic E cell, thalamic I cell; stimulus site and stimulus duration are marked. Horizontal axis: time, normalized units.)
3 Results
The platform for simulating the long-term somatotopic consequences of the behavioral and cortical lesion experiments is a refined topographic network. The initial net configuration is such that the location of individual connections is predetermined by topographic projection. Initial connection strengths are drawn from a Gaussian distribution ($\mu = 2.0$, $\sigma^2 = 0.2$).
The refinement process is iterative. Standard-sized skin patches are stimulated (pulse ON for 2.5 normalized time steps, then OFF) in random sequence, but lying entirely within the single digit borders "defined" on the skin layer, that is, no double-digit stimulation. Note, however, that during mapping all possible (169) skin patches are stimulated so that double-digit fields (if any) can be detected. For each patch, the network is allowed to reach steady-state while the plasticity rule is ON. Immediately following steady-state, synaptic strengths are renormalized as described above. (Steady-state is defined as the time point at which C and SC layer E cell depolarization changes by less than 1%. Time to reach steady state averages 3.5-4.5 normalized time units.) The refinement procedure continues until two conditions are met: (1) fewer than 5% of all E cells change their receptive field location and (2) receptive field areas (using the 50% criterion) change by no more than ±1 unit area for 95% of E cells. Such convergence typically requires 10-12 complete passes over the skin layer.
Figure 3: Spatiotemporal response of the self-organizing network to same stimulus as in Figure 2. Surround inhibition accompanies early focal excitation, which gives way to late, strong inhibition.
Simulated mapping experiments show that the refined map captures many features of normal somatotopy. First, cortical magnification is proportional to the frequency of stimulation: (1) equal-sized representations for each digit but (2) greater magnification for the surfaces located immediately adjacent to each digit's longitudinal axis of symmetry. Second, topographic order is highly preserved in all directions within and between digit representations. Third, discontinuities occur between representations of adjacent digits. Fourth, receptive fields are continuous, single-peaked, and roughly uniform in size. Fifth, for adjacent (within-digit) sites, receptive fields overlap significantly (up to 70%, depending on connectivity parameters); overlap decreases monotonically with distance. (Overlap is defined as the intersection of receptive field areas as defined above.) Last, the basis for refinement of topographic properties is the emergence of spatial order in the patterning of afferent and efferent synaptic weights. (See Discussion.)
Jenkins et al. (1989) describe a behavioral experiment that leads to cortical somatotopic reorganization. Monkeys are trained to maintain contact with a rotating disk situated such that only the tips of one or two of their longest digits are stimulated. Monkeys are required to maintain this contact for a specified period of time in order to receive food reward.
Comparison of pre- and post-stimulation maps (or the latter with maps obtained after varying periods without disk stimulation) reveal up to nearly %fold differences in cortical magnification and reduction in receptive field size for stimulated skin. We simulate the above experiment by extending the refinement process described above, but with the probability of stimulating a restricted skin region increased 5:l. Figure 4 shows the area selected for stimulation as well as its pre- and poststimulation SC and C layer representations. Histograms indicate the changes in distributions of receptive field sizes. The model reproduces the experimentally observed inverse relationship between cortical magnification and receptive field size (among several other observations). Subcortical results show an increase in area of representation, but with no significant change in receptive field areas. The subcortical results are predictive - no direct subcortical measurements were made for the behavioral study monkeys. The inverse magnification rule predicts that a decrease in cortical magnification is accompanied by an increase in receptive field areas. Jenkins and Merzenich (1987) tested this hypothesis by inducing focal cortical lesions in the representation of restricted hand surfaces, for example, a
Kamil A. Grajski and Michael M. Merzenich
78
SKIN SURFACE COACTIVATION INllALCORllCAL
FINAL CORTICAL
INllAL SUBCORTICAL
ANALSUBMRTICAL
CO-ACTIVATED SKIN
Figure 4: (A) Simulation of behaviorally controlled skin stimulation experiment (Jenkinset al. 1990). The initial and final zones of (sub)corticalrepresentation of coactivated skin are shown. The coactivated skin (shown at far left) is stimulated 5:l over remaining skin sites using a 3 by 3 stimulus (see Figs. 1-3). single digit. A long-term consequence of such a manipulation is profound cortical map reorganization characterized by (1) a reemergence of a representation of the skin formerly represented in the now lesioned zone in the intact surrounding cortex, (2) the new representation is at the expense of cortical magnification of skin originally represented in those regions, so that (3) large regions of the map contain neurons with abnormally large receptive fields, for example, up to several orders of magnitude larger. We simulate this experiment by "lesioning" the regton of the cortical layer shown in Figure 5. The refinement process described above is continued under these new conditions until topographic map and receptive field size measures converge. The reemergence of representation and changes in distributions of receptive field areas are also shown in Figure 5. The model reproduces the experimentally observed inverse relationship between cortical magnification and receptive field size (among many other phenomena). These results depend on randomizing the synaptic strengths of all remaining intracortical and cortical afferent
[Figure 4, continued. Histogram panels; horizontal axis: area (arbitrary units). Legend recovered for one panel: Normal Subcortex, BehStim Subcortex.]
Figure 4: Cont’d. Histograms depict changes in receptive field areas for topographically equivalent, contiguous, equal-sized zones in (sub)cortical layer. There is a strong inverse relationship between cortical magnification and cortical receptive field size.
[Figure 5(A), cortical lesion experiment. Panels: initial cortical, final cortical, initial subcortical, final subcortical, lesioned skin representation.]
Figure 5: Simulation of a somatosensory cortical lesion experiment (Jenkins and Merzenich 1987). The cortical zone representing the indicated skin regions is "destroyed." Random stimulation of the skin layer using a 3 by 3 patch leads to a reemergence of representation along the border of the lesion. Cortical magnification for all skin sites is decreased.

connections as well as enhancing cortical excitation (afferent or intrinsic) or reducing cortical inhibition by a constant factor, 2.0 and 0.5, respectively, in equation set 1. Otherwise, the intact representations possess an overwhelming competitive advantage. In general, this simulation was sensitive to the choice of parameters, whereas the preceding ones were not (see Discussion). Again, the subcortical results are predictive as no direct experimental data exist.
4 Discussion
The inverse magnification rule is one of several "principles" suggested by experimental studies of topographic maps. It is a powerful one as it links global and local map properties. We have shown that a coactivation
[Figure 5, continued. Histogram panels; horizontal axis: area (arbitrary units). Legends recovered: Normal Cortex vs. Lesioned Cortex; (b) Normal Subcortex vs. Lesioned Subcortex; (c).]
Figure 5: Cont'd. Histograms depicting changes in receptive field area for topographically equivalent, contiguous, equal-sized zones in (sub)cortical layers (left-most and right-most 1/3 combined). Note the increase in numbers of large receptive fields.
based synaptic plasticity rule operating under competitive conditions is sufficient to account for this basic organizational principle. What is the basis for these properties? Why does the model behave the way it does? The basis for these effects is the emergence and
reorganization of spatial order in the patterning of synaptic weights. In the unrefined map, connection strengths are distributed N(2.0, 0.2). Following stimulation this distribution alters to a multipeaked, wide-variance weight distribution, with high values concentrated as predicted by coactivation. For instance, some subcortical sites driven by the Center digit alone project to the cortex in such a way that parts of the projection zone cross the cortical digit representational border between Center and Right. The connection strengths onto cortical Center cells are observed to be one to two orders of magnitude greater than those onto cortical cells in the Right representation. Similarly, cortical mutually excitatory connections form local clusters of high magnitude. These are reminiscent of groups (see Pearson et al. 1987). However, in contradistinction to the Pearson study, in this model, receptive field overlap is maintained (among other properties) (Grajski and Merzenich, in preparation). Synaptic patterning is further evaluated by removing pathways and restricting classes of plastic synapses. Repeating the above simulations using networks with no descending projections or using networks with no descending and no cortical mutual excitation yields largely normal topography and coactivation results. Restricting plasticity to excitatory pathways alone also yields qualitatively similar results. (Studies with a two-layer network yield qualitatively similar results.) Thus, the refinement and coactivation experiments alone are insufficient to discriminate fundamental differences between network variants. On the other hand, modeling the long-term consequences of digit amputation requires plastic cortical mutually excitatory connections (Grajski and Merzenich 1990). Simulations of the cortical lesion experiments are confounded by the necessity to redistribute synaptic resources following removal of 33% of the cortical layer's cells.
We explored randomization of the remaining cortical connections in concert with an enhancement of cortical excitation or reduction of cortical inhibition. Without some additional step such as this, the existing representations obtain a competitive advantage that blocks reemergence of representation. Whenever rerepresentation occurs, however, the inverse magnification relationship holds. The map reorganization produced by the model is never as profound as that observed experimentally, suggesting the presence of other nonlocal, perhaps injury-related and neuromodulatory effects (Jenkins and Merzenich 1987). This model extends related studies (Willshaw and von der Malsburg 1976; Takeuchi and Amari 1979; Ritter and Schulten 1986; Pearson et al. 1987). First, the present model achieves lateral inhibitory dynamics by a combination of connectivity and self-organization of those connections; earlier models either apply a lateral-inhibition function in local neighborhoods, or have other fixed inhibitory relationships. Second, the present model significantly extends a recently proposed model of somatosensory reorganization (Pearson et al. 1987) to include (1) a subcortical layer, (2) simulations of additional experimental data, (3) more accurate simulation
of normal and reorganized somatotopic maps, and (4) a simpler, more direct description of possible underlying mechanisms. The present model captures features of static and dynamic (re)organization observed in the auditory and visual systems. Weinberger and colleagues (1989) have observed auditory neuron receptive field plasticity in adult cats under a variety of behavioral and experimental conditions (see also Robertson and Irvine 1989). In the visual system, Wurtz et al. (1989) have observed that several weeks following induction of a restricted cortical lesion in monkey area MT, surviving neurons' receptive field area increased. Kohonen (1982), among others, has explored the computational properties of self-organized topographic mappings. A better understanding of the nature of real brain maps may support the next generation of topographic and related computational networks (e.g., Moody and Darken 1989).
Acknowledgments

This research was supported by NIH Grants NS10414 and GM07449, Hearing Research Inc., the Coleman Fund, and the San Diego Supercomputer Center. K.A.G. gratefully acknowledges reviewers' comments and helpful discussions with Terry Allard, Bill Jenkins, Gregg Recanzone, and especially Ken Miller.
References

Grajski, K. A. and Merzenich, M. M. 1990. Neural network stimulation of somatosensory representational plasticity. In Neural Information Processing Systems, Vol. 2. D. Touretzky, ed., in press.

Grajski, K. A. and Merzenich, M. M. 1990. Hebbian synaptic plasticity in a multi-layer, distributed neural network model accounts for key features of normal and reorganized somatosensory (topographic) maps. In preparation.

Jenkins, W. M. and Merzenich, M. M. 1987. Reorganization of neocortical representations after brain injury: A neurophysiological model of the bases of recovery from stroke. In Progress in Brain Research, F. J. Seil, E. Herbert, and B. M. Carlson, eds., Vol. 71, pp. 249-266. Elsevier, Amsterdam.

Jenkins, W. M., Merzenich, M. M., Ochs, M. T., Allard, T., and Guic-Robles, E. 1990. Functional reorganization of primary somatosensory cortex in adult owl monkeys after behaviorally controlled tactile stimulation. J. Neurophys. 63(1).

Kohonen, T. 1982. Self-organized formation of topologically correct feature maps. Biol. Cybernet. 43, 59-69.

Moody, J. and Darken, C. J. 1989. Fast learning in networks of locally-tuned processing units. Neural Comp. 1, 281-294.

Pearson, J. C., Finkel, L. H., and Edelman, G. M. 1987. Plasticity in the organization of adult cerebral cortical maps: A computer simulation based on neuronal group selection. J. Neurosci. 7, 4209-4223.

Ritter, H. and Schulten, K. 1986. On the stationary state of Kohonen's self-organizing sensory mapping. Biol. Cybernet. 54, 99-106.

Robertson, D. and Irvine, D. R. F. 1989. Plasticity of frequency organization in auditory cortex of guinea pigs with partial unilateral deafness. J. Comp. Neurol. 282, 456-471.

Sur, M., Merzenich, M. M., and Kaas, J. H. 1980. Magnification, receptive-field area and "hypercolumn" size in areas 3b and 1 of somatosensory cortex in owl monkeys. J. Neurophys. 44, 295-311.

Takeuchi, A. and Amari, S. 1979. Formation of topographic maps and columnar microstructures in nerve fields. Biol. Cybernet. 35, 63-72.

Weinberger, N. M., Ashe, J. H., Metherate, R., McKenna, T. M., Diamond, D. M., and Bakin, J. 1990. Retuning auditory cortex by learning: A preliminary model of receptive field plasticity. Concepts Neurosci., in press.

Willshaw, D. J. and von der Malsburg, C. 1976. How patterned neural connections can be set up by self-organization. Proc. R. Soc. London B 194, 431-445.

Wurtz, R., Komatsu, H., Dürsteler, M. R., and Yamasaki, D. S. G. 1990. Motion to movement: Cerebral cortical visual processing for pursuit eye movements. In Signal and Sense: Local and Global Order in Perceptual Maps, E. W. Gall, ed. Wiley, New York, in press.
Received 25 September 1989; accepted 20 December 1989.
Communicated by Terrence J. Sejnowski
Optimal Plasticity from Matrix Memories: What Goes Up Must Come Down

David Willshaw
Peter Dayan
Centre for Cognitive Science and Department of Physics, University of Edinburgh, Edinburgh, Scotland
A recent article (Stanton and Sejnowski 1989) on long-term synaptic depression in the hippocampus has reopened the issue of the computational efficiency of particular synaptic learning rules (Hebb 1949; Palm 1988a; Morris and Willshaw 1989): homosynaptic versus heterosynaptic and monotonic versus nonmonotonic changes in synaptic efficacy. We have addressed these questions by calculating and maximizing the signal-to-noise ratio, a measure of the potential fidelity of recall, in a class of associative matrix memories. Up to a multiplicative constant, there are three optimal rules, each providing for synaptic depression such that positive and negative changes in synaptic efficacy balance out. For one rule, which is found to be the Stent-Singer rule (Stent 1973; Rauschecker and Singer 1979), the depression is purely heterosynaptic; for another (Stanton and Sejnowski 1989), the depression is purely homosynaptic; for the third, which is a generalization of the first two, and has a higher signal-to-noise ratio, it is both heterosynaptic and homosynaptic. The third rule takes the form of a covariance rule (Sejnowski 1977a,b) and includes, as a special case, the prescription due to Hopfield (1982) and others (Willshaw 1971; Kohonen 1972).
Neural Computation 2, 85-93 (1990)
© 1990 Massachusetts Institute of Technology

In principle, the association between the synchronous activities in two neurons could be registered by a mechanism that increases the efficacy of the synapses between them, in the manner first proposed by Hebb (1949); the generalization of this idea to the storage of the associations between activity in two sets of neurons is in terms of a matrix of modifiable synapses (Anderson 1968; Willshaw et al. 1969; Kohonen 1972). This type of architecture is seen in the cerebellum (Eccles et al. 1968) and in the hippocampus (Marr 1971) where associative storage of the Hebbian type (Bliss and Lømo 1973) has been ascribed to the NMDA receptor (Collingridge et al. 1983; Morris et al. 1986). A number of questions concerning the computational power of certain synaptic
modification rules in matrix memories have direct biological relevance. For example, is it necessary, or merely desirable, to have a rule for decreasing synaptic efficacy under conditions of asynchronous firing, to complement the increases prescribed by the pure Hebbian rule (Hebb 1949)? The need for a mechanism for decreasing efficacy is pointed to by general considerations, such as those concerned with keeping individual synaptic efficacies within bounds (Sejnowski 1977b); and more specific considerations, such as providing an explanation for ocular dominance reversal and other phenomena of plasticity in the visual cortex (Bienenstock et al. 1982; Singer 1985). There are two types of asynchrony between the presynaptic and the postsynaptic neurons that could be used to signal a decrease in synaptic efficacy (Sejnowski 1977b; Sejnowski et al. 1988): the presynaptic neuron might be active while the postsynaptic neuron is inactive (homosynaptic depression), or vice versa (heterosynaptic depression). We have explored the theoretical consequences of such issues. We consider the storage of a number R of pattern pairs [represented as the binary vectors A(w) and B(w) of length m and n, respectively] in a matrix associative memory. The matrix memory has m input lines and n output lines, carrying information about the A-patterns and the B-patterns, respectively, each output line being driven by a linear threshold unit (LTU) with m variable weights (Fig. 1). Pattern components are generated independently and at random. Each component of an A-pattern takes the value 1 (representing the active state) with probability p and the value c (inactive state) with probability 1 - p. Likewise, the probabilities for the two possible states 1 and c for a component of a B-pattern are r, 1 - r. In the storage of the association of the wth pair, the amount $\Delta_{ij}$ by which the weight $W_{ij}$ is changed depends on the values of the pair of numbers $[A_i(w), B_j(w)]$.
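As a minimal sketch of the storage step just described, assuming NumPy, the weight matrix can be accumulated pair by pair from a four-entry table of weight changes indexed by the states of the input and output lines. The names `make_patterns`, `store`, and `HEBB`, and the sizes and probabilities used, are illustrative choices rather than anything specified in the paper.

```python
import numpy as np

def make_patterns(num_pairs, m, n, p, r, rng):
    """Draw activity indicators: True marks the active state (value 1),
    False marks the inactive state (value c)."""
    a_active = rng.random((num_pairs, m)) < p
    b_active = rng.random((num_pairs, n)) < r
    return a_active, b_active

def store(a_active, b_active, rule):
    """Accumulate the m-by-n weight matrix over all pattern pairs.

    rule[(pre, post)] is the change applied to a weight when its input
    line is in state `pre` and its output line is in state `post`.
    """
    m, n = a_active.shape[1], b_active.shape[1]
    weights = np.zeros((m, n))
    for a, b in zip(a_active, b_active):
        for (pre, post), delta in rule.items():
            # Boolean outer product selects the synapses in this joint state.
            weights += delta * np.outer(a == pre, b == post)
    return weights

# Illustrative rule (an assumption for this sketch): a pure Hebbian
# increase, applied only when both lines are active.
HEBB = {(False, False): 0.0, (False, True): 0.0,
        (True, False): 0.0, (True, True): 1.0}

rng = np.random.default_rng(0)
a_active, b_active = make_patterns(num_pairs=50, m=16, n=4, p=0.3, r=0.3, rng=rng)
weights = store(a_active, b_active, HEBB)
```

Keeping the rule as a table over the four joint states makes it straightforward to substitute other rules, including ones that decrease efficacy, without changing the storage loop.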
Once the entire set of patterns has been learned, retrieval of a previously stored B-pattern is effected by the presentation of the corresponding A-pattern. The jth LTU calculates the weighted sum of its inputs, $d_j[A(w)] = \sum_{i=1}^{m} W_{ij} A_i(w)$.
The state of output line j is then set to c or 1, according to whether $d_j[A(w)]$ is less than or greater than the threshold $\theta_j$. The signal-to-noise ratio $\rho$ is a measure of the ability of an LTU to act as a discriminator between those A(w) that are to elicit the output c and those that are to elicit the output 1. It is a function of the parameters of the system, and is calculated by regarding $d_j[A(w)]$ as the sum of two components: the signal, $s_j(w)$, which stems from that portion of the
Figure 1: The matrix memory, which associates A-patterns with B-patterns. Each weight $W_{ij}$ is a linear combination over the patterns: $W_{ij} = \sum_{w=1}^{R} \Delta_{ij}(w)$, where $\Delta$ is given in the table below. The matrix shows the steps taken in the retrieval of the pattern B(w) that was previously stored in association with A(w). For good recall, the calculated output B' must resemble the desired output B(w).
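weights arising from the storage of pattern w, and the noise, $n_j(w)$, which is due to the contribution from all the other patterns to the weights:

$$d_j[A(w)] = s_j(w) + n_j(w) = \sum_{i=1}^{m} A_i(w)\,\Delta_{ij}(w) + \sum_{i=1}^{m} A_i(w) \sum_{w' \neq w} \Delta_{ij}(w') \qquad (1)$$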
In most applications of signal-to-noise (S/N) analysis, the noise terms have the same mean and are uncorrelated between different patterns. When these assumptions are applied to the current model, maximizing the signal-to-noise ratio with respect to the learning rule parameters $\alpha$, $\beta$, $\gamma$, and $\delta$ leaves them dependent on the parameter c (Palm 1988b). However, the mean of the noise $n_j(w)$ in equation 1 is biased by the exclusion of the contribution $\Delta_{ij}(w)$, whose value depends on the target output for pattern w; and the noise terms for two different patterns $w_1$ and $w_2$ are in general correlated through the R - 2 contributions to the value of $\Delta_{ij}(w)$, which occur in both terms. Our analysis (Fig. 2) takes account of these factors, and its validity is confirmed by the results of computer simulation (Table 1). Maximizing the expression we obtain for the signal-to-noise ratio in terms of the learning parameters leads to the three c-independent rules, R1, R2, and R3. To within a multiplicative constant they are
Rule R1 is a generalization of the Hebb rule, called the covariance rule (Sejnowski 1977a; Sejnowski et al. 1988; Linsker 1986). In this formulation, the synaptic efficacy between any two cells is changed according to the product of the deviation of each cell's activity from the mean. When pattern components are equally likely to be in the active and the inactive states (p = r = 1/2), R1 takes the form of the "Hopfield" rule (Hopfield 1982), and has the lowest signal-to-noise ratio of all such rules. Rule R1 prescribes changes in efficacy for all of the four possible states of activity seen at an individual synapse, and thus utilizes both heterosynaptic and homosynaptic asynchrony. It also has the biologically undesirable property that changes can occur when neither pre- nor postsynaptic neuron is
[Figure 2. Frequency distributions labeled "Low Mean" and "High Mean".]
Figure 2: Signal-to-noise ratios. The frequency graph of its linear combinations d(w) for a given LTU. The two classes to be distinguished appear as approximately Gaussian distributions, with high mean $\mu_h$, low mean $\mu_l$, and variances $\sigma_h^2$, $\sigma_l^2$, where $\sigma_h^2 \approx \sigma_l^2$. For good discrimination the two distributions should be narrow and widely separated. In our calculation of the signal-to-noise ratio, the mean of the noise n(w) (equation 1) differs for high and low patterns, and so the expressions for the expected values of $\mu_h$ and $\mu_l$ were calculated separately. Second, the correlations between the noise terms obscuring different patterns add an extra quantity to the variance of the total noise. The entire graph of the frequency distributions for high and low patterns is displaced from the expected location, by a different amount for each unit. This overall displacement does not affect the power of the unit to discriminate between patterns. In calculating the signal-to-noise ratio, it is therefore appropriate to calculate the expected dispersion of the noise about the mean for each unit, rather than using the variance, which would imply measuring deviations from the expected mean. The expected dispersion for high patterns is defined as
H being the number of w for which B(w) = 1, and $s_l^2$ is defined similarly as the expected dispersion for low patterns. The signal-to-noise ratio for a discriminator is therefore defined as

$$\rho = \frac{\left(E[\mu_h - \mu_l]\right)^2}{\tfrac{1}{2}\left(s_h^2 + s_l^2\right)}$$

It depends on all the parameters of the system, and may be maximized with respect to those that define the learning rule, $\alpha$, $\beta$, $\gamma$, and $\delta$. The maxima are found at the rules R1, R2, and R3 described in the text. The effect of changing c is to shift and compress or expand the distributions. For a given LTU, it is always possible to move the threshold with c in such a way that exactly the same errors are made (Table 1a). The choice of c partly determines the variability of $\mu_h$ and $\mu_l$ across the set of units, and this variability is minimized at c = -p/(1 - p). With this value of c, and in the limit of large m, n, and R, its effect becomes negligible, and hence the thresholds for all the units may be set equal.
active ($\alpha \neq 0$). However, the change to be applied in the absence of any activity can be regarded as a constant background term of magnitude pr. In rule R2, the so-called Stent-Singer rule (Stent 1973; Rauschecker and Singer 1979), depression is purely heterosynaptic. For a given number of stored associations, the signal-to-noise ratio for R2 is less than that for R1 by a factor of 1/(1 - r). In rule R3, which Stanton and Sejnowski (1989) proposed for the mossy fibers in the hippocampal CA3 region, and which is also used in theoretical schemes (Kanerva 1988), depression is purely
homosynaptic. R3 has a signal-to-noise ratio less than R1 by a factor of 1/(1 - p). If p = r, R2 and R3 have the same signal-to-noise ratio. All the rules have the automatic property that the expected value of each weight is 0; that is, what goes up does indeed come down. One way of implementing this property that avoids the necessity of synapses switching between excitatory and inhibitory states is to assign each synapse a constant positive amount of synaptic efficacy initially. Our results do not apply exactly to this case, but an informal argument suggests that initial synaptic values should be chosen so as to keep the total synaptic efficacy as small as possible, without any value going negative. Given that it is likely that the level of activity in the nervous system is relatively low (< 10%), it is predicted that the amount of (homosynaptic) long-term potentiation (Bliss and Lømo 1973) per nerve cell will be an order of magnitude greater than the amount of either homosynaptic or heterosynaptic depression. Further, under R1, any experimental technique for investigating long-term depression that relies on the aggregate effect on one postsynaptic cell of such sparse activity will find a larger heterosynaptic than homosynaptic effect. As for the Hopfield case (Willshaw 1971; Kohonen 1972; Hopfield 1982), for a given criterion of error (as specified by the signal-to-noise ratio) the number of associations that may be stored is proportional to the size, m, of the network. It is often noted (Willshaw et al. 1969; Amit et al. 1987; Gardner 1987; Tsodyks and Feigel'man 1988) that the sparser the coding of information (i.e., the lower the probability of a unit being active) the more efficient is the storage and retrieval of information.
This is also true for rules R1, R2, and R3, but the information efficiency of the matrix memory, measured as the ratio of the number of bits stored as associations to the number of bits required to represent the weights, is always less than in similar memories incorporating clipped synapses (Willshaw et al. 1969), that is, ones having limited dynamic range. The signal-to-noise ratio measures only the potential of an LTU to recall correctly the associations it has learned. By contrast, the threshold $\theta_j$ determines the actual frequency of occurrence of the two possible types of misclassification. The threshold may be set according to some further optimality criterion, such as minimizing the expected number of recall errors for a pattern. For a given LTU, the optimal value of $\theta$ will depend directly on the actual associations it has learned rather than just on the parameters generating the patterns, which means that each LTU should have a different threshold. It can be shown that, as m, n, and R grow large, setting c at the value -p/(1 - p) enables the thresholds of all the LTUs to be set equal (and dependent only on the parameters, not the actual patterns) without introducing additional error. Although natural processing is by no means constrained to follow an optimal path, it is important to understand the computational consequences of suggested synaptic mechanisms. The signal-to-noise ratio
Table 1a

p, r    c       Expected S/N    Actual S/N ± σ    Expected errors    Actual errors
0.5     -1      10              11 ± 1.3          1.1                1.1
0.4     -1      7.5             8.3 ± 1.5         1.6                1.7
0.3     -1      1.4             1.3 ± 0.40        4.5                4.6
0.2     -1      0.25            0.32 ± 0.22       4.2                4.0
0.5     -1      10              11 ± 1.3          1.1                1.1
0.5     -0.5    10              11 ± 1.3          1.1                1.1
0.5     0       10              11 ± 1.3          1.1                1.1
0.5     0.5     10              11 ± 1.3          1.1                1.1

Previous S/N column of Table 1a (only four values survive; their row alignment is not recoverable): 10, 10, 11, 12

Table 1b

p, r    c    Expected S/N    Actual S/N ± σ    Previous S/N    Expected errors    Actual errors
0.5     0    0.05            0.10 ± 0.11       6.8             9.1                8.7
0.4     0    0.11            0.11 ± 0.09       7.6             7.8                7.6
0.3     0    0.31            0.34 ± 0.15       9.4             5.8                5.9
0.2     0    1.1             1.2 ± 0.47        13              3.6                3.4
0.1     0    5.9             5.3 ± 1.8         26              0.92               1.2
0.05    0    16              28 ± 18           51              0.16               0.15

Table 1c

p, r    R1    R2, R3    Hebb     Hopfield
0.5     10    5.1       0.050    10
0.4     11    6.4       0.11     7.5
0.3     12    8.5       0.31     1.4
0.2     16    13        1.1      0.25
0.1     28    26        5.9      0.045
0.05    54    51        16       0.015
Table 1: Simulations. The object of the simulations was to check the formulae developed in our analysis and compare them with a previous derivation (Palm 1988b). The matrix memory has m = 512 input lines and n = 20 output lines. To ensure noticeable error rates, the number of pattern pairs was set at R = 200. In all cases p = r.
1a: The Hopfield (1982) rule $(\alpha, \beta, \gamma, \delta) = (1, -1, -1, 1)$. Columns 3 and 4 compare the S/N ratio expected from our analysis and that measured in the simulation, the latter also showing the standard error measured over the output units; column 5 gives the S/N ratio calculated on the basis of previous analysis (Palm 1988b). Columns 6 and 7 compare the expected and measured numbers of errors per pattern, the threshold being set so that the two possible types of error occurred with equal frequency. For good recall (< 0.03 errors per unit) the S/N ratio must be at least 16. The lack of dependence on the value of c is demonstrated in rows 5-8. The same patterns were used in each case.
1b: Similar results for the nonoptimal Hebb (1949) rule $(\alpha, \beta, \gamma, \delta) = (0, 0, 0, 1)$.
1c: Values of the signal-to-noise ratio for the rules R1, R2, and R3 and the Hebb and the Hopfield rules. R1 has higher signal-to-noise ratio than R2 and R3, but for the latter two it is the same since p = r here. The Hebb rule approaches optimality in the limit of sparse coding; conversely, the Hopfield rule is optimal at p = r = 1/2.
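The kind of check reported in Table 1 can be sketched roughly as follows, assuming NumPy. The sketch stores pattern pairs with a covariance-type rule (the product of each line's deviation from its mean activity), measures a per-unit signal-to-noise ratio from the dispersion of the weighted sums about the class means, and converts a signal-to-noise ratio into an error rate. All function names are illustrative, the signal-to-noise expression is an assumed reading of the Figure 2 definition, and the error-rate formula assumes two equal-variance Gaussian classes with the threshold midway between their means; none of this is the authors' simulation code.

```python
import math
import numpy as np

def covariance_weights(a_active, b_active, p, r):
    """Sum over pairs of (pre - p) * (post - r), with activity coded 0/1."""
    return (a_active.astype(float) - p).T @ (b_active.astype(float) - r)

def empirical_snr(a_values, b_active, weights):
    """Per-unit S/N: squared separation of the class means divided by the
    average dispersion of the weighted sums about those means."""
    d = a_values @ weights                      # one row of sums per pattern
    snr = np.empty(weights.shape[1])
    for j in range(weights.shape[1]):
        high = d[b_active[:, j], j]             # patterns whose target is 1
        low = d[~b_active[:, j], j]             # patterns whose target is c
        snr[j] = (high.mean() - low.mean()) ** 2 / (0.5 * (high.var() + low.var()))
    return snr

def gaussian_error_rate(snr):
    """Assumed model: equal-variance Gaussians, midpoint threshold,
    giving an error probability of Phi(-sqrt(snr) / 2) per unit."""
    return 0.5 * math.erfc(math.sqrt(snr) / (2.0 * math.sqrt(2.0)))

# Sizes taken from the Table 1 caption; p = r = 0.5 and c = -1 as in row 1 of 1a.
m, n, num_pairs, p, r, c = 512, 20, 200, 0.5, 0.5, -1.0
rng = np.random.default_rng(0)
a_active = rng.random((num_pairs, m)) < p
b_active = rng.random((num_pairs, n)) < r
a_values = np.where(a_active, 1.0, c)           # inputs take the value 1 or c
weights = covariance_weights(a_active, b_active, p, r)
mean_snr = empirical_snr(a_values, b_active, weights).mean()
```

Under these assumptions `mean_snr` should fall in the general neighbourhood of the first row of Table 1a, and `gaussian_error_rate(16.0)` comes out just under the 0.03 errors per unit quoted in the caption as the criterion for good recall.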
indicates how good a linear threshold unit may be at its discrimination task, and consequently how much information can be stored by a network of a number of such units. Synaptic depression is important for computational reasons, independent of any role it might play in preventing saturation of synaptic strengths. Up to a multiplicative constant, only three learning rules maximize the signal-to-noise ratio. Each rule involves both decreases and increases in the values of the weights. One rule involves heterosynaptic depression, another involves homosynaptic depression, and in the third rule there is both homosynaptic and heterosynaptic depression. All rules work most efficiently when the patterns of neural activity are sparsely coded.

Acknowledgments

We thank colleagues, particularly R. Morris, T. Bliss, P. Hancock, A. Gardner-Medwin, and M. Evans, for their helpful comments and criticisms on an earlier draft. This research was supported by grants from the MRC and the SERC.

References

Amit, D. J., Gutfreund, H., and Sompolinsky, H. 1987. Information storage in neural networks with low levels of activity. Phys. Rev. A 35, 2293-2303.

Anderson, J. A. 1968. A memory storage model utilizing spatial correlation functions. Kybernetik 5, 113-119.

Bienenstock, E., Cooper, L. N., and Munro, P. 1982. Theory for the development of neuron selectivity: Orientation specificity and binocular interaction in visual cortex. J. Neurosci. 2, 32-48.

Bliss, T. V. P. and Lømo, T. 1973. Long-lasting potentiation of synaptic transmission in the dentate area of the anaesthetized rabbit following stimulation of the perforant path. J. Physiol. (London) 232, 331-356.

Collingridge, G. L., Kehl, S. J., and McLennan, H. J. 1983. Excitatory amino acids in synaptic transmission in the Schaffer collateral-commissural pathway of the rat hippocampus. J. Physiol. (London) 334, 33-46.

Eccles, J. T., Ito, M., and Szentágothai, J. 1968. The Cerebellum as a Neuronal Machine. Springer Verlag, Berlin.

Gardner, E. 1987. Maximum storage capacity of neural networks. Europhys. Lett. 4, 481-485.

Hebb, D. O. 1949. The Organization of Behavior. Wiley, New York.

Hopfield, J. J. 1982. Neural networks and physical systems with emergent computational abilities. Proc. Natl. Acad. Sci. U.S.A. 79, 2554-2558.

Kanerva, P. 1988. Sparse Distributed Memory. MIT Press/Bradford Books, Cambridge, MA.

Kohonen, T. 1972. Correlation matrix memories. IEEE Trans. Comput. C-21, 353-359.

Linsker, R. 1986. From basic network principles to neural architecture: Emergence of spatial opponent cells. Proc. Natl. Acad. Sci. U.S.A. 83, 7508-7512.

Marr, D. 1971. Simple memory: A theory for archicortex. Phil. Trans. R. Soc. London B 262, 23-81.

Morris, R. G. M., Anderson, E., Baudry, M., and Lynch, G. S. 1986. Selective impairment of learning and blockade of long term potentiation in vivo by AP5, an NMDA antagonist. Nature (London) 319, 774-776.

Morris, R. G. M. and Willshaw, D. J. 1989. Must what goes up come down? Nature (London) 339, 175-176.

Palm, G. 1988a. On the asymptotic information storage capacity of neural networks. In Neural Computers, R. Eckmiller and C. von der Malsburg, eds. NATO ASI Series F41, pp. 271-280. Springer Verlag, Berlin.

Palm, G. 1988b. Local synaptic rules with maximal information storage capacity. In Neural and Synergetic Computers, Springer Series in Synergetics, H. Haken, ed., Vol. 42, pp. 100-110. Springer-Verlag, Berlin.

Rauschecker, J. P. and Singer, W. 1979. Changes in the circuitry of the kitten's visual cortex are gated by postsynaptic activity. Nature (London) 280, 58-60.

Sejnowski, T. J. 1977a. Storing covariance with nonlinearly interacting neurons. J. Math. Biol. 4, 303-321.

Sejnowski, T. J. 1977b. Statistical constraints on synaptic plasticity. J. Theor. Biol. 69, 385-389.

Sejnowski, T. J., Chattarji, S., and Stanton, P. 1988. Induction of synaptic plasticity by Hebbian covariance. In The Computing Neuron, R. Durbin, C. Miall, and G. Mitchison, eds., pp. 105-124. Addison-Wesley, Wokingham, England.

Singer, W. 1985. Activity-dependent self-organization of synaptic connections as a substrate of learning. In The Neural and Molecular Bases of Learning, J. P. Changeux and M. Konishi, eds., pp. 301-335. Wiley, New York.

Stanton, P. and Sejnowski, T. J. 1989. Associative long-term depression in the hippocampus: Induction of synaptic plasticity by Hebbian covariance. Nature (London) 339, 215-218.

Stent, G. S. 1973. A physiological mechanism for Hebb's postulate of learning. Proc. Natl. Acad. Sci. U.S.A. 70, 997-1001.

Tsodyks, M. V. and Feigel'man, M. V. 1988. The enhanced storage capacity in neural networks with low activity level. Europhys. Lett. 6, 101-105.

Willshaw, D. J. 1971. Models of Distributed Associative Memory. Ph.D. Thesis, University of Edinburgh.

Willshaw, D. J., Buneman, O. P., and Longuet-Higgins, H. C. 1969. Nonholographic associative memory. Nature (London) 222, 960-962.
Received 28 November 1989; accepted 19 December 1989.
Communicated by Terrence J. Sejnowski
Pattern Segmentation in Associative Memory

DeLiang Wang
Joachim Buhmann
Computer Science Department, University of Southern California, University Park, Los Angeles, CA 90089-0782 USA

Christoph von der Malsburg
Program in Neural, Informational, and Behavioral Sciences, University of Southern California, University Park, Los Angeles, CA 90089-0782 USA
The goal of this paper is to show how to modify associative memory such that it can discriminate several stored patterns in a composite input and represent them simultaneously. Segmentation of patterns takes place in the temporal domain, components of one pattern becoming temporally correlated with each other and anticorrelated with the components of all other patterns. Correlations are created naturally by the usual associative connections. In our simulations, temporal patterns take the form of oscillatory bursts of activity. Model oscillators consist of pairs of local cell populations connected appropriately. Transition of activity from one pattern to another is induced by delayed self-inhibition or simply by noise.

1 Introduction
Neural Computation 2, 94-106 (1990)
© 1990 Massachusetts Institute of Technology

Associative memory (Steinbuch 1961; Willshaw et al. 1969; Hopfield 1982) is an attractive model for long-term as well as short-term memory. The Hopfield formulation (Hopfield 1982) in particular provides, for both levels, a clear definition of data structure and mechanism of organization. The data structure of long-term memory has the form of synaptic weights for the connections between neurons, and memory traces are laid down with the help of Hebbian plasticity. On the short-term memory level the data structure has the form of stationary patterns of neural activity, and these patterns are organized and stabilized by the exchange of excitation and inhibition. Since in this formulation short-term memory states are dynamic attractor states, one speaks of attractor neural networks. Neurons are interpreted as elementary symbols, and attractor states acquire their symbolic meaning as an unstructured sum of individual symbolic contributions of active neurons. The great virtue of associative memory
is its ability to restore incomplete or corrupted input patterns, that is, its ability to generalize over Hamming distance (the number of bits missing or added). Let us just mention here, since it becomes relevant later, that associative memory can be formulated such that attractors correspond to oscillatory activity vectors instead of stationary ones (Li and Hopfield 1989; Baird 1986; Freeman et al. 1988). Associative memory, taken as a model for functions of the brain, is severely limited in its applicability by a particular weakness - its low power of generalization. This is a direct consequence of the fact that associative memory treats memory traces essentially as monolithic entities. An obvious and indispensable tool for generalization in any system must be the decomposition of complex patterns into functional components and their later use in new combinations. A visual scene is almost always composed of a number of subpatterns, corresponding to coherent objects that are very likely to reappear in different combinations in other scenes (or the same scene under a different perspective and thus in different spatial relations to each other). Associative memory is not equipped for this type of generalization, as has been pointed out before (von der Malsburg 1981, 1983, 1987). It treats any complex pattern as a synthetic whole, glues all pairs of features together, and recovers either the whole pattern or nothing of it. Two different arrangements of the same components cannot be recognized as related and have to be stored separately. There is no generalization from one scene to another, even if they are composed of the same objects. Since complex scenes never recur, a nervous system based on the associative memory mechanism alone possesses little ability to learn from experience. This situation is not specific to vision. Our auditory world is typified by complex sound fields that are composed of sound streams corresponding to independent sources. 
Take as an example the cocktail party phenomenon where we are exposed to several voices of people who talk at the same time. It would be useless to try to store and retrieve the combinations of sounds heard simultaneously from different speakers. Instead, it is necessary to separate the sound streams from each other and store and access them separately. Similar situations characterize other modalities and especially all higher levels of cognitive processing. The basis for the type of generalization discussed here is the specific and all-pervasive property of our world of being causally segmented into strongly cohesive chunks of structure that are associated with each other into more loose and varying combinations. There are two attitudes which an advocate of associative memory could take in response to this evident weakness. One is to see it as a component in a more complex system. The system has other mechanisms and subsystems to analyze and create complex scenes composed of rigid subpatterns that can individually be stored and retrieved in associative memory. The other attitude tries to build on the strengths of associative memory as a candidate cognitive architecture and tries to modify the
model so as to incorporate the ability to segment complex input patterns into subobjects and to compose synthetic scenes from stored objects. We subscribe to this second attitude in this paper. There are three issues that we have to address. The first concerns the type of information on the basis of which pattern segmentation can be performed; second, the data structure of associative memory and attractor neural networks has to be modified by the introduction of variables that express syntactical¹ binding; and third, mechanisms have to be found to organize these variables into useful patterns.
¹We use the word syntactical structure in its original sense of arranging together, that is, grouping or binding together, and do not intend to refer to any specific grammatical or logical rule system.
There are various potential sources of information relevant to segmentation. In highly structured sensory spaces, especially vision and audition, there are general laws of perceptual grouping, based on "common fate" (same pattern of movement, same temporal history), continuity of perceptual quality (texture, depth, harmonic structure), spatial contiguity, and the like. These laws of grouping have been particularly developed in the Gestalt tradition. On the other end of a spectrum, segmentation of complex patterns can be performed by just finding subpatterns that have previously been stored in memory. Our paper here will be based on this memory-dominated type of segmentation. Regarding an appropriate data structure to encode syntactical binding, the old proposal of introducing more neurons (e.g., a grandmother-cell to express the binding of all features that make up a complex pattern) is not a solution (von der Malsburg 1987) and produces many problems of its own. It certainly is useful to have cells that encode high-level objects, but the existence of these cells just creates more binding problems, and their development is difficult and time-consuming. We work here on the assumption (von der Malsburg 1981, 1983, 1987; von der Malsburg and Schneider 1986; Gray et al. 1989; Eckhorn et al. 1988; Damasio 1989; Strong and Whitehead 1989; Schneider 1986) that syntactical binding is expressed by temporal correlations between neural signals. The scheme requires temporally structured neural signals. A set of neurons is syntactically linked by correlating their signals in time. Two neurons whose signals are not correlated or are even anticorrelated express thereby the fact that they are not syntactically bound. There are first experimental observations to support this idea (Gray et al. 1989; Eckhorn et al. 1988). It may be worth noting that in general the temporal correlations relevant here are spontaneously created within the network and correspondingly are not stimulus-locked. As to the issue of how to organize the correlations necessary to express syntactical relationships, the natural mechanism for creating correlations and anticorrelations in attractor neural networks is the exchange of excitation and inhibition. A pair of neurons that is likely to be part of one segment is coupled with an excitatory link. Two neurons that do
not belong to the same segment inhibit each other. The neural dynamics will produce activity patterns that minimize contradictions between conflicting constraints. This capability of sensory segmentation has been demonstrated by a network that expresses general grouping information (von der Malsburg and Schneider 1986; Schneider 1986). The system we are proposing here is based on associative memory, and performs segmentation exclusively with the help of the memory-dominated mechanism. Our version of associative memory is formulated in a way to support attractor limit cycles (Li and Hopfield 1989; Baird 1986; Freeman et al. 1988): If a stationary pattern is presented in the input that resembles one of the stored patterns, then the network settles after some transients into an oscillatory mode. Those neurons that have to be active in the pattern oscillate in phase with each other, whereas all other neurons are silent. In this mode of operation the network has all the traditional capabilities of associative memory, especially pattern completion. When a composite input is presented that consists of the superposition of a few patterns the network settles into an oscillatory mode such that time is divided into periods in which just a single stored state is active. Each period corresponds to one of the patterns contained in the input. Thus, the activity of the network expresses the separate recognition of the individual components of the input and represents those patterns in a way avoiding confusion. This latter capability was not present in previous formulations of associative memory. The necessary couplings between neurons to induce correlations and anticorrelations are precisely those created by Hebbian plasticity. Several types of temporal structure are conceivable as basis for this mode of syntactical binding. At one end of a spectrum there are regular oscillations, in which case states would be distinguished by different phase or frequency. 
At the other end of the spectrum there are chaotic activity patterns (Buhmann 1989). The type of activity we have chosen to simulate here is intermediate between those extremes, being composed of intermittent bursts of oscillations (see Fig. 2), a common phenomenon in the nervous system at all levels.
2 Two Coupled Oscillators
A single oscillator i, the building block of the proposed associative memory, is modeled as a feedback loop between a group of excitatory neurons and a group of inhibitory neurons. The average activity $x_i$ of excitatory group i and the activity $y_i$ of inhibitory group i evolve according to
(2.1)-(2.3)
where $\tau_x$ and $\tau_y$ are the time constants of the excitatory and inhibitory components of the oscillator. An appropriate choice of $\tau_x$, $\tau_y$ allows us to relate the oscillator time to a physiological time scale. $G_x$ and $G_y$ are sigmoid gain functions, which in our simulations have the form
with thresholds $\theta_x$ or $\theta_y$ and gain parameters $1/\lambda_x$ and $1/\lambda_y$. For the reaction of inhibitory groups on excitatory groups we have introduced the nonlinear function $F(x) = (1 - \eta)x + \eta x^2$, $(0 \le \eta \le 1)$, where $\eta$ parameterizes the degree of quadratic nonlinearity. This nonlinearity proved to be useful in making oscillatory behavior a more robust phenomenon in the network, so that in spite of changes in excitatory gain (with varying numbers of groups in a pattern) the qualitative character of the phase portrait of the oscillators is invariant. $H_i$ in equation 2.3 describes delayed self-inhibition of strength $\alpha$ and decay constant $\beta$. This is important to generate intermittent bursting. The synaptic strengths of the oscillators' feedback loop are $T_{rs}$, $r, s \in \{x, y\}$. Equations 2.1 and 2.2 can be interpreted as a mean field approximation to a network of excitatory and inhibitory logical neurons (Buhmann 1989). Notice that $x_i$, $y_i$ are restricted to $[0, \tau_x]$ and $[0, \tau_y]$, respectively. The parameters $\bar{x}$ and $\bar{y}$ may be used to control the average values of x and y. In addition to the interaction between $x_i$ and $y_i$, an excitatory unit $x_i$ receives time-dependent external input $I_i(t)$ from a sensory area or from other networks, and internal input $S_i(t)$ from other oscillators. Let us examine two oscillators of type 2.1-2.3, coupled by associative connections $W_{12}$, $W_{21}$ as shown schematically in Figure 1. The associative interaction is given by
$S_1(t) = W_{12}\, x_2(t); \qquad S_2(t) = W_{21}\, x_1(t)$
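As a rough illustration of the ingredients named above, the following Python sketch implements the quadratic nonlinearity F, a gain function, and the associative coupling between two oscillators. Only F and the two coupling terms are taken from the text. The logistic form of `gain` is an assumption (the text specifies only a sigmoid with a threshold and a gain parameter), the mixing parameter of F is written `eta` here, and all function names are hypothetical.

```python
import math

def F(v, eta=0.4):
    """Quadratic nonlinearity F(v) = (1 - eta) * v + eta * v**2, 0 <= eta <= 1."""
    return (1.0 - eta) * v + eta * v * v

def gain(v, theta, lam):
    """ASSUMED logistic sigmoid with threshold theta and gain parameter 1/lam.

    The exact form used in the paper is not recoverable from this text.
    """
    return 1.0 / (1.0 + math.exp(-(v - theta) / lam))

def coupling(x1, x2, w12, w21):
    """Associative inputs S_1 = W_12 * x_2 and S_2 = W_21 * x_1."""
    return w12 * x2, w21 * x1

# Example: two excitatory groups with mutually excitatory synapses.
s1, s2 = coupling(0.2, 0.5, w12=2.5, w21=2.5)
```

With `eta = 0` the function F is linear and with `eta = 1` it is purely quadratic, so intermediate values interpolate between the two regimes.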
Two cases can be distinguished by the sign of the associative synapses. If both synapses are excitatory ($W_{12} > 0$, $W_{21} > 0$) the two oscillators try to oscillate in step, interrupted by short periods of silence due to delayed self-inhibition. A simulation of this case is shown in Figure 2a. The degree of synchronization can be quantified by measuring the correlation
$C(1,2) = \dfrac{\langle x_1 x_2 \rangle - \langle x_1 \rangle \langle x_2 \rangle}{\Delta x_1 \, \Delta x_2}$
between the two oscillators, $\Delta x_i$ being the variance of $x_i$. For the simulation shown in Figure 2a we measured C(1,2) = 0.99, which indicates almost complete phase locking. The second case, mutual inhibition between the oscillators ($W_{12} < 0$, $W_{21} < 0$), is shown in Figure 2b. The two oscillators now avoid each other, which is reflected by C(1,2) = -0.57.
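The correlation measure can be sketched in a few lines of Python. This is an illustration rather than the authors' code: it assumes that the normalizing spread of each signal is its standard deviation, so that the result lies in [-1, 1] as the reported values 0.99 and -0.57 do, and it uses synthetic sine signals in place of simulated oscillator output. The function names are hypothetical.

```python
import math

def mean(values):
    return sum(values) / len(values)

def correlation(x1, x2):
    """C(1,2) = (<x1 x2> - <x1><x2>) / (spread(x1) * spread(x2)).

    ASSUMPTION: the spread is taken as the standard deviation.
    """
    m1, m2 = mean(x1), mean(x2)
    cov = mean([a * b for a, b in zip(x1, x2)]) - m1 * m2
    spread1 = math.sqrt(mean([(a - m1) ** 2 for a in x1]))
    spread2 = math.sqrt(mean([(b - m2) ** 2 for b in x2]))
    return cov / (spread1 * spread2)

# Synthetic stand-ins for two oscillator traces sampled at dt = 0.01.
t = [0.01 * k for k in range(2000)]
in_phase = [0.5 + 0.5 * math.sin(s) for s in t]
anti_phase = [0.5 - 0.5 * math.sin(s) for s in t]

c_same = correlation(in_phase, in_phase)    # fully phase-locked pair
c_anti = correlation(in_phase, anti_phase)  # pair oscillating in antiphase
```

Perfectly phase-locked traces give a value at the upper end of the range and antiphase traces a value at the lower end; bursting oscillators that merely avoid each other fall in between.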
Figure 1: Diagram of two mutually connected oscillators, showing excitatory, inhibitory, and associative connections.
Alternatively, both oscillators could be continuously active but oscillate out of phase, with 180° phase shift. That mode has been simulated successfully for the case of two oscillators and might be applied to segmentation of an object from its background; for more than two oscillators with mutual inhibition, phase avoidance behavior turns out to be difficult to achieve.
3 Segmentation in Associative Memory
After this demonstration of principle we will now test the associative capabilities of a network of N oscillators connected by Hebbian rules. We store p sparsely coded, random N-bit words $\xi^\nu = \{\xi_i^\nu\}_{i=1}^{N}$, with pattern index $\nu = 1, \ldots, p$. The probability that a bit equals 1 is a, that is, $P(\xi_i^\nu) = a\,\delta(\xi_i^\nu - 1) + (1 - a)\,\delta(\xi_i^\nu)$ with typically a < 0.2. The synapses are chosen according to the Hebbian rule
(3.1)
With connectivity 3.1, oscillator i receives input $S_i(t) = \sum_{k \neq i} W_{ik}\, x_k(t)$ from other oscillators. In the following simulation, 50 oscillators and 8 patterns were stored in the memory. For simplicity we have chosen patterns of equal size (8 active units).
Figure 2: (a) Simulated output pattern of two mutually excitatory oscillators. The parameter values for the two oscillators are the same: $\tau_x = 0.9$, $\tau_y = 1.0$, $T_{xx} = 1.0$, $T_{xy} = 1.9$, $T_{yx} = 1.3$, $T_{yy} = 1.2$, $\eta = 0.4$, $\lambda_x = \lambda_y = 0.05$, $\theta_x = 0.4$, $\theta_y = 0.6$, $I_1 = I_2 = 0.2$, $\alpha = 0.2$, $\beta = 0.14$, $\bar{x} = \bar{y} = 0.2$, $W_{12} = W_{21} = 2.5$. Initial values: $x_1(0) = 0.0$, $x_2(0) = 0.2$, $y_1(0) = y_2(0) = 0.0$. The equations have been integrated with the Euler method, $\Delta t = 0.01$, 14,000 integration steps. (b) Simulated output pattern of two mutually inhibitory oscillators. All parameters are the same as in (a), except that $W_{12} = W_{21} = -0.84$, $\alpha = 0.1$, $\beta = 0.26$.
The first three patterns, which will be presented to the network in the following simulation, have the form
$\xi^1 = (1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,1,0,\ldots,0)$
$\xi^2 = (0,0,0,0,0,0,1,1,1,1,1,1,1,0,0,0,0,0,1,0,\ldots,0)$
$\xi^3 = (1,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,0,\ldots,0)$
(3.2)
Notice the 25% mutual overlap among these 3 patterns and bits $\xi_{19}^\nu = 1$ for patterns $\nu = 1, 2, 3$. With this choice of stored patterns we have tested pattern recall and pattern completion after presentation of just one incomplete pattern, the fundamental capability of associative memory. The network restored the information missing from the fragment within one or two cycles. The same behavior had been demonstrated by Freeman et al. (1988). A more intriguing dynamic behavior is shown by the network if we present all three patterns $\xi^1$, $\xi^2$, $\xi^3$ or parts of them simultaneously. In all simulations external input was time-independent but similar results can be expected for time-dependent input as used in Li and Hopfield (1989). The result of a simulation is shown in Figure 3 where the input is a superposition of patterns $\xi^1$, $\xi^2$, $\xi^3$ with one bit missing in each pattern (see caption of Fig. 3). In this figure only the first 19 oscillators are monitored; the others stay silent due to lack of input and mutual inhibition among oscillators representing different patterns. All three patterns are recognized, completed, and recalled by the network. In addition to the capabilities of conventional associative memory the network is able to segment patterns in time. The assembly of oscillators representing a single input pattern is oscillating in a phase-locked way for several cycles. This period is followed by a state of very low activity, during which another assembly takes its turn to oscillate. In Figure 4 we have plotted the correlations between the first 19 oscillators. The oscillators in one pattern are highly correlated, that is coactive and phase-locked, whereas oscillators representing different patterns are anticorrelated. Oscillators 1, 7, and 13, which belong to two patterns each, stay on for two periods. Oscillator 19, which belongs to all three active patterns, stays on all the time. 
According to a number of simulation experiments, results are rather stable with respect to variation of parameters. Switching between one pattern and another can be produced either by noise, or by delayed self-inhibition (the case shown here), or by a modulation of external input. A mixture of all three is likely to be biologically relevant. The case shown here is dominated by delayed self-inhibition and has a small admixture of noise. The noise-dominated case, which we have also simulated, has an irregular succession of states and takes longer to give each input state a chance. Delayed self-inhibition might also be used in a nonoscillatory associative memory to generate switching between several input patterns. Our simulations, however, indicate that limit cycles facilitate transitions and make them more reliable.
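The stimulus used in this simulation can be reconstructed from the description above. The Python sketch below builds the three patterns of 3.2, checks their mutual overlap, forms the composite input with one bit missing per pattern (units 2, 8, and 14, as read off the input vector in the caption of Figure 3), and evaluates the internal input $S_i$ for a given weight matrix. The Hebbian weights themselves are not constructed, and the helper names are hypothetical.

```python
N = 50  # number of oscillators in the simulation

def pattern(active_units):
    """Binary N-bit word with 1s at the given 1-based unit positions."""
    return [1 if i in active_units else 0 for i in range(1, N + 1)]

xi1 = pattern(set(range(1, 8)) | {19})    # units 1-7 and 19
xi2 = pattern(set(range(7, 14)) | {19})   # units 7-13 and 19
xi3 = pattern({1} | set(range(13, 20)))   # units 1 and 13-19

def overlap(a, b):
    """Fraction of the active units of a that are also active in b."""
    return sum(p * q for p, q in zip(a, b)) / sum(a)

# Superposition of the three patterns with one bit removed from each.
missing = {2, 8, 14}
composite = [max(bits) for bits in zip(xi1, xi2, xi3)]
external_input = [0.2 * c if i not in missing else 0.0
                  for i, c in enumerate(composite, start=1)]

def internal_input(W, x, i):
    """S_i = sum over k != i of W[i][k] * x[k] (0-based indices here)."""
    return sum(W[i][k] * x[k] for k in range(len(x)) if k != i)
```

Each pair of patterns shares two of its eight active units, which is the 25% overlap noted in the text, and unit 19 is the only unit common to all three.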
Figure 3: Simulation of an associative memory of 50 oscillators. Eight patterns have been stored in the memory and three of them, $\xi^1$, $\xi^2$, $\xi^3$ (3.2), are presented in this simulation simultaneously with one bit missing in each pattern. Only the output of the first 19 oscillators is shown. The others stay silent due to lack of input. The vertical dashed lines identify three consecutive time intervals with exactly one pattern active in each interval. From the result we see that at any time instant only one pattern is dominant while in the long run all patterns have an equal chance to be recalled due to switching among the patterns. The parameter values differing from Figure 2 are $T_{yy} = 1.0$, $\alpha = 0.17$, $\beta = 0.1$. We added uncorrelated white noise of amplitude 0.003 to the input to the excitatory groups. Initial value: $x = 0.2\,(1, \ldots, 1)$, $y = (0, \ldots, 0)$. Input: $I = 0.2\,(1,0,1,1,1,1,1,0,1,1,1,1,1,0,1,1,1,1,1,0,\ldots,0)$.
Figure 4: Correlation matrix between the first 19 oscillators (cf. Fig. 3). Filled and open circles stand for positive and negative correlations, respectively. The diameter of each circle is proportional to the absolute value of the correlation.
For conceptual reasons, only a limited number of states can be represented in response to a static input. A superposition of too many (more than perhaps 10) input states leads to ambiguity and the system responds with an irregular oscillation pattern. The exact number of entities that can be represented simultaneously depends on details of implementation, but a reasonable estimate seems to be the seven plus or minus two, often cited in the psychophysical literature as the number of objects that can be held in the human attention span.
4 Discussion
The point of this paper is the demonstration of a concept that allows us to compute and represent syntactical structure in a version of associative memory. Whereas in the attractor neural network view a valid state of short-term memory is a static activity distribution, we argue for a data structure based on the history of fluctuating neural signals observed over a brief time span (the time span often called "psychological moment") (Pöppel and Logothetis 1986). There is ample evidence for the existence of temporal signal structure in the brain on the relevant time scale (10-50 msec). Collective oscillations are of special relevance for our study here. They have been observed as local field potentials in several cortices (Gray et al. 1989; Eckhorn et al. 1988; Freeman 1978). The way we have modeled temporal signal structure, as bursts of collective oscillations, is just one possibility of many. Among the alternatives are continuous oscillations, which differ in phase or frequency between substates, and stochastic signal structure. Is the model biologically relevant? Several reasons speak for its application to sensory segmentation in olfaction. A major difficulty in applying associative memory, whether in our version or the standard one, is its inability to deal with perceptual invariances (e.g., visual position invariance). This is due to the fact that the natural topology of associative memory is the Hamming distance, and not any structurally invariant relationship. In olfaction, Hamming distance seems to be the natural topology, and for this reason associative memory has been applied to this modality before (Li and Hopfield 1989; Baird 1986; Freeman et al. 1988; Haberly and Bower 1989). Furthermore, in the simple model for segmentation we have presented here, this ability relies completely on previous knowledge of possible input patterns. 
In most sensory modalities general structure of the perceptual space plays an additional important role for segmentation, except in olfaction, as far as we know. Finally, due to a tradition probably started by Walter Freeman, temporal signal structure has been well studied experimentally (Freeman 1978; Haberly and Bower 1989), and has been modeled with the help of nonlinear differential equations (Baird 1986; Freeman et al. 1988; Haberly and Bower 1989). There are also solid psychophysical data on pattern segmentation in olfaction (Laing et al. 1984; Laing and Frances 1989). It is widely recognized that any new mixture of odors is perceived as a unit; but if components of a complex (approximately balanced) odor mixture are known in advance, they can be discriminated, in agreement with the model presented here. When one of the two odors dominates the other in a binary mixture, only the stronger of the two is perceived (Laing et al. 1984), a behavior we also observed in our model. How can associative memory, of the conventional kind or ours, be identified in the anatomy (Shepherd 1979; Luskin and Price 1983) of the
olfactory system of mammals? In piriform cortex, pyramidal cells on the one hand and inhibitory interneurons on the other would be natural candidates for forming our excitatory and inhibitory groups of cells. They would be coupled by associative fibers within piriform cortex. Signals in stimulated olfactory cortex are oscillatory in nature (in a frequency range of 40-60 Hz) (Freeman 1978) and therefore lend themselves to this interpretation. On the other hand, also the olfactory bulb has appropriately connected populations of excitatory (mitral cells) and inhibitory (granule cells) neurons, which also undergo oscillations in the same frequency range and possibly in phase with cortical oscillations. The two populations are coupled by the lateral and medial olfactory tract in a diffuse, nontopographically ordered way. Thus a more involved implementation of associative memory in the coupled olfactory bulb-piriform cortex system is also conceivable. Our model makes the following theoretical prediction. If the animal is stimulated with a mixture of a few odors known to the animal, then it should be possible to decompose local field potentials from piriform cortex into several coherent components with zero or negative mutual correlation.
Acknowledgments
This work was supported by the Air Force Office of Scientific Research (88-0274). J. B. was a recipient of a NATO Postdoctoral Fellowship (DAAD 300/402/513/9). D. L. W. acknowledges support from an NIH grant (1R01 NS 24926, M.A. Arbib, PI).
References
Baird, B. 1986. Nonlinear dynamics of pattern formation and pattern recognition in the rabbit olfactory bulb. Physica D 22, 150-175.
Buhmann, J. 1989. Oscillations and low firing rates in associative memory neural networks. Phys. Rev. A 40, 4145-4148.
Damasio, A. R. 1989. The brain binds entities and events by multiregional activation from convergence zones. Neural Comp. 1, 123-132.
Eckhorn, R., Bauer, R., Jordan, W., Brosch, M., Kruse, W., Munk, M., and Reitboeck, H. J. 1988. Coherent oscillations: A mechanism of feature linking in the visual cortex? Biol. Cybernet. 60, 121-130.
Freeman, W. J. 1978. Spatial properties of an EEG event in the olfactory bulb and cortex. Electroencephalogr. Clin. Neurophysiol. 44, 586-605.
Freeman, W. J., Yao, Y., and Burke, B. 1988. Central pattern generating and recognizing in olfactory bulb: A correlation learning rule. Neural Networks 1, 277-288.
Gray, C. M., König, P., Engel, A. K., and Singer, W. 1989. Oscillatory responses in cat visual cortex exhibit intercolumnar synchronization which reflects global stimulus properties. Nature (London) 338, 334-337.
Haberly, L. B. and Bower, J. M. 1989. Olfactory cortex: Model circuit for study of associative memory? Trends Neural Sci. 12, 258-264.
Hopfield, J. J. 1982. Neural networks and physical systems with emergent collective computational abilities. Proc. Natl. Acad. Sci. U.S.A. 79, 2554-2558.
Laing, D. G. and Frances, G. W. 1989. The capacity of humans to identify odors in mixtures. Physiol. Behav. 46, 809-814.
Laing, D. G., Panhuber, H., Willcox, M. E., and Pittman, E. A. 1984. Quality and intensity of binary odor mixtures. Physiol. Behav. 33, 309-319.
Li, Z. and Hopfield, J. J. 1989. Modeling the olfactory bulb and its neural oscillatory processings. Biol. Cybernet. 61, 379-392.
Luskin, M. B. and Price, J. L. 1983. The topographical organization of associational fibres of the olfactory system in the rat, including centrifugal fibres to the olfactory bulb. J. Comp. Neurol. 216, 264-291.
Pöppel, E., and Logothetis, N. 1986. Neuronal oscillations in the human brain. Naturwissenschaften 73, 267-268.
Schneider, W. 1986. Anwendung der Korrelationstheorie der Hirnfunktion auf das akustische Figur-Hintergrund-Problem (Cocktailparty-Effekt). Dissertation, University of Göttingen.
Shepherd, G. M. 1979. The Synaptic Organization of the Brain. Oxford University Press, New York.
Steinbuch, K. 1961. Die Lernmatrix. Kybernetik 1, 36-45.
Strong, G. W. and Whitehead, B. A. 1989. A solution to the tag-assignment problem for neural networks. Behav. Brain Sci. 12, 381-433.
von der Malsburg, C. 1981. The Correlation Theory of Brain Function. Internal Report 81-2, Abteilung für Neurobiologie, MPI für Biophysikalische Chemie, Göttingen.
von der Malsburg, C. 1983. How are nervous structures organized? In Synergetics of the Brain. Proceedings of the International Symposium on Synergetics, May 1983, E. Başar, H. Flohr, H. Haken, and A. J. Mandell, eds. Springer, Berlin, Heidelberg, pp. 238-249.
von der Malsburg, C. 1987. Synaptic plasticity as basis of brain organization. In The Neural and Molecular Bases of Learning, Dahlem Konferenzen, J.-P. Changeux, and M. Konishi, eds. Wiley, Chichester, pp. 411-431.
von der Malsburg, C. and Schneider, W. 1986. A neural cocktail-party processor. Biol. Cybernet. 54, 29-40.
Willshaw, D. J., Buneman, O. P., and Longuet-Higgins, H. C. 1969. Non-holographic associative memory. Nature (London) 222, 960-962.
Received 9 August 1989; accepted 9 January 1990.
Communicated by David S. Touretzky
A Neural Net Associative Memory for Real-Time Applications Gregory L. Heileman Department of Computer Engineering, University of Central Florida, Orlando, FL 32816 USA
George M. Papadourakis Department of Computer Science, University of Crete, Iraklion, Crete, Greece
Michael Georgiopoulos Department of Electrical Engineering, University of Central Florida, Orlando, FL 32816 USA
A parallel hardware implementation of the associative memory neural network introduced by Hopfield is described. The design utilizes the Geometric Arithmetic Parallel Processor (GAPP), a commercially available single-chip VLSI general-purpose array processor consisting of 72 processing elements. The ability to cascade these chips allows large arrays of processors to be easily constructed and used to implement the Hopfield network. The memory requirements and processing times of such arrays are analyzed based on the number of nodes in the network and the number of exemplar patterns. Compared with other digital implementations, this design yields significant improvements in runtime performance and offers the capability of using large neural network associative memories in real-time applications.
Neural Computation 2, 107-115 (1990) © 1990 Massachusetts Institute of Technology
1 Introduction
Data stored in an associative memory are accessed by their contents. This is in contrast to random-access memory (RAM) in which data items are accessed according to their address. The ability to retrieve data by association is a very powerful technique required in many high-volume information processing applications. For example, associative memory has been used to perform real-time radar tracking in an antiballistic missile environment. Associative memories have also been proposed for use in database applications, image processing, and computer vision. A major advantage that associative memory offers over RAM is the capability of rapidly retrieving data through the use of parallel search and comparison operations; however, this is achieved at some cost. The ability to search the contents
of a traditional associative memory in a fully parallel fashion requires the use of a substantial amount of hardware for control logic. Until recently, the high cost of implementing associative processors has mainly limited their use to special purpose military applications (Hwang and Briggs 1984). However, advances in VLSI technology have improved the feasibility of associative memory systems. The Hopfield neural network has demonstrated its potential as an associative memory (Hopfield 1982). The error correction capabilities of this network are quite powerful in that it is able to retrieve patterns from memory using noisy or partially complete input patterns. Komlós and Paturi (1988), among others, have recently performed an extensive analysis of this behavior as well as the convergence properties and memory capacity of the Hopfield network. Due to the massive number of nodes and interconnections in large neural networks, real-time systems will require computational facilities capable of exploiting the inherent parallelism of neural network models. Two approaches to the parallel hardware implementation of neural networks have been utilized. The first involves the development of special-purpose hardware designed to specifically implement neural network models or certain classes of neural network models (Alspector et al. 1989; Kung and Hwang 1988). Although this approach has been shown to yield tremendous speedups when compared to sequential implementations, the specialized design limits the use of such computers to neural network applications and consequently limits their commercial availability. This is in contrast to the second approach to parallel hardware implementation, general-purpose parallel computers, which are designed to execute a variety of different applications. The fact that these computers are viable for solving a wide range of problems tends to increase their availability while decreasing their cost. 
In this paper a direct, parallel, digital implementation of a Hopfield associative memory neural network is presented. The design utilizes the first general-purpose commercially produced array processor chip, the Geometric Arithmetic Parallel Processor (GAPP) developed by the NCR Corporation in conjunction with Martin Marietta Aerospace. Using these low-cost VLSI components, it is possible to build arbitrarily sized Hopfield networks with the capability of operating in real-time. 2 The GAPP Architecture
The GAPP chip is an inexpensive two-dimensional VLSI array processor that has been utilized in such applications as pattern recognition, image processing, and database management. Current versions of the GAPP operate at a 10-MHz clock cycle; however, future versions will utilize a 20-MHz clock cycle (Brown and Tomassi 1989). A single GAPP chip contains a mesh-connected 6 by 12 arrangement of processing elements
(PEs). Each PE contains a bit-serial ALU, 128 x 1 bits of RAM, and 4 single-bit latches, and is able to communicate with each of its four neighbors. GAPP chips can be cascaded to implement arbitrarily sized arrays of PEs (in multiples of 6 x 12). This capability can be used to eliminate bandwidth limitations inherent in von Neumann machines. For example, a 48 x 48 PE array (32 GAPP chips) can read a 48-bit-wide word every 100 nsec, yielding an effective array bandwidth of 480 Mbits/sec (Davis and Thomas 1988; NCR Corp. 1984). Information can be shifted into the GAPP chip from any edge. Therefore, the ability to shift external data into large GAPP arrays is limited only by the number of data bus lines available from the host processor. For example, Martin Marietta Aerospace is currently utilizing a 126,720 PE array (1760 GAPP chips) in image processing applications. This system is connected to a Motorola MC68020 host system via a standard 32-bit Multibus (Brown and Tomassi 1989).
3 The Hopfield Neural Network
The Hopfield neural network implemented here utilizes binary input patterns; example inputs are black and white images (where the input elements are pixel values) or ASCII text (where the input patterns are bits in the 8-bit ASCII representation). This network is capable of recalling one of $M$ exemplar patterns when presented with an unknown $N$-element binary input pattern. Typically, the unknown input pattern is one of the $M$ exemplar patterns corrupted with noise (Lippmann 1987). The recollection process, presented in Figure 1, can be separated into two distinct phases. In the initialization phase, the $M$ exemplar patterns are used to establish the $N^2$ deterministic connection weights, $t_{ij}$. In the search phase, an unknown $N$-element input pattern is presented to the $N$ nodes of the network. The node values are then multiplied by the connection weights to produce the new node values. These node values are then considered as the new input and altered again. This process continues to iterate until the input pattern converges.
4 Hopfield Network Implementation on the GAPP
Our design maps each node in the Hopfield network to a single PE on GAPP chips. Thus, an additional GAPP chip must be incorporated into the design for every 72 nodes in the Hopfield network. The ease with which these chips are cascaded allows such an approach to be used. When implementing the Hopfield network, the assumption is made that all M exemplar patterns are known a priori. Therefore, the initialization phase of the recollection process is performed off-line on the host computer. The resulting connection weights are downloaded, in signed magnitude format, to the PEs’ local memory as bit planes. The local
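Let
  $M$ = number of exemplar patterns
  $N$ = number of elements in each exemplar pattern
  $x_i^s$ = element $i$ of exemplar for pattern $s$ = $\pm 1$
  $y_i$ = element $i$ of unknown input pattern = $\pm 1$
  $u_i(k)$ = output of node $i$ after $k$ iterations
  $t_{ij}$ = interconnection weight from node $i$ to node $j$
Initialization:
  $t_{ij} = \sum_{s=1}^{M} x_i^s x_j^s$ if $i \neq j$, and $t_{ij} = 0$ if $i = j$, for $1 \le i, j \le N$
  $u_i(0) = y_i$, $1 \le i \le N$
Search:
  $u_j(k+1) = f_h\left(\sum_{i=1}^{N} t_{ij} u_i(k)\right)$, $1 \le j \le N$
  where $f_h$ is the symmetric hard-limiting function
  iterate until $u_j(k+1) = u_j(k)$, $1 \le j \le N$
Figure 1: The recollection process in a Hopfield neural network.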
memory of the PEs is used to store the operands of the sum of products operations required in the search phase. The memory organization of a PE (node $j$) is illustrated in Figure 2. For practical applications, the GAPP memory is insufficient for storing all weights concurrently, thus segmentation is required. The Hopfield network is implemented in parallel with each PE performing $N$ multiplications and $(N - 1)$ additions per iteration. However, in practice no actual multiplications need occur since the node values are either +1 or -1. Therefore, multiplications are implemented by performing an exclusive-OR operation on the node bit plane and the sign bit plane of the weights. The result replaces the weights' sign bit plane. These results are then summed and stored in the GAPP memory. The sign bit plane of the summations represents the new node values.
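The recollection process and the exclusive-OR shortcut described above can be sketched in a few lines of NumPy. This is only an illustrative software model, not the GAL code run on the GAPP: the function names (`train_weights`, `xor_products`, `recall`) and the two 8-element exemplars are hypothetical, and all nodes are updated synchronously, as in the bit-plane scheme.

```python
import numpy as np

def train_weights(exemplars):
    """Initialization phase: t_ij = sum over s of x_i^s * x_j^s, with t_ii = 0."""
    x = np.asarray(exemplars, dtype=np.int64)    # shape (M, N), entries +1 or -1
    t = x.T @ x
    np.fill_diagonal(t, 0)
    return t

def xor_products(t, u):
    """Form the products t_ij * u_i from sign bits alone, without multiplying."""
    weight_sign = t < 0                  # sign bit plane of the weights (True = negative)
    node_bit = (u < 0)[:, None]          # node bit plane (True = -1), one row per node i
    product_sign = weight_sign ^ node_bit
    magnitude = np.abs(t)
    return np.where(product_sign, -magnitude, magnitude)

def recall(t, y, max_iters=100):
    """Search phase: iterate u_j <- sign(sum_i t_ij u_i) until nothing changes."""
    u = np.asarray(y, dtype=np.int64).copy()
    for _ in range(max_iters):
        sums = xor_products(t, u).sum(axis=0)    # one summation per node j
        new_u = np.where(sums < 0, -1, 1)        # sign bit of each summation
        if np.array_equal(new_u, u):             # stands in for the global-OR test
            break
        u = new_u
    return u

# Hypothetical example: two orthogonal exemplars, one corrupted element.
a = np.array([1, 1, 1, 1, -1, -1, -1, -1])
b = np.array([1, -1, 1, -1, 1, -1, 1, -1])
t = train_weights([a, b])
noisy = a.copy()
noisy[0] = -noisy[0]
recovered = recall(t, noisy)
```

In this small case the corrupted element is repaired in a single pass, and a second pass confirms that the pattern no longer changes.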
After an iteration has been completed, the input pattern is tested for convergence utilizing the global OR function of the GAPP chips. If the result of the global OR is 1, another iteration is required; thus, it is necessary to transfer the new node values (i.e., the sign bit of the summation) to the host machine. These node values are then downloaded, along with the connection weights, to the GAPP chips in the manner described previously and another iteration is performed. 5 Memory Requirements and Processing Time
The number of bits required to store each weight value and the summation in the search phase are $w = \lceil \log_2(M+1) \rceil + 1$ and $p = \lceil \log_2(NM+1) \rceil + 1$, respectively, where $N$ is the number of nodes in the network and $M$ represents the number of exemplar patterns. Therefore, each PE in the GAPP array has a total memory requirement of $N(w + 1) + p$ bits (see Fig. 2).
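Figure 2: Organization of a single PE's memory in the Hopfield neural network implementation on the GAPP. Each of the rows $(t_{1j}, u_1), (t_{2j}, u_2), \ldots, (t_{kj}, u_k)$ holds a weight ($w$ bits) and a node value (1 bit); the final row holds the summation $\sum_i t_{ij} u_i$ ($p$ bits).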
If we let $B$ denote the size of a single PE's memory, then each PE has $(B - p)$ bits available for storing weight and node values. If $N(w + 1) > B - p$, there is not enough GAPP memory to store all of the weights at one time, and weights must be shifted into the GAPP memory in segments. The number of weights in each of these segments is given by
$D = \lfloor (B - p)/(w + 1) \rfloor$
while the total number of segments is given by
$S = \lceil N/D \rceil$
Letting $C$ represent the number of clock cycles needed to shift a bit plane into GAPP memory, the number of clock cycles required to download weight and node values to GAPP memory, and to upload new node values to the host, is
$L = [SD(w + 1) + 2]C - 1$
Furthermore, $C$ depends on the number of data bus lines available from the host and the number of GAPP chips, $n$. In particular, $C$ can be expressed as
$C = 12 \lceil 6n / (\text{\# data lines}) \rceil + 1$
The processing time required to implement the search phase of the Hopfield network on the GAPP chips is formulated below. The implementation involves four separate steps. First, the $D$ weights stored in GAPP memory are multiplied by the appropriate node values. As discussed previously, this is performed using an exclusive-OR operation; such an operation requires $3D$ GAPP clock cycles. The second step involves converting the modified weight values into two's complement format; this processing requires $D(4w - 1)$ clock cycles. Next, the $D$ summations required by the search phase are implemented; this can be accomplished in $3Dp$ clock cycles. Finally, 4 clock cycles are required to test for input convergence. The total processing time can now be expressed as
$P = S[3D + D(4w - 1) + 3Dp + 4]$ clock cycles
and the total time required to perform a single iteration of the search phase of the Hopfield network is
$T = L + P = SD[C(w + 1) + 3(p + 1) + (4w - 1)] + 4S + 2C - 1$ clock cycles
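As a worked check of these expressions, the sketch below evaluates them for a 360-node network. Several inputs are assumptions rather than values stated at this point in the text: a PE memory of $B = 128$ bits, 72 PEs per chip, a 32-line bus, $M = \lfloor 0.15N \rfloor$ exemplars, and the readings $D = \lfloor (B - p)/(w + 1) \rfloor$, $S = \lceil N/D \rceil$, and $C = 12\lceil 6n/\text{lines} \rceil + 1$. The function names are hypothetical.

```python
def ceil_div(a, b):
    """Integer ceiling of a / b for positive integers."""
    return -(-a // b)

def ceil_log2(x):
    """ceil(log2(x)) for a positive integer x, without floating point."""
    return (x - 1).bit_length()

def gapp_iteration_cycles(N, M, B=128, data_lines=32, pes_per_chip=72):
    """Clock cycles for one search-phase iteration under the assumptions above."""
    w = ceil_log2(M + 1) + 1             # bits per weight, including sign
    p = ceil_log2(N * M + 1) + 1         # bits for the summation, including sign
    D = (B - p) // (w + 1)               # weights per segment (assumed form)
    S = ceil_div(N, D)                   # number of segments (assumed form)
    n = ceil_div(N, pes_per_chip)        # GAPP chips needed, one PE per node
    C = 12 * ceil_div(6 * n, data_lines) + 1   # cycles per bit plane (assumed form)
    L = (S * D * (w + 1) + 2) * C - 1    # transfer cycles
    P = S * (3 * D + D * (4 * w - 1) + 3 * D * p + 4)   # processing cycles
    return {"w": w, "p": p, "D": D, "S": S, "C": C, "L": L, "P": P, "T": L + P}

N = 360
result = gapp_iteration_cycles(N, M=(15 * N) // 100)
milliseconds = result["T"] / 10e6 * 1e3   # 10-MHz clock
```

Under these assumptions the total comes to 66,377 cycles, or about 6.6 msec at 10 MHz.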
6 Comparisons and Experimental Results
A comparison of the results obtained in the previous section with other digital implementations of the Hopfield network (Na and Glinski 1988) is illustrated in Figure 3. The curve for the DEC PDP-11/70 can be considered a close approximation for the number of clock cycles required by other sequential processing (von Neumann) architectures. Also, the curve for the GAPP PEs assumes the use of a standard 32-bit bus. All of the curves in the figure are plotted with the assumption that $M = \lfloor 0.15N \rfloor$. As more nodes are added, the number of clock cycles required to process the data on the PDP-11/70 and Graph Search Machine (GSM) increases much more rapidly than it does on the GAPP PEs; this can be attributed to the high degree of fine-grained parallelism employed by the GAPP processors when executing the Hopfield algorithm. For example, when implementing a 360 node network, this design requires 7 msec to perform a single iteration. Extrapolation of the curves in Figure 3 also indicates that for large networks, the ability to implement the network in parallel will easily outstrip any gains achieved by using a faster clock cycle on a sequential processing computer. For example, executing Hopfield networks on the order of 100,000 nodes yields an approximate 132-fold speedup over a sequential implementation. Therefore, a sequential computer with a clock frequency twice as fast as that of the GAPP will still be 66 times slower than the Hopfield network implementation on GAPP processors. In terms of connections per second (CPS), the 126,720 PE GAPP array discussed earlier can deliver approximately 19 million CPS while running at 10 MHz. The same array running at 20 MHz would yield nearly 38 million CPS, where CPS is defined as the number of multiply-and-accumulate operations that can be performed in a second.
In this case, the CPS is determined by dividing the total number of connections by the time required to perform a single iteration of the Hopfield algorithm (the time required to shift in weight values from the host, and the time required to perform the symmetric hard limiting function, $f_h$, are also included). These results compare favorably to other more costly general-purpose parallel processing computers such as a Connection Machine, CM-2, with 64 thousand processors (13 million CPS), a 10-processor WARP systolic array (17 million CPS), and a 64-processor Butterfly computer (8 million CPS). It should be noted, however, that the CPS measure is dependent on the neural network algorithm being executed. Therefore, in terms of comparison, these figures should be considered only as rough estimates of performance (DARPA study 1988). To verify the implementation of the Hopfield network presented in Section 4, and the analysis presented in Section 5, a 12 x 10 node Hopfield network was successfully implemented on a GAPP PC development system using the GAL (GAPP algorithm language) compiler. The exemplar patterns chosen were those used by Lippmann et al. (1987) in their
character recognition experiments. The implementation of these experiments in fact corroborated the predicted results.
Figure 3: Number of clock cycles required to implement a single iteration of the Hopfield network (search phase) on a PDP-11/70, the Graph Search Machine and GAPP processors. Because of the explosive growth rates of the PDP-11/70 and GSM curves, this graph displays GAPP results for only a relatively small number of nodes. However, the analysis presented here is valid for arbitrarily large networks.
Acknowledgments This research was supported by a grant from the Division of Sponsored Research at the University of Central Florida.
References
Alspector, J., Gupta, B., and Allen, R. B. 1989. Performance of a stochastic learning microchip. In Advances in Neural Information Processing Systems 1, D. S. Touretzky, ed. Morgan Kaufmann, San Mateo, CA.
Brown, J. R. and Tommasi, M. 1989. Martin Marietta Electronic Systems, Orlando, FL. Personal communication.
DARPA neural network study. 1988. B. Widrow, Study Director. AFCEA International Press.
Davis, R. and Thomas, D. 1988. Systolic array chip matches the pace of high-speed processing. Electronic Design, October.
Hopfield, J. J. 1982. Neural networks and physical systems with emergent collective computational abilities. Proc. Natl. Acad. Sci. U.S.A. 79, 2554-2558.
Hwang, K. and Briggs, F. 1984. Computer Architecture and Parallel Processing. McGraw-Hill, New York.
Komlos, J. and Paturi, R. 1988. Convergence results in an associative memory model. Neural Networks 1, 239-250.
Kung, S. Y. and Hwang, J. N. 1988. Parallel architectures for artificial neural nets. In Proceedings of the IEEE International Conference on Neural Networks, Vol. II, San Diego, CA, pp. 165-172.
Lippmann, R. P. 1987. An introduction to computing with neural nets. IEEE Acoustics Speech Signal Proc. Mag. 4(2), 4-22.
Lippmann, R. P., Gold, B., and Malpass, M. L. 1987. A Comparison of Hamming and Hopfield Neural Nets for Pattern Classification. Tech. Rep. 769, M.I.T., Lincoln Laboratory, Lexington, MA.
Na, H. and Glinski, S. 1988. Neural net based pattern recognition on the graph search machine. In Proceedings of the IEEE International Conference on Acoustics Speech and Signal Processing, New York.
NCR Corp., Dayton, Ohio. 1984. Geometric arithmetic parallel processor (GAPP) data sheet.
Received 5 June 1989; accepted 2 October 1989.
Communicated by John Moody
Gram-Schmidt Neural Nets Sophocles J. Orfanidis Department of Electrical and Computer Engineering, Rutgers University, Piscataway, NJ 08855 USA
A new type of feedforward multilayer neural net is proposed that exhibits fast convergence properties. It is defined by inserting a fast adaptive Gram-Schmidt preprocessor at each layer, followed by a conventional linear combiner-sigmoid part which is adapted by a fast version of the backpropagation rule. The resulting network structure is the multilayer generalization of the gradient adaptive lattice filter and the Gram-Schmidt adaptive array. 1 Introduction
In signal processing language, a feedforward multilayer neural net adapted by the backpropagation rule (Rumelhart et al. 1986; Werbos 1988; Parker 1987) is the multilayer generalization of the adaptive linear combiner adapted by the Widrow-Hoff LMS algorithm (Widrow and Stearns 1985). The backpropagation rule inherits the computational simplicity of the LMS algorithm. But, like the latter, it often exhibits slow speed of convergence. The convergence properties of the LMS algorithm are well known (Widrow and Stearns 1985). Its learning speed depends on the correlations that exist among the components of the input vectors: the stronger the correlations, the slower the speed. This can be understood intuitively by noting that, if the inputs are strongly correlated, the combiner has to linearly combine a lot of redundant information, and thus will be slow in learning the statistics of the input data. On the other hand, if the input vector is decorrelated by a preprocessor prior to the linear combiner, the combiner will linearly combine only the nonredundant part of the same information, thus adapting faster to the input. Such preprocessor realizations of the adaptive linear combiner lead naturally to the fast Gram-Schmidt preprocessors of adaptive antenna systems and to the adaptive lattice filters of time-series problems (Widrow and Stearns 1985; Monzingo and Miller 1980; Compton 1988; Orfanidis 1988). In this paper, we consider the generalization of such preprocessor structures to multilayer neural nets and discuss their convergence properties. The proposed network structure is defined by inserting, at each layer of the net, a Gram-Schmidt preprocessor followed by the conventional linear combiner and sigmoid parts. The purpose of each preprocessor is to decorrelate its inputs and provide decorrelated inputs to the linear combiner that follows. Each preprocessor is itself realized by a linear transformation, but of a special kind, namely, a unit lower triangular matrix. The weights of the preprocessors are adapted locally at each layer, but the weights of the linear combiners must be adapted by the backpropagation rule. We discuss a variety of adaptation schemes for the weights, both LMS-like versions and fast versions. The latter are, in some sense, implementations of Newton-type methods for minimizing the performance index of the network. Newton methods for neural nets have been considered previously (Sutton 1986; Dahl 1987; Watrous 1987; Kollias and Anastassiou 1988; Hush and Salas 1988; Jacobs 1988). These methods do not change the structure of the network, only the way the weights are adapted. They operate on the correlated signals at each layer, whereas the proposed methods operate on the decorrelated ones.
Neural Computation 2, 116-126 (1990) © 1990 Massachusetts Institute of Technology
2 Gram-Schmidt Preprocessors
In this section, we summarize the properties of Gram-Schmidt preprocessors for adaptive linear combiners. Our discussion is based on Orfanidis (1988). The correlations among the components of an $(M+1)$-dimensional input vector $x = [x_0, x_1, \ldots, x_M]^T$ are contained in its covariance matrix $R = E[xx^T]$, where $E[\,]$ denotes expectation and the superscript $T$ transposition. The Gram-Schmidt orthogonalization procedure generates a new basis $z = [z_0, z_1, \ldots, z_M]^T$ with mutually uncorrelated components, that is, $E[z_i z_j] = 0$ for $i \neq j$. It is defined by starting at $z_0 = x_0$ and proceeding recursively for $i = 1, 2, \ldots, M$
$z_i = x_i - \sum_{j=0}^{i-1} b_{ij} z_j$    (2.1)
where the coefficients $b_{ij}$ are determined by the requirement that $z_i$ be decorrelated from all the previous $z$s $\{z_0, z_1, \ldots, z_{i-1}\}$. These coefficients define a unit lower triangular matrix $B$ such that
$x = Bz$    (2.2)
known as the innovations representation of $x$. For example, if $M = 3$,
$\begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ b_{10} & 1 & 0 & 0 \\ b_{20} & b_{21} & 1 & 0 \\ b_{30} & b_{31} & b_{32} & 1 \end{bmatrix} \begin{bmatrix} z_0 \\ z_1 \\ z_2 \\ z_3 \end{bmatrix}$
Figure 1: (a) Gram-Schmidt preprocessor. (b) Elementary building block.
Equation 2.1 is shown in Figure 1. It represents the solution of the lower triangular linear system 2.2 by forward substitution. The covariance matrix of $z$ is diagonal,
$\mathcal{D} = E[zz^T] = \mathrm{diag}\{\mathcal{E}_0, \mathcal{E}_1, \ldots, \mathcal{E}_M\}$
where $\mathcal{E}_i = E[z_i^2]$. It is related to $R$ by $R = B\mathcal{D}B^T$, which is recognized as the Cholesky factorization of $R$. Thus, all the correlation properties of $x$ are contained in $B$, whereas the essential, nonredundant, information in $x$ is contained in the uncorrelated innovations vector $z$. For the adaptive implementations, it proves convenient to cast the Gram-Schmidt preprocessor as a prediction problem with a quadratic performance index, iteratively minimized using a gradient descent scheme. Indeed, an equivalent computation of the optimal weights $b_{ij}$ is based on the sequence of minimization problems
$\mathcal{E}_i = E[z_i^2] = \min$, $i = 1, 2, \ldots, M$    (2.3)
where, for each $i$, the minimization is with respect to the coefficients $b_{ij}$, $j = 0, 1, \ldots, i - 1$. Each $z_i$ may be thought of as the prediction error in predicting $x_i$ from the previous $z$s, or equivalently, the previous $x$s. The decorrelation conditions $E[z_i z_j] = 0$ are precisely the orthogonality conditions for the minimization problems (2.3). The gradient of the performance index $\mathcal{E}_i$ is $\partial\mathcal{E}_i/\partial b_{ij} = -2E[z_i z_j]$. Dropping the expectation value (and a factor of two), we obtain the LMS-like gradient-descent delta rule for updating $b_{ij}$
$\Delta b_{ij} = \mu z_i z_j$    (2.4)
where $\mu$ is a learning rate parameter. A faster version, obtained by applying Newton's method to the decorrelated basis, is as follows (Orfanidis 1988):
$\Delta b_{ij} = \frac{\beta}{E_j} z_i z_j$    (2.5)
where $\beta$ is usually set to 1 and $E_i$ is a time-average approximation to $\mathcal{E}_i = E[z_i^2]$ updated from one iteration to the next by
$E_i = \lambda E_i + z_i^2$    (2.6)
where $\lambda$ is a "forgetting" factor with a typical value of 0.9. Next, we consider the Gram-Schmidt formulation of the adaptive linear combiner. Its purpose is to generate an optimum estimate of a desired output vector $d$ by the linear combination $y = Wx$, by minimizing the mean square error
$\mathcal{E} = E[e^T e] = \min$    (2.7)
where $e = d - y$ is the estimation error. The output $y$ may also be computed in the decorrelated basis $z$ by
$y = Wx = Gz$    (2.8)
where $G$ is the combiner's weight matrix in the new basis, defined by
$WB = G \quad\Leftrightarrow\quad W = GB^{-1}$    (2.9)
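The relations 2.1, 2.2, 2.8, and 2.9, together with the factorization of $R$ into a unit lower triangular factor and a diagonal matrix, can be checked numerically. The following is a batch (non-adaptive) sketch only: it obtains $B$ from NumPy's Cholesky routine rather than from the adaptive rules, and the mixing matrix, sample size, and the name `forward_substitute` are illustrative assumptions.

```python
import numpy as np

def forward_substitute(B, X):
    """Apply equation 2.1 to every row of X: z_i = x_i - sum_{j<i} b_ij z_j."""
    Z = np.zeros_like(X)
    for i in range(X.shape[1]):
        Z[:, i] = X[:, i] - Z[:, :i] @ B[i, :i]
    return Z

rng = np.random.default_rng(0)
# Hypothetical mixing matrix used only to create correlated inputs.
mix = np.array([[1.0, 0.0, 0.0, 0.0],
                [0.8, 1.0, 0.0, 0.0],
                [0.5, 0.5, 1.0, 0.0],
                [0.3, -0.4, 0.6, 1.0]])
X = rng.normal(size=(5000, 4)) @ mix.T       # each row is one sample of x

R = X.T @ X / len(X)                         # sample estimate of R = E[x x^T]

# numpy returns R = C C^T with C lower triangular; rescaling the columns of C
# gives a unit lower triangular B and a diagonal D with R = B D B^T.
C = np.linalg.cholesky(R)
c = np.diag(C)
B = C / c                                    # column j divided by C[j, j]
D = np.diag(c ** 2)

Z = forward_substitute(B, X)                 # innovations, solving x = B z
Rz = Z.T @ Z / len(Z)                        # should be diagonal and equal to D

W = rng.normal(size=(2, 4))                  # arbitrary combiner weights
G = W @ B                                    # equation 2.9
```

Because $B$ is built from the same sample covariance, the sample covariance of the innovations is diagonal up to rounding, and the two forms of the combiner output in 2.8 agree sample by sample.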
The conventional LMS algorithm is obtained by considering the performance index (2.7) to be a function of the weight matrix $W$. In this case, the matrix elements of $W$ are adapted by
$\Delta w_{ij} = \mu e_i x_j$
Similarly, viewing the performance index as a function of $G$ and carrying out gradient descent with respect to $G$, we obtain the LMS algorithm for adapting the matrix elements of $G$
$\Delta g_{ij} = \mu e_i z_j$    (2.10)
A faster version is
$\Delta g_{ij} = \frac{\mu\beta}{E_j} e_i z_j$    (2.11)
with $E_j$ adapted by 2.6. Like 2.5, it is equivalent to applying Newton's method with respect to the decorrelated basis. Conceptually, the adaptation of $B$ has nothing to do with the adaptation of $G$, each being the solution to a different optimization problem. However, in practice, $B$ and $G$ are simultaneously adapted using equations 2.4 and 2.10, or their fast versions, equations 2.5 and 2.11. In 2.11, we used the scale factor $\mu\beta$ instead of $\beta$ to allow us greater flexibility when adapting both $b_{ij}$ and $g_{ij}$.
3 Gram-Schmidt Neural Nets
In this section, we incorporate the Gram-Schmidt preprocessor structures into multilayer neural nets and discuss various adaptation schemes. Consider a conventional multilayer net with $N$ layers and let $u^n$, $x^n$ denote the input and output vectors at the $n$th layer and $W^n$ the weight matrix connecting the $n$th and $(n + 1)$th layers, as shown in Figure 2. The overall input and output vectors are $x^0$, $x^N$. The operation of the network is described by the forward equations: For $n = 0, 1, \ldots, N - 1$
$u^{n+1} = W^n x^n$    (3.1)
$x^{n+1} = f(u^{n+1})$    (3.2)
where we denote $f(u) = [f(u_0), f(u_1), \ldots]^T$ if $u = [u_0, u_1, \ldots]^T$. The sigmoidal function is defined by
$f(u) = \frac{1}{1 + e^{-u}}$
Figure 2: (a) Conventional net. (b) Gram-Schmidt net.
The performance index of the network is
$\mathcal{E} = \frac{1}{2} \sum_{\text{patterns}} (d - x^N)^T (d - x^N)$    (3.3)
For each presentation of a desired input/output pattern $\{x^0, d\}$, the backpropagation rule (Rumelhart et al. 1986; Werbos 1988; Parker 1987) computes the gradients $e^n = -\partial\mathcal{E}/\partial u^n$ by starting at the output layer
$e^N = D^N (d - x^N)$    (3.4)
and proceeding backward to the hidden layers, for $n = N - 1, N - 2, \ldots, 1$
$e^n = D^n W^{nT} e^{n+1}$    (3.5)
where $D^n = \mathrm{diag}\{f'(u^n)\}$ is the diagonal matrix of derivatives of the sigmoidal function, and $f' = f(1 - f)$. The weights $W^n$ are adapted by
$\Delta w_{ij}^n = \mu e_i^{n+1} x_j^n$    (3.6)
or, by the "momentum" method (Rumelhart et al. 1986)
$\Delta w_{ij}^n = \alpha \Delta w_{ij}^n + \mu e_i^{n+1} x_j^n$    (3.7)
where $\alpha$ plays a role analogous to the forgetting factor $\lambda$ of the previous section. The proposed Gram-Schmidt network structure is defined by inserting an adaptive Gram-Schmidt preprocessor at each layer of the network. Let $z^n$ be the decorrelated outputs at the $n$th layer and $B^n$ the corresponding Gram-Schmidt matrix, such that by equation 2.2, $x^n = B^n z^n$, and let $G^n$ be the linear combiner matrix, as shown in Figure 2. It is related to $W^n$ by equation 2.9, $W^n B^n = G^n$, which implies that $W^n x^n = G^n z^n$. The forward equations 3.1-3.2 are replaced now by
$x^n = B^n z^n$    (3.8)
$u^{n+1} = G^n z^n$    (3.9)
$x^{n+1} = f(u^{n+1})$    (3.10)
where 3.8 is solved for $z^n$ by forward substitution, as in 2.1. Inserting $W^n = G^n (B^n)^{-1}$ in the backpropagation equation 3.5, we obtain $e^n = D^n W^{nT} e^{n+1} = D^n (B^{nT})^{-1} G^{nT} e^{n+1}$. To facilitate this computation, define the intermediate vector $f^n = (B^{nT})^{-1} G^{nT} e^{n+1}$ or, $B^{nT} f^n = G^{nT} e^{n+1}$. Then, equation 3.5 can be replaced by the pair
$B^{nT} f^n = G^{nT} e^{n+1}$    (3.11)
$e^n = D^n f^n$    (3.12)
where, noting that $B^{nT}$ is an upper triangular matrix, equation 3.11 may be solved efficiently for $f^n$ using backward substitution. Using 2.4, the adaptation equations for the b-weights are given by
$\Delta b_{ij}^n = \mu z_i^n z_j^n$    (3.13)
or, by the fast version based on 2.5
$\Delta b_{ij}^n = \frac{\beta}{E_j^n} z_i^n z_j^n$    (3.14)
with $E_j^n$ updated from one iteration to the next by
$E_j^n = \lambda E_j^n + (z_j^n)^2$    (3.15)
Similarly, the adaptation of the g-weights is given by
$\Delta g_{ij}^n = \mu e_i^{n+1} z_j^n$    (3.16)
and its faster version based on 2.11
$\Delta g_{ij}^n = \frac{\mu\beta}{E_j^n} e_i^{n+1} z_j^n$    (3.17)
Momentum updating may also be used, leading to the alternative adaptation
$\Delta g_{ij}^n = \alpha \Delta g_{ij}^n + \mu e_i^{n+1} z_j^n$    (3.18)
and its faster version
$\Delta g_{ij}^n = \alpha \Delta g_{ij}^n + \frac{\mu\beta}{E_j^n} e_i^{n+1} z_j^n$    (3.19)
The complete algorithm consists of the forward equations 3.8-3.10, the backward equations 3.4 and 3.11, 3.12, and the adaptation equations 3.13 and 3.16, or the faster versions, equations 3.14, 3.15, and 3.17.
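One possible reading of the complete algorithm, using the fast versions 3.14, 3.15, and 3.17 with pattern-by-pattern updating and no momentum, is sketched below. It is an illustration under stated assumptions rather than the author's implementation: the class and function names are hypothetical, $E_j^n$ is updated before it is used, $B^n$ starts as the identity, and a bias is handled by appending an always-on unit to $z^n$. No convergence behavior is claimed for it.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def forward_sub(B, x):
    """Solve x = B z for unit lower triangular B (equation 3.8)."""
    z = np.zeros(len(x))
    for i in range(len(x)):
        z[i] = x[i] - B[i, :i] @ z[:i]
    return z

def back_sub(B, rhs):
    """Solve B^T f = rhs, where B^T is unit upper triangular (equation 3.11)."""
    f = np.zeros(len(rhs))
    for i in reversed(range(len(rhs))):
        f[i] = rhs[i] - B[i + 1:, i] @ f[i + 1:]
    return f

class GramSchmidtNet:
    def __init__(self, sizes, seed=0, mu=0.25, beta=1.0, lam=0.85, delta=0.01):
        rng = np.random.default_rng(seed)
        self.mu, self.beta, self.lam = mu, beta, lam
        pairs = list(zip(sizes[:-1], sizes[1:]))
        self.B = [np.eye(m) for m, _ in pairs]               # assumed initial B = I
        self.G = [rng.uniform(-1.0, 1.0, size=(k, m + 1)) for m, k in pairs]
        self.E = [np.full(m + 1, delta) for m, _ in pairs]   # small nonzero start

    def forward(self, x0):
        xs, zs = [np.asarray(x0, dtype=float)], []
        for B, G in zip(self.B, self.G):
            z = np.append(forward_sub(B, xs[-1]), 1.0)       # 3.8 plus always-on unit
            zs.append(z)
            xs.append(sigmoid(G @ z))                        # 3.9 and 3.10
        return xs, zs

    def train_pattern(self, x0, d):
        d = np.asarray(d, dtype=float)
        xs, zs = self.forward(x0)
        out = xs[-1]
        e = out * (1.0 - out) * (d - out)                    # 3.4
        for n in reversed(range(len(self.G))):
            B, G, z = self.B[n], self.G[n], zs[n]
            m = B.shape[0]
            if n > 0:                                        # gradients for layer n
                f = back_sub(B, G[:, :m].T @ e)              # 3.11
                e_prev = xs[n] * (1.0 - xs[n]) * f           # 3.12
            self.E[n] = self.lam * self.E[n] + z ** 2        # 3.15
            G += (self.mu * self.beta / self.E[n]) * np.outer(e, z)          # 3.17
            B += np.tril((self.beta / self.E[n][:m]) * np.outer(z[:m], z[:m]), -1)  # 3.14
            if n > 0:
                e = e_prev
        return 0.5 * np.sum((d - out) ** 2)

# Hypothetical usage on a 3:3:2 net with the eight 3-bit parity patterns.
inputs = [[a, b, c] for a in (0, 1) for b in (0, 1) for c in (0, 1)]
targets = [[sum(x) % 2, 1 - sum(x) % 2] for x in inputs]
fresh = GramSchmidtNet([3, 3, 2])
net = GramSchmidtNet([3, 3, 2])
epoch_errors = [sum(net.train_pattern(x, d) for x, d in zip(inputs, targets))
                for _ in range(5)]
```

Only the strictly lower triangle of each $B^n$ is ever updated, so the preprocessors remain unit lower triangular throughout training.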
4 Simulation Results
In this section, we present some simulations illustrating the performance of the proposed network structures. Consider two network examples: the first is a 3:3:2 network consisting of three input units, two output units, and one hidden layer with three units, and the second is a 3:6:2 network that has six hidden units. We choose a set of eight input/output training patterns (one pattern per column) given by
inputs:   0 0 0 0 1 1 1 1
          0 0 1 1 0 0 1 1
          0 1 0 1 0 1 0 1
outputs:  0 1 1 0 1 0 0 1
          1 0 0 1 0 1 1 0
They correspond to the 3-input parity problem in which the first of the two outputs is simply the parity of the three inputs and the second output is the complement of the first output. Figure 3 shows the performance index (3.3) versus iteration number for the conventional and Gram-Schmidt nets, where each iteration represents one epoch, that is, the presentation of all eight patterns in sequence. In the first two graphs, the linear combiner weights were adapted on an epoch basis, that is, accumulating equation 3.6 in the conventional case or equation 3.17 in the Gram-Schmidt case, over all eight patterns and then updating. The last two graphs correspond to momentum or pattern updating, that is, using equations 3.7 and 3.19 on a pattern basis. The b-weights were adapted only on a pattern basis using the fast method, equations 3.14 and 3.15. The following values of the parameters were used: $\mu = 0.25$, $\beta = 1$, $\lambda = \alpha = 0.85$. The same values of $\mu$ and $\alpha$ were used in both the conventional and Gram-Schmidt cases. To avoid possible divisions by zero in equations 3.14 and 3.17, the quantities $E_j^n$ were initialized to some small nonzero value, $E_j^n = \delta$, typically $\delta = 0.01$. The algorithm is very insensitive to the value of $\delta$. Also,
bias terms in equation 3.9 were incorporated by extending the vector $z^n$ by an additional unit which was always on. It has been commonly observed in the neural network literature that there is strong dependence of the convergence times on the initial conditions. Therefore, we computed the average convergence times for the above examples based on 200 repetitions of the simulations with random initializations. The random initial weights were chosen using a uniform distribution in the range $[-1, 1]$. The convergence time was defined as the number of iterations for the performance index (3.3) to drop below a certain threshold value, here $\mathcal{E}_{\max} = 0.01$. The average convergence times are shown in Table 1. The speed advantage of the Gram-Schmidt method is evident.
Figure 3: Learning curves of conventional and Gram-Schmidt nets. Panels: 3:3:2, epoch updating; 3:6:2, epoch updating; 3:3:2, pattern updating; 3:6:2, pattern updating. Horizontal axis: iterations.
Table 1: Average Convergence Times
                3:3:2             3:6:2
                Epoch   Pattern   Epoch   Pattern
Conventional    2561    1038      1660    708
Gram-Schmidt    923     247       349     100
5 Discussion
Convergence proofs of the proposed algorithms are straightforward only in the partially adaptive cases, that is, adapting $B^n$ with fixed $G^n$ or adapting $G^n$ with fixed $B^n$. In the latter case, it is easily shown that 3.16, in conjunction with the backpropagation equations 3.11, 3.12, implements gradient descent with respect to the g-weights. When $B^n$ and $G^n$ are simultaneously adaptive, convergence proofs are not available, not even for the single-layer adaptive combiners that are widely used in signal processing applications. Although we have presented here only a small simulation example, we expect the benefits of the Gram-Schmidt method to carry over to larger neural net problems. The convergence rate of the LMS algorithm for an ordinary adaptive linear combiner is controlled by the eigenvalue spread, $\lambda_{\max}/\lambda_{\min}$, of the input covariance matrix $R = E[xx^T]$. The Gram-Schmidt preprocessors achieve faster speed of convergence
by adaptively decorrelating the inputs to the combiner and equalizing the eigenvalue spread, the relative speed advantage being roughly proportional to $\lambda_{\max}/\lambda_{\min}$. In many applications, such as adaptive array processing, as the problem size increases so does the eigenvalue spread, thus making the use of the Gram-Schmidt method more effective. We expect the same behavior to hold for larger neural network problems. A guideline whether the use of the Gram-Schmidt method is appropriate for any given neural net problem can be obtained by computing the eigenvalue spread of the covariance matrix of the input patterns $x^0$:
$R = \sum_{\text{patterns}} x^0 x^{0T}$
If the eigenvalue spread is large, the Gram-Schmidt method is expected to be effective. For our simulation example, it is easily determined that the corresponding eigenvalue spread is $\lambda_{\max}/\lambda_{\min} = 4$. The results of Table 1 are consistent with this speed-up factor.
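The quoted spread of 4 can be reproduced directly. The short check below assumes the eight 3-bit input patterns of the parity example, without the always-on bias unit, and forms $R$ as the sum of outer products over the patterns.

```python
import numpy as np
from itertools import product

# The eight 3-bit input patterns of the parity example, one per row.
patterns = np.array(list(product([0, 1], repeat=3)), dtype=float)

R = patterns.T @ patterns              # sum over patterns of x0 x0^T
eigvals = np.linalg.eigvalsh(R)        # eigenvalues in ascending order
spread = eigvals[-1] / eigvals[0]      # lambda_max / lambda_min
```

Here $R$ has 4 on the diagonal and 2 elsewhere, so its eigenvalues are 2, 2, and 8.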
References
Compton, R. T. 1988. Adaptive Antennas. Prentice-Hall, Englewood Cliffs, NJ.
Dahl, E. D. 1987. Accelerated learning using the generalized delta rule. Proc. IEEE First Int. Conf. Neural Networks, San Diego, p. II-523.
Hush, D. R. and Salas, J. M. 1988. Improving the learning rate of back-propagation with the gradient reuse algorithm. Proc. IEEE Int. Conf. Neural Networks, San Diego, p. I-441.
Jacobs, R. A. 1988. Increased rates of convergence through learning rate adaptation. Neural Networks 1, 295.
Kollias, S. and Anastassiou, D. 1988. Adaptive training of multilayer neural networks using a least squares estimation technique. Proc. IEEE Int. Conf. Neural Networks, San Diego, p. I-383.
Monzingo, R. A. and Miller, T. W. 1980. Introduction to Adaptive Arrays. Wiley, New York.
Orfanidis, S. J. 1988. Optimum Signal Processing, 2nd ed. McGraw-Hill, New York.
Parker, D. B. 1987. Optimal algorithms for adaptive networks: Second order back propagation, second order direct propagation, second order Hebbian learning. Proc. IEEE First Int. Conf. Neural Networks, San Diego, p. II-593, and earlier references therein.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. 1986. Learning internal representations by error propagation. In Parallel Distributed Processing, Vol. 1, D. E. Rumelhart and J. L. McClelland, eds. MIT Press, Cambridge, MA.
Sutton, R. S. 1986. Two problems with back propagation and other steepest-descent learning procedures for networks. Proc. 8th Ann. Conf. Cognitive Sci. Soc., p. 823.
Watrous, R. L. 1987. Learning algorithms for connectionist networks: Applied gradient methods of nonlinear optimization. Proc. IEEE First Int. Conf. Neural Networks, San Diego, p. II-619.
Werbos, P. J. 1988. Backpropagation: Past and future. Proc. IEEE Int. Conf. Neural Networks, San Diego, p. I-343, and earlier references therein.
Widrow, B. and Stearns, S. D. 1985. Adaptive Signal Processing. Prentice-Hall, Englewood Cliffs, NJ.
Received 10 July 1989; accepted 13 November 1989.
Errata
In Halbert White’s “Learning in Artificial Neural Networks: A Statistical Perspective” (1:447), step one was omitted from a discussion of the multilevel single linkage algorithm of Rinnooy Kan et al. (1985). The step is:
1. Draw a weight vector w from the uniform distribution over w.
The artwork for figures 2 and 3 in “Backpropagation Applied to Handwritten Zip Code Recognition,” by Y. LeCun et al. (1:545 and 548) was transposed. The legends were placed correctly.
Communicated by Dana Ballard
Visual Perception of Three-Dimensional Motion David J. Heeger* The Media Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139 USA
Allan Jepson Computer Science Department, University of Toronto, Toronto, Ontario M5S 1A4, Canada
As an observer moves and explores the environment, the visual stimulation in his eye is constantly changing. Somehow he is able to perceive the spatial layout of the scene, and to discern his movement through space. Computational vision researchers have been trying to solve this problem for a number of years with only limited success. It is a difficult problem to solve because the relationship between the optical-flow field, the 3D motion parameters, and depth is nonlinear. We have come to understand that this nonlinear equation describing the optical-flow field can be split by an exact algebraic manipulation to yield an equation that relates the image velocities to the translational component of the 3D motion alone. Thus, the depth and the rotational velocity need not be known or estimated prior to solving for the translational velocity. The algorithm applies to the general case of arbitrary motion with respect to an arbitrary scene. It is simple to compute and it is plausible biologically. 1 Introduction
Almost 40 years ago, Gibson (1950) pointed out that visual motion perception is critical for an observer's ability to explore and interact with his environment. Since that time, perception of motion has been studied extensively by researchers in the fields of visual psychophysics, visual neurophysiology, and computational vision. It is now well-known that the visual system has mechanisms that are specifically suited for analyzing motion. In particular, human observers are capable of recovering accurate information about the translational component of three-dimensional motion from the motion in images (Warren and Hannon 1988).
*Current address: NASA-Ames Research Center, mail stop 262-2, Moffett Field, CA 94035 USA.
Neural Computation 2, 129-137 (1990) © 1990 Massachusetts Institute of Technology
The first stage of motion perception is generally believed to be the measurement of image motion, or optical flow, a field of two-dimensional velocity vectors that encodes the direction and speed of displacement for each small region of the visual field. A variety of algorithms for computing optical flow fields have been proposed by a number of computational vision researchers (e.g., Horn and Schunck 1981; Anandan 1989; Heeger 1987). The second stage of motion perception is the interpretation of optical flow in terms of objects and surfaces in the three-dimensional world. As an observer (or camera) moves with respect to a rigid scene (object or surface), the image velocity at a particular image point depends nonlinearly on three quantities: the translational velocity of the observer relative to a point in the scene, the relative rotational velocity between the observer and the point in the scene, and the distance from the observer to the point in the scene. This paper presents a simple algorithm for recovering the translational component of 3D motion. The algorithm requires remarkably little computation; it is straightforward to design parallel hardware capable of performing these computations in real time. The mathematical results in this paper have direct implications for research on biological motion perception.
2 3D Motion and Optical Flow
We first review the physics and geometry of instantaneous rigid-body motion under perspective projection, and derive an equation relating 3D motion to optical flow. Although this equation has been derived previously by a number of authors (e.g., Longuet-Higgins and Prazdny 1980; Bruss and Horn 1983; Waxman and Ullman 1985), we write it in a new form that reveals its underlying simplicity. Each point in a scene has an associated position vector, $P = (X, Y, Z)^t$, relative to a viewer-centered coordinate frame as depicted in Figure 1. Under perspective projection this surface point projects to a point in the image plane, $(x, y)^t$,
$$x = fX/Z, \qquad y = fY/Z \tag{2.1}$$
where $f$ is the "focal length" of the projection. Every point of a rigid body shares the same six motion parameters relative to the viewer-centered coordinate frame. Due to the motion of the observer, the relative motion of a surface point is
$$V = \left( \frac{dX}{dt}, \frac{dY}{dt}, \frac{dZ}{dt} \right)^t = -(\Omega \times P + T) \tag{2.2}$$
Figure 1: Viewer-centered coordinate frame and perspective projection.
where $T = (T_x, T_y, T_z)^t$ and $\Omega = (\Omega_x, \Omega_y, \Omega_z)^t$ denote, respectively, the translational and rotational velocities. Image velocity, $\theta(x, y)$, is defined as the derivatives, with respect to time, of the x- and y-components of the projection of a scene point. Taking derivatives of equation 2.1 with respect to time, and substituting from equation 2.2 gives
$$\theta(x, y) = p(x, y) A(x, y) T + B(x, y) \Omega \tag{2.3}$$
where $p(x, y) = 1/Z$ is inverse depth, and where
$$A(x, y) = \begin{pmatrix} -f & 0 & x \\ 0 & -f & y \end{pmatrix}, \qquad B(x, y) = \begin{pmatrix} xy/f & -(f + x^2/f) & y \\ f + y^2/f & -xy/f & -x \end{pmatrix}$$
The $A(x, y)$ and $B(x, y)$ matrices depend only on the image position and the focal length, not on any of the unknowns. Equation 2.3 describes the image velocity for each point on a rigid body, as a function of the 3D motion parameters and the depth. An important observation about equation 2.3 is that it is bilinear; $\theta$ is a linear function of $T$ and $\Omega$ for fixed $p$, and it is a linear function of $p$ and $\Omega$ for fixed $T$.
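Equation 2.3 can be checked numerically. The sketch below is an illustration only: the A and B matrices are coded as obtained by differentiating equation 2.1 and substituting equation 2.2, and the scene point, motion parameters, and focal length are arbitrary made-up values. The function names are hypothetical. It compares the flow predicted by equation 2.3 with a finite-difference estimate obtained by moving the point according to equation 2.2 and reprojecting.

```python
import numpy as np

def A_matrix(x, y, f):
    """Translational part of equation 2.3 at image point (x, y)."""
    return np.array([[-f, 0.0, x],
                     [0.0, -f, y]])

def B_matrix(x, y, f):
    """Rotational part of equation 2.3 at image point (x, y)."""
    return np.array([[x * y / f, -(f + x * x / f), y],
                     [f + y * y / f, -x * y / f, -x]])

def project(P, f):
    """Perspective projection of a 3D point, equation 2.1."""
    return f * P[:2] / P[2]

def flow_from_equation(P, T, Omega, f):
    """Image velocity predicted by equation 2.3."""
    x, y = project(P, f)
    p = 1.0 / P[2]  # inverse depth
    return p * (A_matrix(x, y, f) @ T) + B_matrix(x, y, f) @ Omega

def flow_by_finite_difference(P, T, Omega, f, dt=1e-6):
    """Image velocity from moving the point by equation 2.2 and reprojecting."""
    V = -(np.cross(Omega, P) + T)
    return (project(P + dt * V, f) - project(P - dt * V, f)) / (2.0 * dt)

# Arbitrary illustrative values (assumptions, not taken from the paper).
P = np.array([0.5, -0.3, 4.0])
T = np.array([0.2, 0.1, 1.0])
Omega = np.array([0.01, -0.02, 0.03])
f = 1.0

predicted = flow_from_equation(P, T, Omega, f)
measured = flow_by_finite_difference(P, T, Omega, f)
print(np.allclose(predicted, measured, atol=1e-6))
```

The same functions make the bilinearity visible: doubling T while holding p and Omega fixed changes the flow by exactly p A T.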
3 Optical Flow at Five Image Points
Since both $p(x, y)$ (the inverse depth) and $T$ (the translational component of motion) are unknowns and since they are multiplied together in equation 2.3, they can each be determined only up to a scale factor; that is, we can solve for only the direction of translation and the relative depth, not for the absolute 3D translational velocity or for the absolute depth. For the rest of this paper, $T$ denotes a unit vector in the direction of the 3D translation (note that $T$ now has only two degrees of freedom), and $p(x, y)$ denotes the relative depth for each image point. It is impossible to recover the 3D motion parameters, given the image velocity, $\theta(x, y)$, at only a single image point; there are six unknowns on the right-hand side of equation 2.3 and only two measurements [the two components of $\theta(x, y)$]. Several flow vectors, however, may be utilized in concert to recover the 3D motion parameters and depth. Image velocity measurements at five or more image points are necessary to solve the problem (Prazdny 1983), although any number of four or more vectors may be used in the algorithm described below. For each of five image points, a separate equation can be written in the form of equation 2.3. These five equations can also be collected together into one matrix equation (reusing the symbols in equation 2.3 rather than introducing new notation):
$$\theta = A(T) p + B \Omega = C(T) q \tag{3.1}$$
where $\theta$ (a 10-vector) is the image velocity at each of the five image points, and $p$ (a 5-vector) is the inverse depth at each point. $A(T)$ (a 10 x 5 matrix) is obtained by collecting together into a single matrix $A(x, y)T$ for each $x$ and $y$:
$$A(T) = \begin{pmatrix} A(x_1, y_1)T & & 0 \\ & \ddots & \\ 0 & & A(x_5, y_5)T \end{pmatrix}$$
Similarly, $B$ (a 10 x 3 matrix) is obtained by collecting together into a single matrix the five $B(x, y)$ matrices:
$$B = \begin{pmatrix} B(x_1, y_1) \\ \vdots \\ B(x_5, y_5) \end{pmatrix}$$
Finally, $q$ (an 8-vector) is obtained by collecting together into one vector the unknown depths and rotational velocities, and $C(T)$ (a 10 x 8 matrix) is obtained by placing the columns of $B$ alongside the columns of $A(T)$:
$$q = \begin{pmatrix} p \\ \Omega \end{pmatrix}, \qquad C(T) = \begin{pmatrix} A(T) & B \end{pmatrix}$$
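As a rough illustration of how the five per-point equations stack into one matrix equation, the sketch below assembles C(T) with NumPy. It assumes the per-point A and B matrices that follow from equations 2.1 and 2.2; the point coordinates, focal length, depths, and rotation are made-up values, and `build_C` is a hypothetical helper name.

```python
import numpy as np

def A_matrix(x, y, f):
    """Translational part of equation 2.3 (derived from equations 2.1 and 2.2)."""
    return np.array([[-f, 0.0, x],
                     [0.0, -f, y]])

def B_matrix(x, y, f):
    """Rotational part of equation 2.3 (derived from equations 2.1 and 2.2)."""
    return np.array([[x * y / f, -(f + x * x / f), y],
                     [f + y * y / f, -x * y / f, -x]])

def build_C(points, T, f):
    """Return C(T) = [A(T) | B]; 10 x 8 when five image points are given."""
    n = len(points)
    A_T = np.zeros((2 * n, n))
    B = np.zeros((2 * n, 3))
    for i, (x, y) in enumerate(points):
        A_T[2 * i:2 * i + 2, i] = A_matrix(x, y, f) @ T  # one 2 x 1 block per point
        B[2 * i:2 * i + 2, :] = B_matrix(x, y, f)        # 2 x 3 blocks stacked vertically
    return np.hstack([A_T, B])

# Five points laid out like the five on a die; all values are illustrative.
points = [(-1.0, -1.0), (1.0, -1.0), (0.0, 0.0), (-1.0, 1.0), (1.0, 1.0)]
f = 1.0
T = np.array([0.3, 0.2, 1.0])
T = T / np.linalg.norm(T)                  # unit direction of translation

C = build_C(points, T, f)
p = np.array([0.5, 0.8, 1.0, 0.6, 0.9])    # relative inverse depths (made up)
Omega = np.array([0.01, -0.02, 0.03])      # rotational velocity (made up)
q = np.concatenate([p, Omega])
theta = C @ q                              # the 10-vector of image velocities

print(C.shape, np.linalg.matrix_rank(C))
```

Each pair of rows of `theta` is the two-component image velocity that equation 2.3 would give at the corresponding point.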
4 Recovering the Direction of Translation
We now present a method for recovering the observer's 3D translational velocity, $T$. The depths and rotational velocity need not be known or estimated prior to solving for $T$. The result is a residual surface, $R(T)$, over the discretely sampled space of all possible translation directions. An illustration of such a residual surface is depicted in Figure 2. The residual function, $R(T)$, is defined such that $R(T_0) = 0$ for $T_0$ equal to the actual 3D translational velocity of the observer, and such that $R(T) > 0$ for $T$ different from the actual value. Equation 3.1 relates the image velocities, $\theta$, at five image points to the product of a matrix, $C(T)$ (that depends on the unknown 3D translational velocity), times a vector, $q$ (the unknown depths and rotational velocity). The matrix, $C(T)$, divides 10-space into two subspaces: the 8-dimensional subspace that is spanned by its columns, and the leftover orthogonal 2-dimensional subspace. The columns of $C(T)$ are guaranteed to span the full 8 dimensions for almost all choices of five points and almost any $T$. In particular, an arrangement of sample points like that shown on dice is sufficient. The 8-dimensional subspace is called the range of $C(T)$, and the 2-dimensional subspace is called the orthogonal complement of $C(T)$. Let $C^\perp(T)$ (a 10 x 2 matrix) be an orthonormal basis for the 2-dimensional orthogonal complement of $C(T)$. It is straightforward, using techniques of numerical linear algebra (Strang 1980), to choose a $C^\perp(T)$ matrix given $C(T)$. The residual function is defined in terms of this basis for the orthogonal complement:
$$R(T) = \| \theta^t C^\perp(T) \|^2$$
Figure 2: The space of all possible translation directions (T-space, drawn as a flattened hemisphere) is made up of the points on the unit hemisphere. The residual function, $R(T)$, is defined to be zero for the true direction of translation. The two-dimensional solution space is parameterized by $\alpha$ and $\beta$, the angles (in spherical coordinates) that specify each point on the unit hemisphere.
Given a measurement of image velocity, $\theta$, and the correct translational velocity, $T_0$, the following three statements are equivalent:
$$\theta = C(T_0) q \quad \text{for some } q$$
$$\theta \in \mathrm{range}[C(T_0)]$$
$$\| \theta^t C^\perp(T_0) \| = 0$$
Figure 3: The direction of translation is recovered by subdividing the flow field into patches. A residual surface is computed for each patch using image velocity measurements at five image points from within that patch. The residual surfaces from each patch are then summed to give a single solution. (Panel label: Image Velocity Data.)
Since $\theta$ is in the column space (the range) of $C(T_0)$, and since $C^\perp(T_0)$ is orthogonal to $C(T_0)$, it is clear that $R(T_0) = 0$. The residual function can be computed in parallel for each possible choice of $T$, and residual surfaces can be computed in parallel for different sets of five image points. The resulting residual surfaces are then summed, as illustrated in Figure 3, giving a global least-squares estimate for $T$. It is important to know if there are incorrect translational velocities that might also have a residual of zero. For five point patterns that have small angular extent (e.g., 10° of visual angle or smaller) there may be multiple zeroes in the residual surface. When the inverse depth values of the five points are sufficiently nonplanar, it can be shown that the zeroes of $R(T)$ are concentrated near the great circle that passes through the correct translational direction $T = T_0$ and through the translational direction that corresponds to moving directly toward the center of the five image points. For four point patterns there is a curve of solutions near this great circle.
The solution is disambiguated by summing residual surfaces from different five point patterns. Two or more sets of five point patterns, in significantly different visual directions, have zeroes concentrated near different great circles. They have simultaneous zeroes only near the intersection of the great circles, namely near $T = T_0$. In cases where the inverse depths are nearly planar, or velocity values are available only within a narrow visual angle, it may be impossible to obtain a unique solution. These cases will be the exception rather than the rule in natural situations. The matrices $C(T)$ and $C^\perp(T)$ depend only on the locations of the five image points and on the choice of $T$; they do not depend on the image velocity inputs. Therefore, the $C^\perp(T)$ matrices may be precomputed (and stored) for each set of five image points, and for each of the discretely sampled values of $T$. As new flow-field measurements become available from incoming images, the residual function, $R(T)$, is computed in two steps: (1) a pair of weighted summations, given by $\theta^t C^\perp(T)$, and (2) the sum of the square of the two resulting numbers. The algorithm is certainly simple enough (a weighted summation followed by squaring followed by a summation) to be implemented in visual cortex. A number of cells could each compute $R(T)$ (or perhaps some function like $\exp[-R(T)]$), each for a different choice of $T$, and each for some region of the visual field. Each cell would thus be tuned for a different direction of 3D translation. The perceived direction of motion would then be represented as the minimum (or the peak) in the distribution of cell firing rates. While it is well-known that cells in several cortical areas (e.g., areas MT and MST) of the primate brain are selective for 2D image velocity, physiologists have not yet tested whether these cells are selective for 3D motion.
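The precompute-then-sum procedure described above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the authors' implementation: the orthogonal complement is taken from a full SVD (one of several standard ways to obtain it), the hemisphere grid, patch layouts, depths, and rotation are made up, the flow is noise-free and synthetic, and all function names are hypothetical.

```python
import numpy as np

def build_C(points, T, f=1.0):
    """C(T) = [A(T) | B], with A and B as derived from equations 2.1 and 2.2."""
    n = len(points)
    C = np.zeros((2 * n, n + 3))
    for i, (x, y) in enumerate(points):
        A = np.array([[-f, 0.0, x], [0.0, -f, y]])
        B = np.array([[x * y / f, -(f + x * x / f), y],
                      [f + y * y / f, -x * y / f, -x]])
        C[2 * i:2 * i + 2, i] = A @ T
        C[2 * i:2 * i + 2, n:] = B
    return C

def orthogonal_complement(C):
    """Orthonormal basis (10 x 2 here) for the complement of range(C), via full SVD."""
    U, _, _ = np.linalg.svd(C, full_matrices=True)
    return U[:, C.shape[1]:]

def sample_hemisphere(n_alpha=9, n_beta=12):
    """Unit vectors with positive z-component on a coarse (alpha, beta) grid."""
    alphas = np.linspace(0.1, 1.4, n_alpha)
    betas = np.linspace(0.0, 2.0 * np.pi, n_beta, endpoint=False)
    return np.array([[np.sin(a) * np.cos(b), np.sin(a) * np.sin(b), np.cos(a)]
                     for a in alphas for b in betas])

def dice(cx, cy, h):
    """Five image points arranged like the five on a die."""
    return [(cx - h, cy - h), (cx + h, cy - h), (cx, cy),
            (cx - h, cy + h), (cx + h, cy + h)]

patches = [dice(0.0, 0.0, 0.5), dice(0.8, -0.6, 0.3)]
candidates = sample_hemisphere()

# Offline: the complements depend only on patch geometry and the candidate T.
complements = [[orthogonal_complement(build_C(pts, T)) for T in candidates]
               for pts in patches]

# Synthetic flow for a known translation direction (one of the grid points).
rng = np.random.default_rng(0)
true_index = 40
Omega = np.array([0.01, -0.02, 0.015])
flows = [build_C(pts, candidates[true_index])
         @ np.concatenate([rng.uniform(0.2, 1.0, 5), Omega])
         for pts in patches]

# Online: two weighted sums per candidate, squared and added, summed over patches.
residual = np.zeros(len(candidates))
for theta, per_candidate in zip(flows, complements):
    for j, C_perp in enumerate(per_candidate):
        weighted_sums = theta @ C_perp
        residual[j] += np.sum(weighted_sums ** 2)

estimate = candidates[np.argmin(residual)]
print(np.argmin(residual), estimate)
```

Only the final double loop touches the flow measurements; everything before it could be stored ahead of time, which is the property the text relies on.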
5 Discussion
The algorithm presented in this paper demonstrates that it is simple to recover the translational component of 3D motion from motion in images. We have tested the algorithm and compared its performance to other proposed algorithms. Simulations using synthetic flow fields with noise added demonstrate that our new approach is much more robust. These results are reported in a companion paper (in preparation). We also show in a companion paper that it is straightforward to recover the rotational velocity and the depth once the translation is known. This helps to substantiate Gibson's conjecture that observers can gain an accurate perception of the world around them by active exploration, and that an unambiguous interpretation of the visual world is readily available from visual stimuli.
References
Anandan, P. 1989. A computational framework and an algorithm for the measurement of visual motion. Int. J. Comp. Vision 2, 283-310.
Bruss, A. R., and Horn, B. K. P. 1983. Passive navigation. Comput. Vision, Graphics, Image Proc. 21, 3-20.
Gibson, J. J. 1950. The Perception of the Visual World. Houghton Mifflin, Boston.
Heeger, D. J. 1987. Model for the extraction of image flow. J. Opt. Soc. Am. A 4, 1455-1471.
Horn, B. K. P., and Schunck, B. G. 1981. Determining optical flow. Artif. Intell. 17, 185-203.
Longuet-Higgins, H. C., and Prazdny, K. 1980. The interpretation of a moving retinal image. Proc. R. Soc. London B 208, 385-397.
Prazdny, K. 1983. On the information in optical flows. Comput. Vision, Graphics, Image Proc. 22, 239-259.
Strang, G. 1980. Linear Algebra and Its Applications. Academic Press, New York.
Warren, W. H., and Hannon, D. J. 1988. Direction of self-motion is perceived from optical flow. Nature (London) 336, 162-163.
Waxman, A. M., and Ullman, S. 1985. Surface structure and three-dimensional motion from image flow kinematics. Int. J. Robot. Res. 4, 72-94.
Received 28 December 1989; accepted 20 February 1990.
Communicated by Geoffrey Hinton and Steven Zucker
Distributed Symbolic Representation of Visual Shape Eric Saund Xerox Palo Alto Research Center, 3333 Coyote Hill Road, Palo Alto, CA 94304 USA
The notion of distributed representation has gained significance in explanations of connectionist or "neural" networks. This communication shows that the concept also offers motivation in devising representations for visual shape within a symbolic computing paradigm. In a representation for binary (silhouette) shapes, and in analogy with conventional distributed connectionist networks, descriptive power is gained when microfeatures are available naming important spatial relationships in images. Our symbolic approach is introduced through a vocabulary of 31 "hand built" shape descriptors operating in the two-dimensional shape world of the dorsal fins of fishes.
1 Introduction
A distributed representation expresses information through the ensemble behavior of a collection of microfeatures (Rumelhart et al. 1986). In a conventional connectionist network, microfeatures are created as the result of a training procedure. They arise presumably because they capture some statistical regularity over training data, and to this extent microfeatures contribute to a system's ability to perform correct generalizations and inferences on novel data (Sejnowski and Rosenberg 1987; Rosenberg 1987). The term "microfeature" refers to the fact that units of information are of relatively small grain size - each microfeature typically stands for only a fragment of what a human would consider a unified conceptual object (Hinton 1989). This paper shows that the notion of distributed representation can profitably be exported from its origin in connectionist networks, and incorporated in a symbolic computing architecture in the problem domain of visual shape representation. Our overall goal is to devise representations for visual shape that will support a wide range of visual tasks, including recognizing shapes, classifying shapes into predetermined categories, and delivering meaningful assessments of the ways in which objects may be considered similar or different in shape.
Our specific shape world is a particular class of naturally occurring two-dimensional binary (silhouette) shapes, namely, the dorsal fins of fishes. Microfeatures play an important role in making explicit spatial relationships among fragments of shape such as edges and regions. A particular shape is represented not in any single symbolic node, but through the ensemble behavior of a collection of such microfeatures.
Neural Computation 2, 138-151 (1990) © 1990 Massachusetts Institute of Technology
2 Shape Fragments Represented by Symbolic Tokens
The system architecture is based on symbolic tokens placed on a blackboard, in the style of a production system. Tokens of various types make explicit fragments of shape such as edges, corners, and partially enclosed regions, as shown in Figure 1. Shape tokens are asserted via grouping processes; characteristic configurations or constellations of primitive tokens provide support for the assertion of additional, more abstract, tokens (Marr 1976). We may view the microfeatures of present concern as occurring at the topmost level in an abstraction hierarchy of token types. An explanation of the token grouping operations themselves lies beyond the scope of this paper; for details the reader is referred to Saund (1988b, 1989). In our implementation a microfeature may be regarded as a deformable template, as shown in Figure 2. Unlike purely iconic, pixel-based templates, these templates are not correlated with the image directly, but are asserted according to alignment with more primitive shape fragments represented by shape tokens. The deformation of each template type resembles that of a mechanical linkage, and is characterized by a scalar-valued parameter. For example, the internal parameter of the microfeature shown in Figure 2a makes explicit the relative orientation between an edge and a corner occurring within a certain proximity of one another.
3 Domain-Specific Shape Microfeatures
Under this framework, the descriptive power of a shape representation lies in its vocabulary of deformable template-like descriptors. To what spatial configurations and deformations among shape fragments should explicit microfeatures be devoted? Our view is that a vocabulary of abstract level shape descriptors should be designed to capture the geometric regularities and structure inherent to the target shape domain. Shape events should be labeled that are most useful in characterizing and distinguishing among the shapes that will be encountered by the system. Because the geometric structures and attributes of greatest descriptive significance may differ from domain to domain, it is to be expected that appropriate descriptive vocabularies might be, to at least some degree, domain-specific. For example, a language for distinguishing among the shapes of fish dorsal fins might contain terms for degree of sweepback and angle of protrusion from the body - terms that are irrelevant to the shapes of tree leaves.
Figure 1: (a) Profile fish shape. (b) Shape tokens denote edge, corner, and region shape fragments occurring in the image (labeled examples: FULL-CORNER, PARTIAL-CIRCULAR-REGION). A shape token is depicted as a line segment with a circle at one end indicating orientation.
Figure 2: A microfeature resembles a template that can deform like a mechanical linkage to fit a range of fin shapes. By maintaining a characteristic parameter of deformation, each shape microfeature provides explicit access to an aspect of spatial configuration by which dorsal fins may vary in shape. In (a), a microfeature captures the relative orientation between tokens representing a particular corner (circle) and edge (ellipse). (b) Many microfeatures can overlap and share support as they latch onto shape fragments present in the image.
Just as a connectionist network embodies knowledge of a problem space within link weights, a vocabulary of shape microfeatures becomes the locus of knowledge about the geometric configurations expected to be encountered in the visual world. Through careful conscious examination of a test set of 43 fish dorsal fin shapes, we have designed a vocabulary of 31 shape microfeatures well-suited to describing and distinguishing among fish dorsal fins. (The class of dorsal fins considered is limited to those that protrude outward from the body; we exclude fishes whose dorsal fins extend along the entire length of the body.) A methodology for designing these shape descriptors is not formalized. Roughly, however, it consisted in identifying collections of dorsal fin shapes that appeared clearly similar or different in some regard, and analyzing the geometric relationships among edge, corner, and region shape fragments that contributed to the similarities or differences in appearance. For example, noticing the "notch" feature occurring at the rear of many dorsal fins led to the development of several templates whose deformations correspond to variations in the depth and vertex angle of the notch. The complete set of shape descriptors is presented in Figure 3. Although it is based on symbolic tokens instead of link weights in a wired network, this shape vocabulary shares three important properties with the microfeatures of traditional connectionist distributed representations: (1) Each descriptive element makes explicit a geometric property pertaining to only a portion of the entire shape; note that no descriptor is called, say, ANCHOVY-DORSAL-FIN or SHARK-DORSAL-FIN. (2) The descriptors share support at the image level. This is to say, two or more abstract level deformable templates may latch onto the same edge or corner fragment. (3) The descriptors overlap one another in a redundant fashion.
Many microfeatures participate in the characterization of a fin shape, and it is only through the ensemble description that an object's geometry is specified in its entirety.
4 What Is Gained by Using Shape Microfeatures?
The value in this distributed style of representation derives from the direct access it provides to a large number of geometric properties that distinguish shapes in the target domain. By offering explicit names for many significant ways in which one dorsal fin can be similar or different in shape from another, the representation supports a variety of visual tasks, including shape classification, recognition, comparison, and (foreseeably) graphic manipulation. We shall offer three illustrations of the representation at work.
Figure 3: Complete vocabulary of shape microfeatures designed for the dorsal fin domain. Arrows depict the deformation parameter reflected in each microfeature's name.
First, however, it is useful to consider the deformable template microfeature representation in light of a feature space interpretation. The assertion of an abstract level shape descriptor carries two pieces of information: (1) The statement that a qualified configuration of edge, corner, and/or region shape fragments occurs at this particular pose (location, orientation and scale) in the image, and (2) a scalar parameter corresponding to the template deformation required to latch onto these fragments. As such, the complete distributed description of a shape may be regarded as a point in a high-dimensional feature space (or hyperspace), each feature dimension contributed by the scalar parameter of an individual microfeature. Note, however, that every shape object creates its own feature space, depending on the microfeatures fitting its particular geometry, and different objects' feature spaces may or may not share particular dimensions in common. For example, certain microfeatures apply only to dorsal fins that are rounded on top (e.g., CONFIG-III-TOPARC-CURVATURE), and these dimensions are absent in the descriptions of sharply pointed fins. In this regard the present shape representation differs from a connectionist network, which employs the same set of nodes, or hyperspace feature dimensions, for all problems.
Figure 4a shows that the distributed shape vocabulary supports classification of dorsal fin shapes into categories. One simple representation for a shape category is a window in a microfeature hyperspace. A shape's membership in a given category may be cast as a function of (1) the microfeatures it shares in common with the category's hyperspace, and (2) the shape's proximity to the category window along their common microfeature dimensions. Details of such a scheme are presented in Saund (1988b); however, the efficacy of this approach lies not in the computational details, but in the particular feature dimensions available for establishing category boundaries. By tailoring microfeatures to the significant geometric properties of the domain, the representation provides fitting criteria for the establishment of salient equivalence classes of shapes.
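To make the "window in a microfeature hyperspace" idea concrete, here is a minimal sketch. The scoring rule is an assumption made for illustration and is not the scheme of Saund (1988b): membership is taken as the fraction of the category's dimensions present in the shape, multiplied by an average closeness that equals 1 inside the window and decays exponentially outside it. The microfeature names come from Figure 3, but every numeric value, the `falloff` parameter, and the function name are hypothetical.

```python
import math

def membership(shape, category, falloff=1.0):
    """Score in [0, 1] for how well a shape description fits a category window.

    shape:    dict mapping microfeature name -> deformation parameter
    category: dict mapping microfeature name -> (low, high) window bounds
    """
    common = [name for name in category if name in shape]
    if not common:
        return 0.0  # no shared feature dimensions at all
    coverage = len(common) / len(category)       # criterion (1): shared microfeatures
    closeness = 0.0
    for name in common:                          # criterion (2): proximity to the window
        low, high = category[name]
        value = shape[name]
        distance = max(low - value, 0.0, value - high)  # zero inside the window
        closeness += math.exp(-distance / falloff)
    return coverage * closeness / len(common)

# Hypothetical category window and shape descriptions (all numbers made up).
rounded = {
    "CONFIG-III-TOPARC-CURVATURE": (0.4, 1.2),
    "LEADING-EDGE-ANGLE": (0.6, 1.4),
}
fin_inside = {"CONFIG-III-TOPARC-CURVATURE": 0.8, "LEADING-EDGE-ANGLE": 1.0,
              "NOTCH-VERTEX-ANGLE": 2.1}
fin_outside = {"CONFIG-III-TOPARC-CURVATURE": 2.0, "LEADING-EDGE-ANGLE": 1.0}
fin_pointed = {"NOTCH-VERTEX-ANGLE": 1.7}   # shares no dimension with the category

for fin in (fin_inside, fin_outside, fin_pointed):
    print(round(membership(fin, rounded), 3))
```

Because each shape carries only the dimensions of the microfeatures that fit it, the comparison is restricted to the dimensions the shape and the category have in common, as the text describes.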
Figure 4a: (1) Dotted line segregates a region of microfeature hyperspace in which "Flaglike" dorsal fins are clustered. Axes: LECPE-BACK-EDGE-ORIENTATION, CONFIG-II-TOP-CORNER-BASE-DORIENTATION, NOTCH-DEPTH-PICLE-WIDTH-RATIO. (2) Dorsal fin shapes classified into several perceptually salient categories.
Figure 4b shows that the representation supports evaluation of degree of similarity among shapes. Here, dorsal fins were rank ordered by a computer program that estimated similarity to a target fin (circled). This computation would be useful to a shape recognition task, in which it must be determined whether a novel shape is sufficiently similar to a known candidate shape. Of course, different shapes may be considered similar or different from one another in many different ways. Under the microfeature representation it is possible to design similarity measures differentially weighting various shape properties according to contextual or task-specific considerations. Again, as in the shape classification task, we emphasize that a rich vocabulary of shape microfeatures provides an appropriate language for expressing the criteria by which shape similarity and difference may be evaluated.
Figure 4: (b) Fin shapes rank ordered by rated similarity to a target shape (circled). Panel labels include the targets Mudminnows, Killifishes, and Gars, and the categories rounded and broomstick.
Finally, Figure 4c shows that a microfeature representation supports analysis of the ways in which one shape must be deformed to make it more similar to another. A computer program drew arrows indicating that, for example, a Trout-Perch dorsal fin must be squashed and skewed to the right to make it more similar to an Anchovy dorsal fin. Because microfeatures make explicit deformation parameters of deformable shape templates, the components of deformation relating two shapes sharing some of the same microfeatures may be read off the microfeatures directly. Going in the converse direction, one may imagine that a computer graphics application could directly control the deformation of individual shape microfeatures, which would then push and shove on other microfeatures and more primitive fragments of shape to modify an object's geometry under user control. An energy-minimization approach to this facility is presented in Saund (1988a, 1988b).
Figure 4: (c) Arrows depict the deformation required to transform one dorsal fin shape into another. Magnitude and direction components were read directly off shape microfeatures. Panel labels: Anchovies, Puffers, Cavefishes.
5 Conclusion
We have presented a "hand-built" representation for a particular class of shapes illustrating that distributed representations can be employed to advantage in symbolic computation as well as in wired connectionist networks. By designing shape microfeatures according to the structure and constraints of a target shape domain, the descriptive vocabulary provides explicit access to a large number of important geometric properties useful in classifying, comparing, and distinguishing shapes. This approach amounts to taking quite seriously Marr's principle of explicit naming (Marr 1976): If a collection of data is treated as a whole, give it a (symbolic) name so that the collection may be manipulated as a unit. This is the function fulfilled by shape microfeatures that explicitly label certain configurations of edge, corner, and region shape fragments, and this is also the function fulfilled by "hidden units" in connectionist networks, which label certain combinations of activity in other units. The lesson our shape vocabulary adopts from the connectionist tradition is that the chunks of information for which it is useful to create explicit names can be of small grain size, and they may be quite comprehensible even if they do not correspond with the conceptual units of the casual human observer.
Appendix A Overview of the Computational Procedures
This appendix sketches the computational procedures by which a description of a shape in terms of abstract level shape microfeatures may be obtained from a binary image. In our implementation, the description of a shape exists fundamentally as a set of tokens or markers in a blackboard data structure called the Scale-Space Blackboard. This data structure provides for efficient indexing of tokens on the basis of spatial location and size; this is achieved through a stack of two-dimensional arrays, each of whose elements contains a list of tokens present within a localized region of the image, and within a range of scales or sizes. A scale-normalized measure of distance is provided to facilitate access of tokens occurring within spatial neighborhoods, where the neighborhood size is proportionate to scale. A shape token denotes the presence in the image of an image event conveyed by the token's type. Thus, a token is a packet of information carrying the following information: type, x and y location, orientation, scale (size), plus additional information depending on the token's type. Initially, a shape is described in terms of tokens of type PRIMITIVE-EDGE, at some finest scale of resolution. These may be computed from a binary silhouette by performing edge detection, edge linking, contour tracing, and then by placing tokens at intervals along the contour. Next, PRIMITIVE-EDGE tokens are asserted at successively larger scales through a fine-to-coarse aggregation or grouping procedure.
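The token record and the spatially indexed blackboard described above might be sketched as follows. This is a guess at one workable layout, not the implementation of Saund (1988b, 1989): it uses a dictionary keyed by (level, column, row) in place of a dense stack of arrays, assigns one level per octave of scale, and makes each cell as wide as the base scale of its level. The class, method, and parameter names are all hypothetical.

```python
import math
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class Token:
    token_type: str
    x: float
    y: float
    orientation: float                          # radians
    scale: float                                # size of the image event
    extra: dict = field(default_factory=dict)   # type-specific information

class ScaleSpaceBlackboard:
    """Sparse stand-in for a stack of 2D arrays indexed by location and scale."""

    def __init__(self, base_scale=1.0):
        self.base_scale = base_scale
        self.cells = defaultdict(list)          # (level, ix, iy) -> list of tokens

    def _level(self, scale):
        return math.floor(math.log2(scale / self.base_scale))

    def _cell_size(self, level):
        return self.base_scale * 2.0 ** level

    def add(self, token):
        level = self._level(token.scale)
        size = self._cell_size(level)
        key = (level, math.floor(token.x / size), math.floor(token.y / size))
        self.cells[key].append(token)

    def neighbors(self, seed, radius, level_span=1, token_type=None):
        """Tokens within `radius` scale-normalized units of `seed`, at nearby levels."""
        reach = radius * seed.scale             # neighborhood grows with the seed's scale
        seed_level = self._level(seed.scale)
        found = []
        for level in range(seed_level - level_span, seed_level + level_span + 1):
            size = self._cell_size(level)
            ix_lo, ix_hi = math.floor((seed.x - reach) / size), math.floor((seed.x + reach) / size)
            iy_lo, iy_hi = math.floor((seed.y - reach) / size), math.floor((seed.y + reach) / size)
            for ix in range(ix_lo, ix_hi + 1):
                for iy in range(iy_lo, iy_hi + 1):
                    for tok in self.cells.get((level, ix, iy), ()):
                        if tok is seed:
                            continue
                        if token_type is not None and tok.token_type != token_type:
                            continue
                        if math.hypot(tok.x - seed.x, tok.y - seed.y) <= reach:
                            found.append(tok)
        return found

board = ScaleSpaceBlackboard()
seed = Token("PRIMITIVE-EDGE", 10.0, 10.0, 0.0, 2.0)
near = Token("PRIMITIVE-EDGE", 12.0, 11.0, 0.1, 2.0)
far = Token("PRIMITIVE-EDGE", 40.0, 40.0, 0.0, 2.0)
coarse = Token("PRIMITIVE-EDGE", 11.0, 10.0, 0.0, 16.0)
for tok in (seed, near, far, coarse):
    board.add(tok)

print(len(board.neighbors(seed, radius=3.0, token_type="PRIMITIVE-EDGE")))
```

Only cells overlapping the seed's scale-normalized neighborhood are visited, so the cost of a query depends on the local token density rather than on the total number of tokens on the blackboard.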
Roughly, whenever a collection of PRIMITIVE-EDGES is found to align with one another at one scale, a new PRIMITIVE-EDGE may be asserted at the next larger scale. The combinatorics of testing candidate collections of tokens is managed by virtue of the spatial indexing property of the Scale-Space Blackboard: each PRIMITIVE-EDGE at one scale serves as a "seed" for a larger scale PRIMITIVE-EDGE, and only PRIMITIVE-EDGES located within a local (scale-normalized distance) neighborhood of the seed are tested for the requisite alignment. The multiscale PRIMITIVE-EDGE description of a shape serves as a foundation for additional token grouping operations leading to the assertion of EXTENDED-EDGES, FULL-CORNERS, and PARTIAL-CIRCULAR-REGIONS as pictured in Figure 1. The grouping rules are in each case based on identifying clusters or collections of shape tokens occurring with a prescribed spatial configuration. For example, an EXTENDED-EDGE may be asserted whenever a collection of PRIMITIVE-EDGES is found to lie, within certain limits, along a circular arc; in addition to location, orientation, and scale, each EXTENDED-EDGE maintains an internal parameter denoting the curvature of the arc. The grouping rules include means for ensuring that a given image event (e.g., edge or corner) will be labeled by only one token of the appropriate type in the spatial, orientation, and scale neighborhood of the event. Instances of abstract level shape microfeatures are asserted through essentially the same kind of token grouping procedure. Each microfeature specifies an acceptable window on the spatial configuration (scale-normalized distance, relative orientation, direction, and relative scale) of a pair or triple of EXTENDED-EDGE, FULL-CORNER, and/or PARTIAL-CIRCULAR-REGION type tokens. For example, the microfeature pictured in Figure 2 demands the presence on the Scale-Space Blackboard of a FULL-CORNER token and an EXTENDED-EDGE token in roughly the proximity shown. The deformation parameter of each microfeature is typically a simple expression of one aspect of the spatial configuration and/or internal parameters of its constituents, such as relative orientation, scale-normalized distance, scale-normalized curvature (of an EXTENDED-EDGE constituent), vertex angle (of a FULL-CORNER constituent), etc., or ratios of these measures. Again, the combinatorics of testing pairs or triples of shape tokens to see whether they support a given microfeature is limited by the spatial indexing property of the Scale-Space Blackboard.
Instances of the microfeature pictured in Figure 2 are thus found by testing in turn each FULL-CORNER token on the Scale-Space Blackboard; for each FULL-CORNER tested, all of the EXTENDED-EDGE tokens within a local scale-normalized distance neighborhood are gathered up and tested for appropriate proximity. This computation scales linearly with the number of FULL-CORNER tokens present.
When the image contains one or more protuberant objects such as dorsal fin shapes, the microfeatures pertaining to each may be isolated by identifying collections of microfeatures clustering appropriately in spatial location and overlapping in their support. For example, for the LECPE-BACK-EDGE-CURVATURE microfeature and the LEADING-EDGE-CURVATURE microfeature to be interpreted as pertaining to the same dorsal fin shape (see Figure 3), the FULL-CORNER constituent of the former must align with the EXTENDED-EDGE constituent of the latter, and vice versa. Typically, microfeature instances may be found in isolation at scattered locations in an image, but, because they are tailored to the morphological properties of dorsal fins, only at these shapes will microfeatures cluster and overlap one another extensively. Once the microfeatures pertaining to a given dorsal fin have been isolated, the shape of this object is interpreted in terms of the microfeatures' deformation parameters as described in the text.
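The seed-and-neighborhood search described above can be illustrated with a small sketch. It is not the Scale-Space Blackboard itself: the `Token` class, the uniform grid index, and the `radius` and `cell` parameters are hypothetical stand-ins, and indexing by scale and orientation is left out. It shows only how a spatial index limits the EXTENDED-EDGE tokens examined for each FULL-CORNER seed.

```python
import math
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    kind: str      # e.g. "FULL-CORNER" or "EXTENDED-EDGE"
    x: float
    y: float
    scale: float   # characteristic size of the token (hypothetical field)


def build_index(tokens, cell):
    """Bucket tokens by grid cell so nearby tokens can be looked up directly."""
    index = defaultdict(list)
    for tok in tokens:
        index[(math.floor(tok.x / cell), math.floor(tok.y / cell))].append(tok)
    return index


def neighbors(index, seed, kind, radius, cell):
    """Tokens of `kind` within `radius` (in units of the seed's scale) of the seed."""
    reach = radius * seed.scale              # scale-normalized -> image distance
    span = math.ceil(reach / cell)           # how many grid cells to look across
    cx, cy = math.floor(seed.x / cell), math.floor(seed.y / cell)
    found = []
    for i in range(cx - span, cx + span + 1):
        for j in range(cy - span, cy + span + 1):
            for tok in index.get((i, j), ()):
                close = math.hypot(tok.x - seed.x, tok.y - seed.y) <= reach
                if tok.kind == kind and close:
                    found.append(tok)
    return found


def candidate_pairs(tokens, radius=2.0, cell=10.0):
    """One pass over FULL-CORNER seeds; each gathers only local EXTENDED-EDGEs."""
    index = build_index(tokens, cell)
    return [(corner, edge)
            for corner in tokens if corner.kind == "FULL-CORNER"
            for edge in neighbors(index, corner, "EXTENDED-EDGE", radius, cell)]
```

Each candidate pair would still have to pass the microfeature's own window test on relative orientation, direction, and relative scale; the sketch stops at gathering candidates.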
Acknowledgments
This paper describes work done at the MIT Artificial Intelligence Laboratory. Support for the Laboratory's research is provided in part by DARPA under ONR contract N00014-85-K-0124. The author was supported by a fellowship from the NASA Graduate Student Researchers Program.
References
Audubon Society Field Guide to North American Fishes, Whales, and Dolphins. 1983. Knopf, New York.
Hinton, G. 1989. Connectionist learning procedures. Artif. Intell. 40(1-3), 185-234.
Marr, D. 1976. Early processing of visual information. Phil. Trans. R. Soc. London B 275, 483-519.
Rosenberg, C. 1987. Revealing the structure of NETtalk's internal representations. Proc. 9th Ann. Conf. Cognitive Sci. Soc., Seattle, WA, 537-554.
Rumelhart, D., Hinton, G., and Williams, R. 1986. Parallel Distributed Processing: Explorations in the Structure of Cognition. Bradford Books, Cambridge, MA.
Saund, E. 1988a. Configurations of shape primitives specified by dimensionality-reduction through energy minimization. Proc. 1987 AAAI Spring Symp. Ser., Palo Alto, CA, 100-104.
Saund, E. 1988b. The role of knowledge in visual shape representation. MIT AI Lab TR 1092.
Saund, E. 1989. Adding scale to the primal sketch. Proc. IEEE CVPR, San Diego, CA, 70-78.
Sejnowski, T., and Rosenberg, C. 1987. Parallel networks that learn to pronounce English text. Complex Syst. 1, 145-168.
Received 26 April 1989; accepted 15 March 1990.
Communicated by Richard Andersen
Modeling Orientation Discrimination at Multiple Reference Orientations with a Neural Network
M. Devos
G. A. Orban
Laboratorium voor Neuro- en Psychofysiologie, Katholieke Universiteit Leuven, Campus Gasthuisberg, Herestraat, B-3000 Leuven, Belgium
We trained a multilayer perceptron with backpropagation to perform stimulus orientation discrimination at multiple references using biologically plausible values as input and output. Hidden units are necessary for good performance only when the network must operate at multiple reference orientations. The orientation tuning curves of the hidden units change with reference. Our results suggest that at least for simple parameter discriminations such as orientation discrimination, one of the main functions of further processing in the visual system beyond striate cortex is to combine signals representing stimulus and reference.

1 Introduction

Ever since the first microelectrode studies of the retina, it has been clear that our understanding of peripheral parts of the visual system (retina, geniculate, and primary cortex) progresses more rapidly than that of visual cortical areas further removed from the photoreceptors. Contributing to this lack of understanding is the fact that it is more difficult to find relevant stimuli and experimental paradigms for studying the latter areas than the former ones. Recent developments in neural networks (Lehky and Sejnowski 1988; Zipser and Andersen 1988) have suggested that such systems could assist the visual physiologist by suggesting which computations might be performed by higher order areas. There has been a recent surge of interest in the relationship between single cell response properties and behavioral thresholds (Bradley et al. 1987; Vogels and Orban 1989). The question at hand is how much additional processing is required to account for the behavioral thresholds beyond the primary cortical areas in which cells were recorded. Most investigators have concluded that processing beyond striate cortex (V1) should achieve invariance of the signals arising from V1.
At that level single cell responses do indeed depend on many parameters including the object of discrimination, although discrimination itself is invariant for random changes in irrelevant parameters in humans (Burbeck and Regan 1983; Paradiso et al. 1989) as well as in animal models (De Weerd, Vandenbussche, and Orban unpublished; Vogels and Orban unpublished). We have investigated this question by using a feedforward, three-layer perceptron to link single cell orientation tuning and just noticeable differences in orientation, both measured in the behaving monkey. We have used as inputs units with orientation tuning similar to that of striate neurons recorded in the awake monkey performing an orientation discrimination task (Vogels and Orban 1989). In this task, a temporal same-different procedure was used. Two gratings were presented in succession at the same parafoveal position while the monkey fixated a fixation target. If the two gratings had the same orientation the monkey had to maintain fixation. If they differed in orientation, the monkey had to saccade to the grating. The network was trained by backpropagation to achieve the same discrimination performance as the monkey. The network decided whether the stimulus presented to the input units was tilted clockwise or anticlockwise from the reference orientation. This discrimination procedure, an identification procedure, was easier to implement in a three-layer perceptron than the temporal same-different procedure used in the animal experiments. Human psychophysical studies have shown that different psychophysical procedures used to determine discrimination thresholds yield very similar thresholds (Vogels and Orban 1986). Initially (Orban et al. 1989), we trained and tested the network only at one reference orientation. This study revealed little about further cortical processing, since no hidden units were necessary to achieve optimal performance. In the present study we show that when the network is trained and tested at multiple reference orientations, hidden units are required, and we have studied the properties of these hidden units to make predictions for neurophysiological experiments.

Neural Computation 2, 152-161 (1990) © 1990 Massachusetts Institute of Technology

2 The Model
The number of input and hidden units of the multilayer perceptron was variable, but only two output units were used, corresponding to the two perceptual decisions, tilted clockwise or counterclockwise from reference. Input units either represented the stimulus (stimulus units) or the reference orientation (reference units). Stimulus units (Fig. 1A) had Gaussian orientation tunings modeling the tunings recorded in the discriminating monkey (Vogels and Orban 1989) and their variability. Typically 10-40 units were equally spaced over the stimulus orientation range (180°). There were as many reference units as reference orientations and their tuning curve was an impulse function, that is, their value was maximum when a given reference was trained or tested but otherwise minimum. Hidden and output units were biased, that is, a constant was added to each unit's inputs before computing the unit's output. Determining this constant value was part of the learning. Both during training and testing the orientations presented to the network were more closely spaced near the reference(s) than those further away. The network was trained with a variant of backpropagation (Devos and Orban 1988). Training was terminated when the orientation performance curve, plotting percentage correct discrimination as a function of orientation, remained stable for five successive training tests. From these curves just noticeable differences (jnds) were estimated by taking the average orientation difference corresponding to 75% correct for both sides of the reference orientation. The jnds given for each set of network parameters (for example, the number of stimulus or hidden units) are averages of 50 tests, since for each set of parameters the training was repeated five times and each configuration obtained after training was tested 10 times.

[Plot not reproduced; horizontal axis: # of reference orientations (1 to 5).]

Figure 1: Network without hidden units: orientation tuning curves plotting each unit's activity as a function of stimulus orientation of input units (A) and output units (B-D) and resulting psychometric curves plotting percentage correct decisions as a function of stimulus orientation (E and F). The network was trained and tested for one reference orientation (B and E), three references (C and F), and five references (D). In A-D the full line indicates the average tuning and the stippled lines one standard deviation around the mean. The 10 input units had Gaussian tuning with a SD of 19°, and a response strength (activity at optimal orientation) of 110 spikes/sec.

3 Hidden Units Are Necessary to Discriminate at Multiple References
As previously mentioned, no hidden units are necessary for discrimination at a single reference (Orban et al. 1989). In this case, the output units exhibit a sharp change in activity at the reference orientation (Fig. 1B) and the "psychometric curve" has a single narrow symmetrical dip at the reference orientation (Fig. 1E). Increasing the number of references to three induces multiple slopes in the orientation tuning of the output unit (Fig. 1C). Since these slopes are less steep than for a single reference, the dips in the psychometric curve are somewhat wider, yielding larger jnds (Fig. 2), but the psychometric curves have additional dips that do not correspond to references (Fig. 1F). Hence the jnds represent the network performance only for orientations close to one of the three references. When the number of references is increased to five, the tuning curve of the output units is nearly flat and the network can no longer learn the problem (Fig. 1D). The network of Figure 1 had no hidden units and only 10 stimulus units. Increasing the number of stimulus units to 40 improves the performance for two, three, and especially four references (Fig. 2), but the learning still fails at five references. Introducing four or eight hidden units immediately solves the problem (Fig. 2). As shown in Figure 3, the output units again display a steep change in activity at each reference orientation, but this can be achieved only by a change in orientation tuning of the output units with change in reference (Fig. 3).

4 Orientation Tuning of Hidden Units Changes with Reference
As shown in Figure 3, not only the tuning curve of output units, but also that of hidden units changes with reference orientation. Analysis of the different networks obtained for a wide range of conditions (10 to 40 input units, 2 to 8 hidden units, 2 to 5 references) reveals that the strategy of the network is always the same. Output units need to have pulse-like tuning curves with 90° periodicity to yield a psychometric curve with a single narrow dip at the reference. These 90° wide pulse curves are synthesized from the curves of the hidden units, which also have steep parts in their tuning curve, but which are not necessarily separated by 90°. With a change in reference the curves of the hidden units also change so that different "building blocks" become available to synthesize the 90° wide pulse curves of the output units. These changes in the hidden units' tuning curve with changes in reference are obtained by giving very high weights to the connections between reference units and hidden units. Hence, turning a reference unit on or off will shift the "working slopes" of the hidden units. This scheme suggests that the number of hidden units required will increase as the number of references increases. In our study, which used a maximum of five references, a network with eight hidden units did not perform better than one with four hidden units (Fig. 2).

[Panel titles recovered from the figure residue: INPUT UNIT; OUTPUT UNIT: 1 reference.]

Figure 3: Orientation tuning curves of output (A and B) and hidden units (C-H) of a network performing at two references: 45° (A,C,E,G) and 90° (B,D,F,H). Same conventions as in Figure 1.

[Figure 4 panels: A, OUTPUT UNIT; B, SAMPLE HIDDEN UNIT; horizontal axis: stimulus orientation (degrees), from -90 to 90.]

Figure 4: Orientation tuning curves of an output unit (A) and a hidden unit (B) in a network with four hidden units in which the reference units are directly connected to the output units. Same conventions as in Figure 1.

One could argue that the change in orientation tuning of the hidden units is not an essential feature of the network, but a trivial consequence of the connection between the reference units and the hidden units. Therefore we devised a network in which reference units were directly connected to the output units and trained this network in orientation discrimination at five references. Although the result was slightly better than with no hidden units, the network performed poorly. The output units' tuning curves displayed multiple weak slopes (Fig. 4), which yielded psychometric curves with multiple dips as in the case of a network without hidden units and fewer references (e.g., 3, Fig. 1). Of course, this problem could be treated by adding a second layer of hidden units connected with the reference units. Hence, a connection between the last hidden unit layer and the reference units is essential for good network performance.
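The input coding and the role of the reference units can be made concrete with a minimal sketch. The Gaussian tuning width (19°), peak rate (110 spikes/sec), and 10 equally spaced stimulus units follow the values quoted for Figure 1. Everything else is an assumption made for illustration: the logistic activation, the scaling of inputs to [0, 1], the random untrained weights, and all function names. It does not reproduce the authors' trained networks or their backpropagation variant.

```python
import numpy as np


def stimulus_units(theta_deg, n_units=10, sd_deg=19.0, peak=110.0):
    """Gaussian orientation tuning; preferred orientations equally spaced over 180 deg."""
    prefs = np.arange(n_units) * 180.0 / n_units
    diff = (theta_deg - prefs + 90.0) % 180.0 - 90.0   # circular difference in [-90, 90)
    return peak * np.exp(-diff ** 2 / (2.0 * sd_deg ** 2))


def reference_units(ref_index, n_refs):
    """Impulse coding: only the unit of the current reference is on."""
    units = np.zeros(n_refs)
    units[ref_index] = 1.0
    return units


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def random_weights(n_inputs, n_hidden, seed=0):
    """Untrained placeholder weights; biases are the added constants of biased units."""
    rng = np.random.default_rng(seed)
    return (rng.normal(size=(n_hidden, n_inputs)), rng.normal(size=n_hidden),
            rng.normal(size=(2, n_hidden)), rng.normal(size=2))


def forward(theta_deg, ref_index, weights, n_refs, peak=110.0):
    """Stimulus and reference units feed biased hidden units, then two output units."""
    w_hidden, b_hidden, w_out, b_out = weights
    x = np.concatenate([stimulus_units(theta_deg, peak=peak) / peak,
                        reference_units(ref_index, n_refs)])
    hidden = sigmoid(w_hidden @ x + b_hidden)
    output = sigmoid(w_out @ hidden + b_out)   # clockwise vs. counterclockwise
    return hidden, output
```

Because an active reference unit contributes a constant to each hidden unit's net input, a large reference-to-hidden weight moves that unit along its sigmoid. In this sketch that is the mechanism by which switching references shifts the "working slopes".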
5 Conclusions

Others, such as Crick (1989), have pointed out that backpropagation networks are not adequate models of brain function. They are used here merely as proof of existence in a way analogous to Lehky and Sejnowski (1988) and Zipser and Andersen (1988). It could however be argued that contrary to those studies, the properties of the hidden units we obtained are physiologically unrealistic. Orientation tuning curves are generally found to be bell shaped (Orban 1989 for review) and their selectivity is generally summarized by their bandwidth. The few studies devoted to orientation tuning of higher order cortical areas have compared bandwidths (see Orban 1989 for review). Tuning curves with multiple peaks have nonetheless been reported for a few V1 units (De Valois et al. 1982; Vogels and Orban 1989) and for a number (10/147) of V3 units (Felleman and Van Essen 1987). The hidden unit of Figure 4 had a tuning curve with two peaks separated by somewhat less than 90° and which can be described as multipeak. It may be that the multipeak tuning curves reported in the physiological literature in fact are extremes of tuning curves with steep slopes that allow sharp discriminations. Our results also suggest that the brain combines signals from the stimulus and from the reference orientations. Recent physiological studies (Haenny et al. 1988) suggest that some V4 units may encode the reference orientation in a matching to sample task. On the other hand, lesions of IT impair orientation discrimination (Dean 1978). Hence, it seems that for both simple discriminations and for more difficult object recognition, the pathway from V1 to IT is required. This pathway is relayed by V4 where orientation-selective units have been described (Desimone and Schein 1987). Hence a convergence of stimulus and reference signals is possible either in V4 or in IT.
The latter structure could then correspond to the last step in visual processing before the decision (i.e., last layer of hidden units); indeed IT has extensive connections with limbic and frontal cortex areas (for review see Rolls 1985). The units selective for reference orientation in the sample to match tasks used by Haenny et al. (1988) were relatively broadly tuned, suggesting a distributed coding as is the case for stimulus orientation. We have used a local value coding for the reference and it remains to be seen whether or not a different encoding scheme for the reference will yield a different interaction between reference and stimulus at the level of the hidden units. However, the present simulations suggest that the contribution of higher order cortical areas in orientation discrimination should be investigated at multiple reference orientations, and that the neurons should have steep slopes in their tunings, possibly changing with reference. The strongest prediction derived from these simulations is that cortical neurons should change their orientation tuning curves for the test stimulus based on a previously displayed reference stimulus. This change could either be a shift in the slopes of the tuning curve or a
switch from tuning to nontuning depending on the reference orientation. We are presently testing these predictions in IT of monkeys performing an orientation discrimination task.
Acknowledgments
The technical help of P. Kayenbergh, G. Vanparrijs, and G. Meulemans as well as the typewriting of Y. Celis is kindly acknowledged. This work was supported by Grant RFO/AI/01 from the Belgian Ministry of Science to GAO.
References
Bradley, A., Skottun, B. C., Ohzawa, I., Sclar, G., and Freeman, R. 1987. Visual orientation and spatial frequency discrimination: A comparison of single neurons and behavior. J. Neurophysiol. 57, 755-772.
Burbeck, C. A., and Regan, D. 1983. Independence of orientation and size in spatial discriminations. J. Opt. Soc. Am. 73, 1691-1694.
Crick, F. 1989. The recent excitement about neural networks. Nature (London) 337, 129-132.
Dean, P. 1978. Visual cortex ablation and thresholds for successively presented stimuli in rhesus monkeys: I. Orientation. Exp. Brain Res. 32, 445-458.
Desimone, R., and Schein, S. J. 1987. Visual properties of neurons in area V4 of the macaque: Sensitivity to stimulus form. J. Neurophysiol. 57, 835-868.
De Valois, R. L., Yund, E. W., and Hepler, N. 1982. The orientation and direction selectivity of cells in macaque visual cortex. Vision Res. 22, 531-544.
Devos, M. R., and Orban, G. A. 1988. Self-adapting back-propagation. Proc. Neuro-Nîmes, 104-112.
Felleman, D. J., and Van Essen, D. C. 1987. Receptive field properties of neurons in area V3 of macaque monkey extrastriate cortex. J. Neurophysiol. 57, 889-920.
Haenny, P. E., Maunsell, J. H. R., and Schiller, P. H. 1988. State dependent activity in monkey visual cortex: II. Retinal and extraretinal factors in V4. Exp. Brain Res. 69, 245-259.
Lehky, S. R., and Sejnowski, T. J. 1988. Network model of shape-from-shading: Neural function arises from both receptive and projective fields. Nature (London) 333, 452-454.
Orban, G. A. In press. Quantitative electrophysiology of visual cortical neurons. In Vision and Visual Dysfunction, The Electrophysiology of Vision, Vol. 5, A. Leventhal, ed., Macmillan, New York.
Orban, G. A., Devos, M. R., and Vogels, R. In press. Cheapmonkey: Comparing ANN and the primate brain on a simple perceptual task: Orientation discrimination. Proc. NATO ARW.
Paradiso, M. A., Carney, T., and Freeman, R. D. 1989. Cortical processing of hyperacuity tasks. Vision Res. 29, 247-254.
Rolls, E. T. 1985. Connections, functions and dysfunctions of limbic structures, the pre-frontal cortex and hypothalamus. In The Scientific Basis of Clinical Neurology, M. Swash and C. Kennard, eds., pp. 201-213. Churchill Livingstone, London.
Vogels, R., and Orban, G. A. 1986. Decision factors affecting line orientation judgments in the method of single stimuli. Percept. Psychophys. 40, 74-84.
Vogels, R., and Orban, G. A. 1989. Orientation discrimination thresholds of single striate cells in the discriminating monkey. Soc. Neurosci. Abstr. 15, 324.
Zipser, D., and Andersen, R. A. 1988. A back-propagation programmed network that simulates response properties of a subset of posterior parietal neurons. Nature (London) 331, 679-684.
Received 25 September 1989; accepted 23 January 1990.
Communicated by Ralph Linsker
Temporal Differentiation and Violation of Time-Reversal Invariance in Neurocomputation of Visual Information
D. S. Tang
Microelectronics and Computer Technology Corporation, 3500 West Balcones Center Drive, Austin, TX 78759-6509 USA
V. Menon
Department of Computer Sciences, University of Texas at Austin, Austin, TX 78712 USA
Information-theoretic techniques have been employed to study the time-dependent connection strength of a three-layer feedforward neural network. The analysis shows (1) there is a natural emergence of time-dependent receptive field that performs temporal differentiation and (2) the result is shown to be a consequence of a mechanism based on violation of the time-reversal invariance in the visual information processing system. Both analytic and numerical studies are presented.

1 Introduction

A synthetic three-layer feedforward neural network is proposed and shown to have the ability to detect time-dependent changes in the input signal. Using a few physical assumptions about the properties of the transmission channels, we deduce purely within the information-theoretical framework (Shannon and Weaver 1949) that temporal differentiation is based on the violation of time-reversal invariance of the information rate. These results may be relevant to the study of early visual information processing in the retina and the construction of artificial neural network for motion detection. An earlier information-theoretic study (Linsker 1989) with a different model has indicated that the cell's output is approximately a linear combination of a smoothed input and a smoothed first time derivative of the input. Here, we show analytically how the temporal differentiation emerges. Section 2 discusses the information-theoretic formalism of time-dependent neural nets in general terms. A simple three-layer feedforward network is introduced. Analytic solutions to the time-dependent transfer function are derived in Section 3, which are followed by a study of the properties with further numerical calculations in Section 4. Section 5 describes the input-output relations between the layers. Section 6 summarizes the main results.

Neural Computation 2, 162-172 (1990) © 1990 Massachusetts Institute of Technology
2 Time-Dependent Information-Theoretic Formalism
Consider a three-layer feedforward network (Linsker 1986). Each layer of neurons is assumed to be two-dimensional in the X-Y plane spatially. Light signals with variable intensity are assumed to be incident on the first layer, layer A, in the Z-direction. Activities of the layer A neurons are relayed to the second layer, layer B. The output activities in layer B are then sent to layer C neurons. Below we specify the assumptions of our model.

1. Each output neuron at location $r$ is locally connected to its input neurons in the previous layer. The input neurons are spatially distributed according to the gaussian distribution density $\rho(R) = C_\rho \exp(-R^2/2\alpha^2)$ with $C_\rho = N/2\pi\alpha^2$. $N$ is the total number of input neurons, $\alpha^2$ is the variance, and $R$ is measured relative to $r$. All spatial vectors are two-dimensional in the X-Y plane.

2. The output signal, $Y_i(t)$, of the $i$th layer B neuron is assumed to be linear,

$$Y_i(t) = \sum_{j \in (i)} \sum_{\tau = t-\delta}^{t} C_0\, e^{-b(t-\tau)}\, X_j(\tau) \qquad (2.1)$$

Each connection has a constant weight $C_0$ modulated by a microscopic time delay factor $\exp[-b(t-\tau)]$ as in an RC circuit. This time delay models a physical transmission property of the channel, which possesses a simple form of memory, i.e., past signals persist in the channel for a time interval of the order $1/b$. Here, $b$ is the reciprocal of the decay time constant. The temporal summation is from a finite past time $t-\delta$ to the present time $t$. From the causality principle, no future input signals $X_j(\tau)$, $\tau > t$, contribute to the present output signal $Y_i(t)$. It is assumed that the time scale $\delta \gg 1/b$ is satisfied. The index $(i)$ in the spatial summation means that the $N$ input neurons are randomly generated according to the density $\rho$ relative to the location $r_i$ of the $i$th output neuron in accordance with assumption 1. The stochastic input signals $X_j(\tau)$ are assumed to satisfy the a priori probability distribution function

$$P(X) = \left[(2\pi)^{\tilde{N}} \det B\right]^{-1/2} \exp\left\{ -\frac{1}{2} \sum_{j,\tau} \sum_{j',\tau'} [X_j(\tau) - \bar{X}]\, (B^{-1})_{jj'}^{\tau\tau'}\, [X_{j'}(\tau') - \bar{X}] \right\} \qquad (2.2)$$

which is a gaussian and statistically independent, $B_{jj'}^{\tau\tau'} = \sigma^2 \delta_{jj'} \delta_{\tau\tau'}$. Here, $\tilde{N}$ is the total number of distinct space-time labels of $X_j(\tau)$. $\bar{X}$ is the mean of $X_j(\tau)$.
3. The output signal, $Z_m(t)$, of the $m$th neuron in layer C is assumed to be

$$Z_m(t) = \sum_{i=1}^{N} \sum_{\tau = t-\delta}^{t} \mathcal{H}_i(\tau)\, Y_i(\tau) \qquad (2.3)$$

where the transfer function $\mathcal{H}_i(\tau)$ satisfies the constraints (1) $\mathcal{H}^T \mathcal{H} = A_0$ in matrix notation, and (2) $\left[\sum_{i=1,\ldots,N;\ \tau=t-\delta,\ldots,t} \mathcal{H}_i(\tau)\right]^2 = A_1$. $A_0$ and $A_1$ are real constants. The first constraint is on the normalization of the transfer function. The second constraint restricts the value of the statistical mean of the transfer functions up to a sign. The resultant restriction on the transfer functions when these two constraints are considered together is a net constraint on the variance of the transfer functions since the variance is directly proportional to $A_0 - A_1$. In the far-past, $\tau < t - \delta$, and the future, $\tau > t$, $\mathcal{H}_i(\tau)$ is set to zero. These transfer functions will be determined by maximizing the information rate in the next section. Here, an additive noise $n_i(\tau)$ added to $Y_i(\tau)$ [i.e., $Y_i(\tau) \to Y_i(\tau) + n_i(\tau)$ in equation (2.3)] is assumed for the information-theoretic study. The noises $n_i(\tau)$ indexed by the location label $i$ and the time label $\tau$ are assumed to be statistically independent and satisfy a gaussian distribution of variance $\beta^2$ with zero mean.

Below we derive the information-theoretic equation that characterizes the behavior of the transfer function $\mathcal{H}_i(\tau)$. The information rate for the signal transferred from layer B to layer C can be shown to be

[equation (2.4), not recoverable]

with $W$ being the spatiotemporal correlation matrix $\langle Y_i(\tau) Y_j(\tau') \rangle$. From equations (2.1) and (2.2), the following expressions for the correlation matrix can be derived.

[equations (2.5) and (2.6), not recoverable; (2.6) defines $Q_{ij}$]

and

$$G_{\tau\tau'} = e^{-b|\tau - \tau'|}\, \theta(\delta - |\tau - \tau'|) \qquad (2.7)$$

Here, $\theta$ denotes the Heaviside step function. To arrive at equation (2.7), terms of order $\exp(-b\delta)$ have been neglected as the time scale assumption $\delta b \gg 1$ is invoked. $h$ is a constant independent of space and time. Its explicit form is irrelevant in subsequent discussions, as will be shown below.
Now, the method of Lagrange multiplier is employed to optimize the information rate. This is equivalent to performing the following variational calculation with respect to the transfer function

[equation (2.8), not recoverable]

Here, $k_2$ is a Lagrange multiplier. It is a measure of the rate of change of the signal with respect to the variation of the value of the overall transfer function. The variational equation above produces the following eigenvalue problem governing the behavior of the transfer function

$$\sum_{j,\tau'} \left[ h\, Q_{ij}\, G_{\tau\tau'} + k_2 \right] \mathcal{H}_j(\tau') = \lambda\, \mathcal{H}_i(\tau) \qquad (2.9)$$

This equation defines the morphology of the receptive field of the neurons in layer C in space and time. For simplicity, one can absorb the multiplicative factor $h$ in $\lambda$ and $k_2$. We assume that this is done by setting $h = 1$.

3 Analytic Expressions of the Transfer Function
In continuous space and time variables, the eigenvalue problem is reduced with the use of the time-translational invariance of the temporal correlation function (equation 2.7) to the homogeneous Fredholm integral equation

$$\lambda\, \mathcal{H}(r, \tau) = \int_{-\infty}^{\infty} dR \int_{-\delta/2}^{\delta/2} d\tau'\, \rho(R) \left[ Q(r - R)\, G(\tau - \tau') + k_2 \right] \mathcal{H}(R, \tau') \qquad (3.1)$$

Here, $Q(r - R)$ and $G(\tau - \tau')$ are the continuous version of $Q_{ij}$ and $G_{\tau\tau'}$ of equations (2.6) and (2.7), respectively. Note that the kernel respects both space- and time-reversal invariances. It means that the kernel itself does not have a preferential direction in time and space. Note also that the information rate also respects the same symmetry operation. However, if it turns out that the solutions $\mathcal{H}(r, \tau)$ do not respect some or all of the symmetries of the information rate, a symmetry-breaking phenomenon is then said to appear. This can have unexpected effects on the input-output relationship in equation (2.3). In particular, the violation of time-reversal invariance underlies temporal differentiation, the ability to detect the temporal changes of the input signal. This will be discussed in the next section.

The solutions to this integral equation can be found by both the Green's function method and the eigenfunction expansion method (Morse and Feshbach 1953). They can be classified into different classes according to their time-reversal and space-reversal characteristics, $\mathcal{H}(r, \tau) = \mathcal{H}(-r, -\tau)$, $\mathcal{H}(r, \tau) = -\mathcal{H}(-r, \tau)$, $\mathcal{H}(r, \tau) = \mathcal{H}(r, -\tau)$, ..., etc. In this study, we are interested only in the solutions that correspond to the two largest eigenvalues (i.e., the larger one of these two has the maximum information rate). They are as follows.
3.1 Symmetric Solution. This solution obeys both space- and time-reversal invariance.

[equation (3.2): the expression for $\mathcal{H}(r, \tau)$ on $-\delta/2 \le \tau \le \delta/2$ is not recoverable]

$\mathcal{H}(r, \tau) = 0$ otherwise, with $A = \delta/2$, $D = 1.244$, $\eta = \lambda/(1.072\, C_\rho \pi \alpha^2)$, and

[equation (3.3), not recoverable]

with $\gamma$ being determined by

[equation (3.4), not recoverable]

In equation (3.2), the prefactor is the normalization constant. The eigenvalue is determined from $\eta$ by

[equation (3.5), not recoverable]
Note that all the space- and time-dependent terms in equation (3.2) are even functions of both space and time. In obtaining these equations, we have used in the eigenfunction expansion the fact that the iteration equation $2\alpha^2/\sigma_{n+1}^2 = 1 - 1/(3 + 2\alpha^2/\sigma_n^2)$ with $\sigma_0^2 = 3\alpha^2$ converges rapidly to $2\alpha^2/\sigma_\infty^2$ so that the approximation $\sigma_1^2 = \sigma_2^2 = \cdots = \sigma_n^2 = \cdots = \sigma_\infty^2 = 2.732\alpha^2$ is accurate with error less than one percent. Details may be found in Tang (1989).
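The convergence claim for this iteration is easy to check numerically. The sketch below iterates $x_{n+1} = 1 - 1/(3 + x_n)$ with $x_n = 2\alpha^2/\sigma_n^2$. The starting value $x_0 = 2/3$ (that is, $\sigma_0^2 = 3\alpha^2$) is a reading of the garbled symbol in this copy and should be treated as an assumption; the function name is illustrative.

```python
import math


def iterate_ratio(x0=2.0 / 3.0, steps=20):
    """Iterate x_{n+1} = 1 - 1/(3 + x_n), where x_n stands for 2*alpha^2/sigma_n^2."""
    history = [x0]
    for _ in range(steps):
        history.append(1.0 - 1.0 / (3.0 + history[-1]))
    return history


history = iterate_ratio()
fixed_point = math.sqrt(3.0) - 1.0         # positive root of x^2 + 2x - 2 = 0
sigma_inf_sq = 2.0 / history[-1]           # sigma_inf^2 in units of alpha^2
sigma_1_sq = 2.0 / history[1]              # value after a single iteration
relative_error = abs(sigma_1_sq - sigma_inf_sq) / sigma_inf_sq
```

The fixed point solves $x^2 + 2x - 2 = 0$, so $x = \sqrt{3} - 1 \approx 0.732$ and $\sigma_\infty^2 = (\sqrt{3} + 1)\alpha^2 \approx 2.732\alpha^2$. Under the assumed starting value, a single iteration already gives $\sigma_1^2 = 2.75\alpha^2$, within one percent of the limit.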
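3.2 Time-Antisymmetric-Space-Symmetric Solution. The transfer function is

$$\mathcal{H}(r, \tau) = h_1\, e^{-r^2/2\sigma_\infty^2} \sin \gamma\tau, \quad -\delta/2 \le \tau \le \delta/2; \qquad \mathcal{H}(r, \tau) = 0, \quad \text{otherwise} \qquad (3.6)$$

The eigenvalue is

$$\lambda = \frac{\lambda_T}{0.268\,(3 + 2\alpha^2/\sigma_\infty^2)} \qquad (3.7)$$

with $\lambda_T$ being determined by

$$\lambda_T = \frac{2b}{b^2 + \gamma^2}$$

in which $\gamma$ satisfies the following self-consistent equation

$$\gamma = -b \tan A\gamma \qquad (3.8)$$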
$h_1$ is the normalization constant.

4 Properties of the Solution to the Eigenvalue Problem

The behavior of the eigenvalue of the symmetric solution (equation 3.5) is shown as the curved line in Figure 1. In contrast, the eigenvalues of the antisymmetric solution (equation 3.7) are independent of the parameter $k_2$ since the integral $\int_{-\delta/2}^{\delta/2} \sin \gamma\tau \, d\tau$ is identically zero. Therefore, the eigenvalue for the antisymmetric case remains constant as $k_2$ varies, as depicted by the horizontal line in Figure 1. The statement that the eigenvalue does not depend on $k_2$ is true for any eigenfunctions that are antisymmetric with respect to either time- or space-reversal operation. The interesting result is that for positive and not too negative $k_2$ values, the eigenvalue of the symmetric solution is the largest. That is, the transfer function producing the maximum information rate respects the symmetry (time- and space-reversal invariances) of the information rate. However, as $k_2$ becomes sufficiently negative, the time-antisymmetric solution has the largest eigenvalue since the maximum symmetric eigenvalue decreases below that of the time-antisymmetric solution. In this regime, the symmetry breaking of the time-reversal invariance occurs. Within the range of arbitrarily large positive value of $k_2$ to the point of the transition of the symmetry breaking of the time-reversal invariance, parity conservation is respected. The transfer functions at the center of the receptive field plotted against time for different values of $k_2$ are shown in Figure 2. Transfer functions located farther from the center show similar behavior except for a decreasing magnitude, as a result of the spatial gaussian decay. It is found that no spatial center-surround morphology corresponding to maximum information rate appears for any given time in the $k_2$ regime we are considering.
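The crossover between symmetric and time-antisymmetric solutions can be illustrated with a purely temporal sketch. It keeps only the time part of the kernel, $e^{-b|\tau-\tau'|} + k_2$ on $[-\delta/2, \delta/2]$, and drops the spatial integral. The $k_2$ values used here are therefore not on the scale of Figure 1, and the function names are illustrative. The parameters $1/b = 5$ msec and $\delta = 100$ msec are taken from the figure captions.

```python
import numpy as np


def antisymmetric_eigenvalue(b=0.2, delta=100.0, iters=200):
    """Solve gamma = -b*tan(A*gamma), A = delta/2, on the branch pi/2 < A*gamma < pi
    by bisection, then return (gamma, 2b/(b^2 + gamma^2))."""
    A = delta / 2.0
    f = lambda g: g + b * np.tan(A * g)
    lo, hi = (np.pi / 2 + 1e-9) / A, np.pi / A   # f(lo) < 0 < f(hi)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if f(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    gamma = 0.5 * (lo + hi)
    return gamma, 2.0 * b / (b ** 2 + gamma ** 2)


def top_mode(k2, b=0.2, delta=100.0, n=400):
    """Largest eigenvalue of the discretized temporal kernel and the time-reversal
    parity of its eigenvector (+1 symmetric, -1 antisymmetric)."""
    dt = delta / n
    t = -delta / 2 + dt * (np.arange(n) + 0.5)           # midpoint grid
    kernel = (np.exp(-b * np.abs(t[:, None] - t[None, :])) + k2) * dt
    values, vectors = np.linalg.eigh(kernel)
    top = vectors[:, -1]
    return values[-1], float(top @ top[::-1])


gamma, lam_T = antisymmetric_eigenvalue()
lam_zero, parity_zero = top_mode(0.0)      # symmetric mode wins at k2 = 0
lam_neg, parity_neg = top_mode(-0.02)      # antisymmetric mode wins here
```

With these parameters the transcendental equation gives $\lambda_T \approx 9.24$, the eigenvalue quoted for curve d in the caption of Figure 2. In this reduced model the constant $k_2$ term cannot affect an antisymmetric eigenvector, whose entries sum to zero, so lowering $k_2$ pulls down only the symmetric eigenvalues until the antisymmetric mode becomes the largest.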
Figure 1: The maximum eigenvalues for the symmetric and time-antisymmetric solutions plotted against the Lagrange multiplier k2. Here, the time constant 1/b is arbitrarily chosen as 5 msec. (Curve labels: time-symmetric solution; time-antisymmetric solution.)
Figure 2: Time-dependent properties of the temporal transfer function, plotted against time (ms). The present time is arbitrarily chosen to be at 50 msec. The time in the far-past is at -50 msec. Curves a, b, and c depict the symmetric transfer functions with eigenvalues 10.2, 9.5, and 9, respectively. Curve d is the time-antisymmetric transfer function with eigenvalue equal to 9.24. The time constant 1/b is 5 msec.
5 Input-Output Relations
From the results of the last section, the output events can naturally be classified into time-symmetric or time-antisymmetric events. The time-symmetric/antisymmetric events are defined as the set of output activities obtained by the linear input-output relationship having a time-reversal invariant/noninvariant transfer function. They are statistically independent of each other, from equations (2.3) and (2.9):

$$\langle Z^{\mathrm{symmetric}}(t)\, Z^{\mathrm{antisymmetric}}(t) \rangle = 0 \tag{5.1}$$
The symmetric transfer function (equation 3.2) does not discriminate the inputs in the near-past $[0, \delta/2]$ from those in the far-past $[-\delta/2, 0]$. The output event duplicates as much of the input signal as is allowed in this three-layer feedforward network operating in an optimal manner within the information-theoretic framework. This is illustrated by curves a, b, and c in Figure 3. In obtaining these figures, the light signal impinging on layer A is assumed to be a stationary light spot modeled by a step function $\theta(t)$, that is, it is off for time $t < 0$ and on for time $t \ge 0$. The curve labeled layer B output is a typical output signal of the response of an RC circuit with a step function input (equation 2.1). These results suggest that the input-output relation with the symmetric transfer function (equation 3.2) in this k2 regime operates in the information-relaying mode. It acts as a passive relay. The speed of the response is determined by the width $\delta$ of the time window: the shorter the width, the faster the speed. The behavior of the input-output relation with the time-antisymmetric transfer function (equation 3.6) is totally different. It is a simple form of temporal differentiation, as illustrated by curve d in Figure 3. A temporally constant input signal produces zero output. Note that the peak of the output is time delayed by $0.5\delta$ compared with the time defining the fastest change in the input function. This temporal differentiation is not identical to the time derivative in calculus, even though both can detect the temporal changes in the input signal and both are time-antisymmetric operations. These transfer functions process the input signals to form the output differently. In the regime of time-reversal invariance, the transfer functions (equation 3.2) simply relay the input to the output without actively processing the inputs (curves a, b, and c in Figure 3).
However, in the regime of time-reversal noninvariance, the transfer functions (equation 3.6) have the capability to extract the temporal changes of the temporal input signals (curve d in Figure 3).

Figure 3: The input-output relations (layer C output plotted against time in ms). The input to layer B is a stationary and local light spot defined by $\theta(t)$. Curves a, b, and c are layer C outputs that correspond to the symmetric transfer functions a, b, and c in Figure 2, respectively. Curve d is the layer C output with the time-antisymmetric transfer function corresponding to curve d in Figure 2. Note that only curve d signals the temporal change in the incoming signal.

6 Conclusions

We have analyzed, from information-theoretic considerations, how a simple synthetic three-layer, feedforward neural network acquires the ability to detect temporal changes. Symmetry breaking in time-reversal invariance has been identified as the source of such an ability in our model. Furthermore, the symmetry classes to which the transfer function belongs define the distinct categories of the temporal events in the output sample space. We summarize below the main results of this paper.

1. The persistence of signals in the network is an important aspect of temporal signal processing. In the present study we have modeled this as a channel with memory (equation 2.1) and it underlies the results obtained.

2. Eigenvectors of the constrained spatial-temporal correlation function are the transfer functions of the three-layer, feedforward neural network model studied here.

3. The symmetries that the eigenvectors respect define the classes of the output events.

4. There are two modes of temporal information processing: one is the information-relaying mode defined by the time-symmetric transfer function (equation 3.2) and the other is the information-analyzing mode defined by the time-antisymmetric transfer function (equation 3.6).

5. Realization of the information-analyzing mode is done by a symmetry-breaking mechanism.

6. The breaking of the time-reversal invariance leads to temporal differentiation.
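The two processing modes summarized in items 4 to 6 can be illustrated with a small numerical sketch. It is not the network model of this paper: the kernel shapes cos(γτ) and sin(γτ) on a window of width δ, the values of δ and γ, the integration step, and all function names are assumptions chosen for illustration. Only the 5 msec time constant is taken from the figure captions.

```python
import math

DT = 0.5                    # integration step in ms (assumed)
DELTA = 20.0                # width of the time window in ms (assumed)
B = 1.0 / 5.0               # 1/b = 5 ms, the time constant quoted in the figure captions
GAMMA = math.pi / DELTA     # kernel frequency (assumed, not the self-consistent value)


def layer_b(t):
    """RC-type response to a unit step that switches on at t = 0."""
    return 0.0 if t < 0 else 1.0 - math.exp(-B * t)


def symmetric(tau):
    """Time-symmetric kernel (assumed form)."""
    return math.cos(GAMMA * tau)


def antisymmetric(tau):
    """Time-antisymmetric kernel (assumed form)."""
    return math.sin(GAMMA * tau)


def output(kernel, t):
    """Midpoint sum of kernel(tau) * layer_b over the window ending at time t.

    tau runs over [-DELTA/2, DELTA/2]; tau = +DELTA/2 is the present.
    """
    steps = int(round(DELTA / DT))
    total = 0.0
    for i in range(steps):
        tau = -DELTA / 2 + (i + 0.5) * DT
        total += kernel(tau) * layer_b(t - DELTA / 2 + tau) * DT
    return total


if __name__ == "__main__":
    for t in (-5.0, 10.0, 200.0):
        print(t, output(symmetric, t), output(antisymmetric, t))
```

Under these assumptions the symmetric kernel settles to a constant nonzero output once the input is constant (relaying), while the antisymmetric kernel responds only while the input is changing and returns to zero afterward (differentiation).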
References

Linsker, R. 1986. From basic network principles to neural architecture: Emergence of spatial-opponent cells. Proc. Natl. Acad. Sci. U.S.A. 83, 7508.
Linsker, R. 1988. Self-organization in a perceptual network. Computer 21(3), 105.
Linsker, R. 1989. In Advances in Neural Information Processing Systems 1, D. S. Touretzky, ed., p. 186. Morgan Kaufmann, San Mateo, CA.
Morse, P. M., and Feshbach, H. 1953. Methods of Theoretical Physics, Vol. 1. McGraw-Hill, New York.
Shannon, C. E., and Weaver, W. 1949. The Mathematical Theory of Communication. Univ. of Illinois Press, Urbana.
Tang, D. S. 1989. Information-theoretic solutions to early visual information processing: Analytic results. Phys. Rev. A 40, 6626.

Received 16 October 1989; accepted 23 February 1990
Communicated by Ralph Linsker
Analysis of Linsker’s Simulations of Hebbian Rules

David J. C. MacKay Computation and Neural Systems, Caltech 164-30 CNS, Pasadena, CA 91125 USA
Kenneth D. Miller Department of Physiology, University of California, San Francisco, CA 94143-0444 USA
Linsker has reported the development of center-surround receptive fields and oriented receptive fields in simulations of a Hebb-type equation in a linear network. The dynamics of the learning rule are analyzed in terms of the eigenvectors of the covariance matrix of cell activities. Analytic and computational results for Linsker’s covariance matrices, and some general theorems, lead to an explanation of the emergence of center-surround and certain oriented structures. We estimate criteria for the parameter regime in which center-surround structures emerge.
Linsker (1986, 1988) has studied by simulation the evolution of weight vectors under a Hebb-type teacherless learning rule in a feedforward linear network. The equation for the evolution of the weight vector w of a single neuron, derived by ensemble averaging the Hebbian rule over the statistics of the input patterns, is¹

$$\dot{w}_i = k_1 + \sum_j (Q_{ij} + k_2)\, w_j \quad \text{subject to } -w_{\max} \le w_i \le w_{\max} \tag{1.1}$$

where Q is the covariance matrix of activities of the inputs to the neuron. The covariance matrix depends on the covariance function, which describes the dependence of the covariance of two input cells' activities on their separation in the input field, and on the location of the synapses, which is determined by a synaptic density function. Linsker used a gaussian synaptic density function. Similar equations have been developed and studied by others (Miller et al. 1986, 1989). Depending on the covariance function and the two parameters k1 and k2, different weight structures emerge. Using a gaussian covariance function (his layer B → C), Linsker reported the emergence of nontrivial weight structures, ranging from saturated structures through center-surround structures to bilobed-oriented structures. The analysis in this paper examines the properties of equation 1.1. We concentrate on the gaussian covariances in Linsker's layer B → C. We give an explanation of the structures reported by Linsker and discuss criteria for the emergence of center-surround weight structures. Several of the results are more general, applying to any covariance matrix Q. Space constrains us to postpone general discussion, technical details, and discussion of other model networks to a future publication (MacKay and Miller 1990).

¹Our definition of equation 1.1 differs from Linsker’s by the omission of a factor of 1/N before the sum term, where N is the number of synapses. Also, Linsker allowed more general hard limits, $n_E - 1 \le w_i \le n_E$, $0 < n_E < 1$, which he implemented either directly or by allowing a fraction $n_E$ of synapses to be excitatory ($0 \le w_i \le 1$) and the remaining fraction $1 - n_E$ to be inhibitory ($-1 \le w_i \le 0$). These two formulations are essentially mathematically equivalent; this equivalence depends on the fact that the spatial distributions of inputs and correlations in activity among inputs were taken to be independent of whether the inputs were excitatory or inhibitory. Linsker summarized results for $0.35 \le n_E \le 0.65$ for his layer B → C, but did not report any dependence of results on $n_E$ within this range and focused discussion on $n_E = 0.5$. At higher layers only $n_E = 0.5$ was discussed. Equation 1.1 is equivalent to $n_E = 0.5$. Our analysis does not depend critically on this choice; what is critical is that the origin be well within the interior of the hypercube of allowed synaptic weights, so that initial development is linear.

Neural Computation 2, 173-187 (1990) © 1990 Massachusetts Institute of Technology
2 Analysis in Terms of Eigenvectors
We write equation 1.1 as a first-order differential equation for the weight vector w:
$$\dot{w} = (Q + k_2 J)\, w + k_1 n \quad \text{subject to } -w_{\max} \le w_i \le w_{\max} \tag{2.1}$$
where J is the matrix $J_{ij} = 1\ \forall i, j$, and n is the DC vector $n_i = 1\ \forall i$. This equation is linear, up to the hard limits on $w_i$. These hard limits define a hypercube in weight space within which the dynamics are confined. We make the following assumption:

Assumption 1. The principal features of the dynamics are established before the hard limits are reached. When the hypercube is reached, it captures and preserves the existing weight structure with little subsequent change.

The matrix Q + k2J is symmetric, so it has a complete orthonormal set of eigenvectors² $e^{(a)}$ with real eigenvalues $\lambda_a$. The linear dynamics within the hypercube can be characterized in terms of these eigenvectors, each of which represents an independently evolving weight configuration. First, equation 2.1 has a fixed point at

$$w^{FP} = -k_1 (Q + k_2 J)^{-1} n = -k_1 \sum_a \frac{n \cdot e^{(a)}}{\lambda_a}\, e^{(a)} \tag{2.2}$$

Second, relative to the fixed point, the component of w in the direction of an eigenvector grows or decays exponentially at a rate proportional to the corresponding eigenvalue. Writing $w(t) = \sum_a w_a(t)\, e^{(a)}$, equation 2.1 yields

$$w_a(t) - w_a^{FP} = [w_a(0) - w_a^{FP}]\, e^{\lambda_a t} \tag{2.3}$$

²The indices a and b will be used to denote the eigenvector basis for w, while the indices i and j will be used for the synaptic basis.
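As a check on equation 2.3, the following sketch integrates equation 2.1 inside the hypercube (hard limits ignored) and compares the result with the eigenmode solution. The two-synapse covariance matrix Q = [[1, c], [c, 1]], the parameter values, and the function names are assumptions made for illustration; this toy system is chosen only because its eigenvectors, (1, 1)/√2 and (1, -1)/√2, are known in closed form.

```python
import math


def matvec(m, v):
    """Matrix-vector product for plain Python lists."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in m]


def simulate(q, k1, k2, w0, t_end, dt=1e-3):
    """Euler integration of dw/dt = (Q + k2 J) w + k1 n, hard limits ignored."""
    n = len(w0)
    a = [[q[i][j] + k2 for j in range(n)] for i in range(n)]
    w = list(w0)
    for _ in range(int(round(t_end / dt))):
        dw = matvec(a, w)
        w = [wi + dt * (dwi + k1) for wi, dwi in zip(w, dw)]
    return w


def closed_form(c, k1, k2, w0, t):
    """Equation 2.3 for the assumed two-synapse Q = [[1, c], [c, 1]]."""
    s = math.sqrt(2.0)
    modes = [((1 / s, 1 / s), 1 + c + 2 * k2),   # DC mode, eigenvalue depends on k2
             ((1 / s, -1 / s), 1 - c)]           # AC mode, eigenvalue independent of k2
    w = [0.0, 0.0]
    for e, lam in modes:
        n_dot_e = e[0] + e[1]
        w_fp = -k1 * n_dot_e / lam               # fixed-point component along this mode
        w_0 = e[0] * w0[0] + e[1] * w0[1]        # initial component along this mode
        w_t = w_fp + (w_0 - w_fp) * math.exp(lam * t)
        w = [w[0] + w_t * e[0], w[1] + w_t * e[1]]
    return w


if __name__ == "__main__":
    q = [[1.0, 0.5], [0.5, 1.0]]
    print(simulate(q, 0.9, -3.0, [0.05, -0.02], 1.0))
    print(closed_form(0.5, 0.9, -3.0, [0.05, -0.02], 1.0))
```

With k2 = -3 the DC mode has a negative eigenvalue and relaxes to its fixed-point value, while the zero-DC mode grows exponentially from its initial value; the two printed vectors should agree up to the Euler discretization error.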
Thus, the principal emergent features of the dynamics are determined by the following three factors:

1. The principal eigenvectors of Q + k2J, that is, the eigenvectors with largest positive eigenvalues. These are the fastest growing weight configurations.

2. Eigenvectors of Q + k2J with negative eigenvalue. Each is associated with an attracting constraint surface, the hyperplane defined by $w_a = w_a^{FP}$.

3. The location of the fixed point of equation 1.1. This is important for two reasons: (a) it determines the location of the constraint surfaces and (b) the fixed point gives a "head start" to the growth rate of eigenvectors $e^{(a)}$ for which $|w_a^{FP}|$ is large compared to $|w_a(0)|$ (see Fig. 3).

3 Eigenvectors of Q
We first examine the eigenvectors and eigenvalues of Q. The principal eigenvector of Q dominates the dynamics of equation 2.1 for k1 = 0, k2 = 0. The subsequent eigenvectors of Q become important as k1 and k2 are varied. Some numerical results on the spectrum of Q have appeared in Linsker (1987, 1990) and Miller (1990). Analyses of the spectrum when output cells are laterally interconnected appear in Miller et al. (1986, 1989).

3.1 Properties of Circularly Symmetric Systems. If an operator commutes with the rotation operator, its eigenfunctions can be written as eigenfunctions of the rotation operator. For Linsker's system, in the continuum limit, the operator Q + k2J is unchanged under rotation of the system. So the eigenfunctions of Q + k2J can be written as the product of a radial function and one of the angular functions $\cos l\theta$, $\sin l\theta$, $l = 0, 1, 2, \ldots$. To describe these eigenfunctions we borrow from quantum mechanics the notation $n = 1, 2, 3, \ldots$ and $l = s, p, d, \ldots$ to denote the function's total number of nodes $= 0, 1, 2, \ldots$ and number of angular nodes $= 0, 1, 2, \ldots$, respectively. For example, "2s" and "2p" both denote eigenfunctions with one node, which is radial in 2s and angular in 2p (see Fig. 1). For monotonic and nonnegative covariance functions, we conjecture that the leading eigenfunctions of Q are ordered in eigenvalue by their numbers of nodes such that the eigenfunction [nl] has larger eigenvalue than both [(n+1)l] and [n(l+1)]. This conjecture is obeyed in analytical and numerical results we have obtained for Linsker's and similar systems. The general validity of this conjecture is under investigation.

Table 1: The First Three Eigenfunctions of the Operator Q(r, r').

Name    Eigenfunction                        λ/N
1s      $e^{-r^2/2R}$                        $lC/A$
2p      $r \cos\theta\, e^{-r^2/2R}$         $l^2 C/A$
2s      $(1 - r^2/r_0^2)\, e^{-r^2/2R}$      $l^3 C/A$

$Q(\mathbf{r}, \mathbf{r}') = e^{-|\mathbf{r} - \mathbf{r}'|^2/2C}\, e^{-r'^2/2A}$, where C and A denote the characteristic sizes of the covariance function and synaptic density function. $\mathbf{r}$ denotes two-dimensional spatial position relative to the center of the synaptic arbor, and $r = |\mathbf{r}|$. The eigenvalues λ are normalized by the effective number of synapses $N = 2\pi A$.
3.2 Analytic Calculations for k2 = 0. We have solved analytically for the first three eigenfunctions and eigenvalues of the covariance matrix for layer B → C of Linsker's network, in the continuum limit (Table 1). 1s, the function with no changes of sign, is the principal eigenfunction of Q; 2p, the bilobed-oriented function, is the second eigenfunction; and 2s, the center-surround eigenfunction, is third.³ Figure 1a shows the first six eigenfunctions for layer B → C of Linsker (1986).

³2s is degenerate with 3d at k2 = 0.
Figure 1: Eigenfunctions of the operator Q + k2J. In each row the eigenfunctions have the same eigenvalue, with largest eigenvalue at the top. Eigenvalues (in arbitrary units): (a) k2 = 0: 1s, 2.26; 2p, 1.0; 2s and 3d, 0.41. (b) k2 = -3: 2p, 1.0; 2s, 0.66; 1s, -17.8. The gray scale indicates the range from maximum negative to maximum positive synaptic weight within each eigenfunction. Eigenfunctions of the operator $(e^{-|\mathbf{r} - \mathbf{r}'|^2/2C} + k_2)\, e^{-r'^2/2A}$ were computed for C/A = 2/3 (as used by Linsker for most layer B → C simulations) on a circle of radius 12.5 grid intervals, with $\sqrt{A}$ = 6.15 grid intervals.
4 The Effects of the Parameters k1 and k2
Varying k2 changes the eigenvectors and eigenvalues of the matrix Q + k2J. Varying k1 moves the fixed point of the dynamics with respect to the origin. We now analyze these two changes, and their effects on the dynamics.

Definition. Let $\hat{n}$ be the unit vector in the direction of the DC vector n. We refer to $(w \cdot \hat{n})$ as the DC component of w. The DC component is proportional to the sum of the synaptic strengths in a weight vector. For example, 2p and all the other eigenfunctions with angular nodes have zero DC component. Only the s-modes have a nonzero DC component.

4.1 General Theorem: The Effect of k2. We now characterize the effect of adding k2J to any covariance matrix Q.
Theorem 1. For any covariance matrix Q, the spectrum of eigenvectors and eigenvalues of Q + k2J obeys the following:

1. Eigenvectors of Q with no DC component, and their eigenvalues, are unaffected by k2.

2. The other eigenvectors, with nonzero DC component, vary with k2. Their eigenvalues increase continuously and monotonically with k2 between asymptotic limits such that the upper limit of one eigenvalue is the lower limit of the eigenvalue above.

3. There is at most one negative eigenvalue.

4. All but one of the eigenvalues remain finite. In the limits $k_2 \to \pm\infty$ there is a DC eigenvector $\hat{n}$ with eigenvalue $\to k_2 N$, where N is the dimensionality of Q, that is, the number of synapses.
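Statements 1 and 4 of the theorem can be checked numerically in a special case. The sketch below assumes a ring of synapses with a gaussian covariance in the distance around the ring; this geometry, the parameter values, and the function names are illustrative assumptions and are not Linsker's two-dimensional system. On a ring the cosine mode has zero DC component and the DC vector is itself an eigenvector, so both statements reduce to simple matrix-vector checks. Statements 2 and 3 are not tested here.

```python
import math


def ring_covariance(n, c):
    """Q_ij = exp(-d^2 / 2C), d being the shortest distance around a ring of n synapses."""
    q = []
    for i in range(n):
        row = []
        for j in range(n):
            d = min(abs(i - j), n - abs(i - j))
            row.append(math.exp(-d * d / (2.0 * c)))
        q.append(row)
    return q


def apply_q_plus_k2j(q, k2, v):
    """(Q + k2 J) v, using J v = (sum of the entries of v) * n."""
    s = sum(v)
    return [sum(q_ij * v_j for q_ij, v_j in zip(row, v)) + k2 * s for row in q]


def rayleigh(q, k2, v):
    """v.(Q + k2 J)v / v.v, which equals the eigenvalue when v is an eigenvector."""
    av = apply_q_plus_k2j(q, k2, v)
    return sum(a * b for a, b in zip(av, v)) / sum(b * b for b in v)


def residual(q, k2, v):
    """Largest entry of (Q + k2 J)v - lambda v; zero for an exact eigenvector."""
    lam = rayleigh(q, k2, v)
    av = apply_q_plus_k2j(q, k2, v)
    return max(abs(a - lam * b) for a, b in zip(av, v))


if __name__ == "__main__":
    n = 16
    q = ring_covariance(n, 4.0)
    ac = [math.cos(2 * math.pi * i / n) for i in range(n)]   # zero DC component
    dc = [1.0] * n                                           # the DC vector n
    for k2 in (0.0, -3.0):
        print(k2, rayleigh(q, k2, ac), rayleigh(q, k2, dc))
```

The eigenvalue of the zero-DC mode is the same for both values of k2, while the DC eigenvalue shifts by exactly k2 N.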
The properties stated in this theorem, whose proof is in MacKay and Miller (1990), are summarized pictorially by the spectral structure shown in Figure 2.

4.2 Implications for Linsker's System. For Linsker's circularly symmetric systems, all the eigenfunctions with angular nodes have zero DC component and are thus independent of k2. The eigenvalues that vary with k2 are those of the s-modes. The leading s-modes at k2 = 0 are 1s, 2s; as k2 is decreased to $-\infty$, these modes transform continuously into 2s, 3s respectively (Fig. 2).⁴ 1s becomes an eigenvector with negative eigenvalue, and it approaches the DC vector $\hat{n}$. This eigenvector enforces a constraint $w \cdot \hat{n} = w^{FP} \cdot \hat{n}$, and thus determines that the final average synaptic strength is equal to $w^{FP} \cdot n / N$.

⁴The 2s eigenfunctions at k2 = 0 and $k_2 = -\infty$ both have one radial node, but are not identical functions.
Figure 2: General spectrum of eigenvalues of Q + k2J as a function of k2. A: Eigenvectors with DC component. B: Eigenvectors with zero DC component. C: Adjacent DC eigenvalues share a common asymptote. D: There is only one negative eigenvalue. The annotations in parentheses refer to the eigenvectors of Linsker’s system.

Linsker (1986) used k2 = -3. This value of k2 is sufficiently large in magnitude that the properties of the $k_2 \to -\infty$ limit hold (MacKay and Miller 1990), and in the following we concentrate interchangeably on k2 = -3 and $k_2 \to -\infty$. The computed eigenfunctions for Linsker’s system at layer B → C are shown in Figure 1b for k2 = -3. The principal eigenfunction is 2p. The center-surround eigenfunction 2s is the principal symmetric eigenfunction, but it still has smaller eigenvalue than 2p.

4.3 Effect of k1. Varying k1 changes the location of the fixed point of equation 2.1. From equation 2.2, the fixed point is displaced from the origin only in the direction of eigenvectors that have nonzero DC component, that is, only in the direction of the s-modes. This has two important effects, as discussed in Section 2: (1) The s-modes are given a head start in growth rate that increases as k1 is increased. In particular, the principal s-mode, the center-surround eigenvector 2s, may outgrow the principal eigenvector 2p. (2) The constraint surface is moved when k1 is changed. For large negative k2, the constraint surface fixes the average synaptic strength in the final weight vector. To leading order in 1/k2, Linsker showed that the constraint is $\sum_j w_j = k_1/|k_2|$.⁵

4.4 Summary of the Effects of k1 and k2. We can now anticipate the explanation for the emergence of center-surround cells: For k1 = 0, k2 = 0, the dynamics are dominated by 1s. The center-surround eigenfunction 2s is third in line behind 2p, the bilobed function. Making k2 large and negative removes 1s from the lead. 2p becomes the principal eigenfunction and dominates the dynamics for $k_1 \approx 0$, so that the circular symmetry is broken. Finally, increasing $k_1/|k_2|$ gives a head start to the center-surround function 2s. Increasing $k_1/|k_2|$ also increases the final average synaptic strength, so large $k_1/|k_2|$ also produces a large DC bias. The center-surround regime therefore lies sandwiched between a 2p-dominated regime and an all-excitatory regime. $k_1/|k_2|$ has to be large enough that 2s dominates over 2p, and small enough that the DC bias does not obscure the center-surround structure. We now estimate this parameter regime.
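The DC constraint on the summed synaptic strength can be made concrete in a special case. The sketch below assumes a ring of synapses with gaussian covariance, so that all row sums of Q are equal and the fixed point of equation 2.1 is uniform; the ring geometry, parameter values, and function names are illustrative assumptions. In this case the sum of the weights at the fixed point equals k1 divided by |k2 + (average covariance)| exactly, which reduces to k1/|k2| when |k2| is much larger than the average covariance.

```python
import math


def ring_covariance(n, c):
    """Q_ij = exp(-d^2 / 2C), d being the shortest distance around a ring of n synapses."""
    q = []
    for i in range(n):
        row = []
        for j in range(n):
            d = min(abs(i - j), n - abs(i - j))
            row.append(math.exp(-d * d / (2.0 * c)))
        q.append(row)
    return q


def dc_fixed_point(q, k1, k2):
    """Uniform fixed point of dw/dt = (Q + k2 J) w + k1 n when all row sums of Q are equal.

    Returns the weight vector and the average covariance q_bar.
    """
    n = len(q)
    q_bar = sum(sum(row) for row in q) / (n * n)
    total = -k1 / (k2 + q_bar)        # sum over j of w_j at the fixed point
    return [total / n] * n, q_bar


def fixed_point_residual(q, k1, k2, w):
    """Largest entry of (Q + k2 J) w + k1 n; zero at an exact fixed point."""
    s = sum(w)
    return max(abs(sum(q_ij * w_j for q_ij, w_j in zip(row, w)) + k2 * s + k1)
               for row in q)


if __name__ == "__main__":
    q = ring_covariance(16, 4.0)
    k1, k2 = 0.9, -3.0
    w, q_bar = dc_fixed_point(q, k1, k2)
    print("sum of weights:", sum(w))
    print("leading-order estimate k1/|k2|:", k1 / abs(k2))
    print("with average covariance:", k1 / abs(k2 + q_bar))
```

Because the average covariance is positive, the leading-order estimate k1/|k2| slightly underestimates the summed weight for negative k2.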
5 Criteria for the Center-Surround Regime
We use two approaches to determine the DC bias at which 2s and 2p are equally favored. This DC bias gives an estimate for the boundary between the regimes dominated by 2s and 2p.

1. Energy Criterion: We first estimate the level of DC bias at which the weight vector composed of (2s plus DC bias) and the weight vector composed of (2p plus DC bias) are energetically equally favored. This gives an estimate of the level of DC bias above which 2s will dominate under simulated annealing, which explores the entire space of possible weight configurations.

2. Time Development Criterion: Second, we estimate the level of DC bias above which 2s will dominate under simulations of time development of equation 1.1. We estimate the relationship between the parameters such that, starting from a typical random distribution of initial weights, the 2s mode reaches the saturating hypercube at the same time as the 2p mode.

⁵To next order, this expression becomes $\sum_j w_j = k_1/|k_2 + \bar{q}|$, where $\bar{q} = \langle Q_{ij} \rangle$, the average covariance (averaged over i and j). The additional term largely resolves the discrepancy between Linsker’s g and $k_1/|k_2|$ in Linsker (1986).
Both criteria will depend on an estimate of the complex effect of the weight limits $-w_{\max} \le w_i \le w_{\max}$. (Without this hypercube of saturation constraints, 2p will always dominate the dynamics of equation 1.1 after a sufficiently long time.) We introduce $g = k_1/(|k_2| N w_{\max})$ as a measure of the average synaptic strength induced by the DC constraint, such that g = 1 means all synapses equal $w_{\max}$.⁶ Noting that a vector of amplitude $\sqrt{N} w_{\max}$ has rms synaptic strength $w_{\max}$, we make the following estimate of the constraint imposed by the hypercube (discussed further in MacKay and Miller 1990):

Assumption 2. When the DC level is constrained to be g, the component h(g) in the direction of a typical unit AC vector at which the hypercube constraint is "reached" is $h(g) = \sqrt{N} w_{\max} (1 - g)$.

Assumptions 1 and 2 may not adequately characterize the effects of the hypercube on the dynamics, so the numerical estimates of the precise locations of the boundaries between the regions may be in error. However, the qualitative picture presented by these boundaries is informative.

5.1 Energy Criterion. Linsker suggested analysis of equation 1.1 in terms of the energy function on which the dynamics perform constrained gradient descent. The energy of a configuration $w = \sum_a w_a e^{(a)}$ is

$$E(w) = -\sum_a \left( \tfrac{1}{2} \lambda_a w_a^2 + k_1 n_a w_a \right) \tag{5.1}$$
where $n_a$ is the DC component of eigenvector $e^{(a)}$. We consider two configurations, one with $w_{2p}$ equal to its maximum value h(g) and $w_{2s} = 0$, and one with $w_{2p} = 0$ and $w_{2s} = \mathrm{sign}(n_{2s})\, h(g)$. The component $w_{1s}$ is the same in both cases. All the other components are assumed to be small and to contribute no bias in energy between the two configurations. The energies of these configurations will be our estimates of the energies of saturated configurations obtained by saturating 2p and 2s, respectively, subject to the constraints. We compare these two energies and find the DC level $g = g_E$ at which they are equal:⁷

For Linsker's layer B → C connections, our estimate of $g_E$ is 0.16.

5.2 Time Development Criterion. The energy criterion does not take into account the initial conditions from which equation 1.1 starts. We now derive a second criterion that attempts to do this.

⁶This is equal to twice Linsker's g.
⁷λ/N is written as a single entity because $\lambda \propto N$. Also $n_{2s} \sim 1/k_2$, so $n_{2s} k_2$ tends to a constant as $k_2 \to \infty$.
Figure 3: Schematic diagram illustrating the criteria for 2s to dominate. The polygon of size h(g) represents the hypercube. Energy criterion: The points marked $E_{2p}$ and $E_{2s}$ show the locations at which the energy estimates were made. Time development criterion: The gray cloud surrounding the origin represents the distribution of initial weight vectors. If $w_{2p}(0)$ is sufficiently small compared to $w_{2s}^{FP}$ and if the hypercube is sufficiently close, then the weight vector reaches the hypercube in the direction of 2s before $w_{2p}$ has grown appreciably.
If the initial random component in the direction of 2p, $w_{2p}(0)$, is sufficiently small compared to $w_{2s}^{FP}$, which provides 2s with a head start, then $w_{2p}$ may never start growing appreciably before the growth of $w_{2s}$ saturates (Fig. 3). The initial component $w_{2p}(0)$ is a random quantity whose typical magnitude can be estimated statistically from the weight initialization parameters. $w_{2p}(0)_{\mathrm{rms}}$ scales as $1/\sqrt{N}$ relative to the nonrandom quantity $w_{2s}^{FP}$. Hence the initial relative magnitude of $w_{2p}$ can be made arbitrarily small by increasing N, and the emergence of center-surround structures may be achieved at any g by using an N sufficiently large to suppress the initial symmetry breaking fluctuations. We estimate the boundary between the regimes dominated by 2s and 2p by finding the choice of parameters such that $w_{2p}(t)$ and $w_{2s}(t)$ reach the hypercube at the same time. We evaluate the time $t_{2s}$ at which $w_{2s}$ reaches the hypercube.⁸ Our estimate of the typical starting component for 2p is $w_{2p}(0)_{\mathrm{rms}} = \sigma(g)\, w_{\max}$, where $\sigma(g)$ is a dimensionless standard deviation derived in MacKay and Miller (1990). We set $w_{2p}(t_{2s}) = h(g)$, and solve for N*, the number of synapses above which $w_{2s}$ reaches the hypercube before $w_{2p}$, in terms of g:
5.3 Discussion of the Two Criteria. Figure 4 shows $g_E$ and N*(g). The two criteria give different boundaries. In regime A, 2p is estimated both to emerge under equation 1.1 and to be energetically favored. Similarly, in regime C, 2s is estimated to dominate equation 1.1 and to be energetically favored. In regime D, the initial fluctuations are so big that although 2s is energetically favored, symmetry breaking structures can dominate equation 1.1.⁹ Lastly, in regime B, although 2p is energetically favored, 2s will reach saturation first because N is sufficiently large that the symmetry breaking fluctuations are suppressed. Whether this saturated 2s structure will be stable, or whether it might gradually destabilize into a 2p-like structure, is not predicted by our analysis.¹⁰ The possible difference between simulated annealing and equation 1.1 makes it clear that if initial conditions are important (regimes B and D), the use of simulated annealing on the energy function as a quick way of finding the outcome of equation 1.1 may give erroneous results. Figure 4 also shows the areas in the parameter space in which Linsker made the simulations he reported. The agreement between experiment and our estimated boundaries is reasonable.

⁸We set $w_{2s}(0) = 0$, neglecting its fluctuations, which for large N are negligible compared with $w_{2s}^{FP}$.
⁹If the initial component of 2s is toward the fixed point, the 2s component must first shrink to zero before it can then grow in the opposite direction. Thus, large fluctuations may either hinder or help 2s, while they always help 2p.
¹⁰In a one-dimensional model system we have found that both cases may be obtained, depending sensitively on the parameters.
Figure 4: Boundaries estimated by the two criteria for C/A = 2/3 (N plotted against g). To the left of the line labeled $g_E$, the energy criterion predicts that 2p is favored; to the right, 2s is favored. Above and below the line N*(g), the time development criterion estimates that 2s and 2p, respectively, will dominate equation 1.1. The regions X, Y mark the regimes studied by Linsker: (X) N = 300-600, g = 0.3-0.6: the region in which Linsker reported robust center-surround; (Y) N = 300-600, g …

Eric J. Hartman, James D. Keeler, and Jacek M. Kowalski

… 2, with singular matrix of the related quadratic form. The correlation matrix does not exist in this case and gaussian functions considered in Moody and Darken (1989a,b), Moody (1989), Lee and Kil (1988), and Niranjan and Fallside (1988) are not, in general, semiaffine. In this note we point out that versions of the Stone-Weierstrass theorem [a basic tool in Hornik et al. (1989) and Stinchcombe and White (1989)] apply even more directly to gaussian radial basis hidden units. In particular, linear combinations of gaussians over a compact, convex set form an algebra of maps (see Section 2). The "universal approximation" theorem immediately follows, for simple, single hidden-layer neural nets, as used in Moody and Darken (1989a,b). This puts on a firmer basis the use of gaussian hidden units, previously motivated by computational convenience and supported by some biological evidence (see Section 3). In Section 3 we also address some related questions [i.e., recently proposed generalized units (Durbin and Rumelhart 1989)] and list some open problems.
2 Gaussian Hidden Units as Universal Approximators
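Let $x_\alpha$ and $x_\beta$ be two given points in $R^n$. Consider a "weighted distance" map $h : R^n \to R$ given by

$$h(x) = a_\alpha \|x - x_\alpha\|^2 + a_\beta \|x - x_\beta\|^2 \tag{2.1}$$

where $a_\alpha$ and $a_\beta$ are real positive constants. Straightforward algebra gives

$$h(x) = (a_\alpha + a_\beta)\, \|x - x_\gamma\|^2 + a_\alpha a_\beta (a_\alpha + a_\beta)^{-1} \|x_\alpha - x_\beta\|^2 \tag{2.2}$$

where $x_\gamma$ is a convex linear combination of $x_\alpha$ and $x_\beta$:

$$x_\gamma = c_1 x_\alpha + c_2 x_\beta, \quad c_1 = a_\alpha (a_\alpha + a_\beta)^{-1}, \quad c_2 = a_\beta (a_\alpha + a_\beta)^{-1}, \quad c_1 + c_2 = 1 \tag{2.3}$$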
As a next step, consider a two-parameter family $\mathcal{F}$ of restricted gaussians $f_{a_\alpha, x_\alpha} : K \to R$,

$$f_{a_\alpha, x_\alpha}(x) = \exp(-a_\alpha \|x - x_\alpha\|^2), \quad a_\alpha > 0,\ x_\alpha \in K,\ x \in K \tag{2.4}$$

where K is any convex compact subset of $R^n$. Let $\mathcal{L}$ be the set of all finite linear combinations with real coefficients of elements from $\mathcal{F}$. Multiplying two elements from $\mathcal{L}$ one obtains linear combinations of products of gaussians. Such linear combinations are still elements of $\mathcal{L}$, as

$$f_{a_\alpha, x_\alpha}(x)\, f_{a_\beta, x_\beta}(x) = D_{\alpha\beta}\, f_{a_\alpha + a_\beta,\, x_\gamma}(x) \tag{2.5}$$

where $D_{\alpha\beta}$ is a positive constant given by

$$D_{\alpha\beta} = \exp[-a_\alpha a_\beta (a_\alpha + a_\beta)^{-1} \|x_\alpha - x_\beta\|^2] \tag{2.6}$$

Equation (2.5) merely represents the well-known fact that a product of two gaussians is a gaussian (see, e.g., Shavitt 1963). For our purposes it is important, however, that the center of a "new" gaussian is still in K if K is convex. Thus $\mathcal{L}$ is an algebra of gaussians on K, as $\mathcal{L}$ is closed with respect to multiplication:

$$f, g \in \mathcal{L} \Rightarrow fg \in \mathcal{L} \tag{2.7}$$
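The product rule of equations 2.5 and 2.6 is easy to check numerically. The sketch below is illustrative only: the function names and the sample widths, centers, and test points are assumptions. It computes the constant, width, and center of the product gaussian and compares both sides of equation 2.5 pointwise.

```python
import math


def sqdist(x, y):
    """Squared Euclidean distance between two points given as lists."""
    return sum((u - v) ** 2 for u, v in zip(x, y))


def gaussian(a, center, x):
    """Restricted gaussian of equation 2.4: exp(-a * |x - center|^2)."""
    return math.exp(-a * sqdist(x, center))


def product_params(a1, x1, a2, x2):
    """Return (D, a, center) such that f_{a1,x1} * f_{a2,x2} = D * f_{a,center}."""
    a = a1 + a2
    center = [(a1 * u + a2 * v) / a for u, v in zip(x1, x2)]   # convex combination
    d = math.exp(-a1 * a2 / a * sqdist(x1, x2))
    return d, a, center


if __name__ == "__main__":
    a1, x1 = 0.7, [0.0, 1.0]
    a2, x2 = 1.9, [2.0, -1.0]
    d, a, center = product_params(a1, x1, a2, x2)
    for x in ([0.0, 0.0], [1.0, 0.5], [-0.3, 2.0]):
        lhs = gaussian(a1, x1, x) * gaussian(a2, x2, x)
        rhs = d * gaussian(a, center, x)
        print(x, lhs, rhs)
```

The new center is a convex combination of the two original centers, which is why it stays inside K whenever K is convex.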
Neural Networks with Gaussian Units
It is trivial to observe that the algebra $\mathcal{L}$ separates (distinguishes) points of K, that is, that for arbitrary $x_1, x_2 \in K$, $x_1 \ne x_2$, there is a function $f_{a_\alpha, x_\alpha}$ in $\mathcal{L}$ such that $f_{a_\alpha, x_\alpha}(x_1) \ne f_{a_\alpha, x_\alpha}(x_2)$. (Pick, e.g., $x_\alpha = x_1$ and use monotonicity of exponentials.) Moreover, $\mathcal{L}$ does not vanish at any point of K (i.e., for every point $x \in K$ there is obviously a function in $\mathcal{L}$ different from zero there). The above observations allow an immediate use of Stone's theorem as formulated, e.g., in Rudin (1958). For the reader's convenience we quote this theorem.
Theorem 1 (Stone). Let $\mathcal{L}$ be an algebra of real continuous functions on a compact set K, and suppose $\mathcal{L}$ separates points of K and does not vanish at any point of K. Then $\bar{\mathcal{L}}$, the uniform closure of $\mathcal{L}$, contains all real-valued, continuous functions on K.

$\bar{\mathcal{L}}$, the uniform closure of $\mathcal{L}$, is the set of all functions obtained as limits of uniformly convergent sequences of elements from $\mathcal{L}$. Obviously, $\mathcal{L}$ is a subalgebra of the algebra C(K) of all continuous functions on K with the supremum norm. It follows that $\mathcal{L}$ is dense in C(K). Thus we arrive at the following corollary:

Corollary. Let $\mathcal{L}$ be the set of all finite linear combinations with real coefficients of elements from $\mathcal{F}$, the set of gaussian radial basis functions defined on the convex compact subset K of $R^n$ (see equation 2.4). Then any function in C(K) can be uniformly approximated to an arbitrary accuracy by elements of $\mathcal{L}$.

Another version of the Stone theorem (see, e.g., Lorentz 1986) is directly related to more complicated hidden units. Let G = {g} be a family of continuous real-valued functions on a compact set $K \subset R^n$. Consider generalized polynomials in terms of the g-functions:

$$\sum a_{n_1 \ldots n_k}\, g_1^{n_1} \cdots g_k^{n_k} \tag{2.8}$$

where $a_{n_1 \ldots n_k}$ are real coefficients and $n_i$ are nonnegative integers. Again, if the family G distinguishes points of K, and does not vanish on K, then each continuous real function can be approximated to arbitrary accuracy by polynomials of the type in 2.8.

3 Comments and Final Remarks
Special properties of the gaussian family allowed a simple proof of the approximation theorem for single hidden-layer networks with a linear output. The condition that gaussians have to be centered at points in K is not very restrictive. Indeed, one can include any reasonable closed "receptor field" S (domain of an approximated map) into a larger convex
and compact set $K$, and continuously extend functions on $S$ onto $K$ (Tietze's extension theorem). For a given (or preselected) set of gaussian centers and standard deviations, one can, of course, find the coefficients of the "best" representation by the least-squares method. Under the additional assumption that the approximated functions are themselves members of the algebra $\mathcal{L}$, one may still have a situation where the number of gaussian components, the locations of their centers, and their standard deviations are unknown (a standard problem in spectral analysis). The procedure proposed by Medgyessy (1961) allows the unique determination of parameters. The general version of the Stone-Weierstrass theorem with the generalized polynomials (see equation 2.8) is ideally suited for architectures using $\Sigma\Pi$ units. However, it cannot be applied to complex-valued functions (Lorentz 1986). Hence, the problem of the approximation abilities of a new type of $\Pi$-networks, recently proposed by Durbin and Rumelhart (1989), remains open. In general situations, the theoretical problem of an optimal approximation with a varying number of radial basis functions, and with varying locations and ranges, is much more subtle. Selecting some subsets $A$ of the input space (restricting the environment), one may look for an extremal gaussian subspace ("the best approximation system" for all inputs of the type $A$), or try to characterize $A$ by its metric entropy (Lorentz 1986).
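The least-squares step mentioned above (fixed gaussian centers and standard deviations, unknown linear coefficients) can be sketched as follows. This is a minimal one-dimensional illustration, not the authors' procedure: the target function, the grid of sample points, the eight equally spaced centers, the common width `sigma`, and all function names are assumptions chosen for the example, and the coefficients are found from the normal equations with a small hand-written linear solver.

```python
import math

def gaussian_rbf(x, center, sigma):
    """Gaussian radial basis function on the real line."""
    return math.exp(-((x - center) ** 2) / (2.0 * sigma ** 2))

def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def fit_rbf(xs, ys, centers, sigma):
    """Least-squares coefficients from the normal equations (Phi^T Phi) c = Phi^T y."""
    phi = [[gaussian_rbf(x, c, sigma) for c in centers] for x in xs]
    k = len(centers)
    gram = [[sum(row[i] * row[j] for row in phi) for j in range(k)] for i in range(k)]
    rhs = [sum(row[i] * y for row, y in zip(phi, ys)) for i in range(k)]
    return solve_linear(gram, rhs)

def predict(x, coeffs, centers, sigma):
    """Evaluate the fitted linear combination of gaussians at x."""
    return sum(c * gaussian_rbf(x, ctr, sigma) for c, ctr in zip(coeffs, centers))

# Assumed example: approximate one period of a sine on K = [0, 1].
xs = [i / 50.0 for i in range(51)]
ys = [math.sin(2.0 * math.pi * x) for x in xs]
centers = [i / 7.0 for i in range(8)]
sigma = 0.15
coeffs = fit_rbf(xs, ys, centers, sigma)
max_err = max(abs(predict(x, coeffs, centers, sigma) - y) for x, y in zip(xs, ys))
```

When the target is itself a finite combination of the chosen gaussians, the same fit recovers its coefficients up to rounding; choosing the number, centers, and widths of the gaussians is the harder problem discussed in the text.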
References

Durbin, R., and Rumelhart, D.E. 1989. Product units: A computationally powerful and biologically plausible extension to back propagation networks. Neural Comp. 1, 133.
Eckmann, J.P., and Ruelle, D. 1985. Ergodic theory of chaos and strange attractors. Rev. Mod. Phys. 57, 617.
Funahashi, K. 1989. On the approximate realization of continuous mappings by neural networks. Neural Networks 2, 183.
Hornik, M., Stinchcombe, M., and White, H. 1989. Multilayer feedforward networks are universal approximators. Neural Networks, in press.
Lapedes, A., and Farber, R. 1987. Nonlinear Signal Processing Using Neural Networks: Prediction and System Modeling. Tech. Rep. LA-UR-88-418, Los Alamos National Laboratory, Los Alamos, NM.
Lee, S., and Kil, R.M. 1988. Multilayer feedforward potential function network, 1-161. IEEE Int. Conf. Neural Networks, San Diego, CA.
Lorentz, G.G. 1986. Approximations of Functions. Chelsea Publ. Co., New York.
Medgyessy, P. 1961. Decomposition of Superpositions of Distribution Functions. Publishing House of the Hungarian Academy of Sciences, Budapest.
Moody, J. 1989. Fast learning in multi-resolution hierarchies. Yale Computer Science, preprint.
Moody, J., and Darken, C. 1989a. Learning with localized receptive fields. In Proceedings of the 1988 Connectionist Models Summer School, D. Touretzky, G. Hinton, and T. Sejnowski, eds. Morgan Kaufmann, San Mateo, CA.
Moody, J., and Darken, C. 1989b. Fast learning in networks of locally-tuned processing units. Neural Comp. 1, 281-294.
Niranjan, M., and Fallside, F. 1988. Neural networks and radial basis functions in classifying static speech patterns. Engineering Dept., Cambridge University, CUED/F-INFENG/TR22. Neural Networks 2, 359.
Rudin, W. 1958. Principles of Mathematical Analysis, 2nd rev. ed. McGraw-Hill, New York, Toronto, London.
Shavitt, I.S. 1963. The Gaussian function in calculations of statistical mechanics and quantum mechanics. In Methods in Computational Physics: Quantum Mechanics, Vol. 2, pt. 1, B. Alder, S. Fernbach, and M. Rottenberg, eds. Academic Press, New York.
Stinchcombe, M., and White, H. 1989. Universal approximation using feedforward networks with non-sigmoid hidden layer activation functions. Conf. Neural Networks, Washington, D.C., IEEE and INNS, 1, 613.
Received 18 August 1989; accepted 9 February 1990.
Communicated by Richard Lippmann
A Neural Network for Nonlinear Bayesian Estimation in Drug Therapy

Reza Shadmehr
Department of Computer Science, University of Southern California, Los Angeles, CA 90089 USA

David Z. D'Argenio
Department of Biomedical Engineering, University of Southern California, Los Angeles, CA 90089 USA
The feasibility of developing a neural network to perform nonlinear Bayesian estimation from sparse data is explored using an example from clinical pharmacology. The problem involves estimating parameters of a dynamic model describing the pharmacokinetics of the bronchodilator theophylline from limited plasma concentration measurements of the drug obtained in a patient. The estimation performance of a backpropagation trained network is compared to that of the maximum likelihood estimator as well as the maximum a posteriori probability estimator. In the example considered, the estimator prediction errors (model parameters and outputs) obtained from the trained neural network were similar to those obtained using the nonlinear Bayesian estimator.
Neural Computation 2, 216-225 (1990) © 1990 Massachusetts Institute of Technology

1 Introduction

The performance of the backpropagation learning algorithm in pattern classification problems has been compared to that of the nearest-neighbor classifier by a number of investigators (Gorman and Sejnowski 1988; Burr 1988; Weideman et al. 1989). The general finding has been that the algorithm results in a neural network whose performance is comparable to (Burr 1988; Weideman et al. 1989) or better than (Gorman and Sejnowski 1988) that of the nearest-neighbor technique. Since the probability of correct classification for the nearest-neighbor technique can be used to obtain upper and lower bounds on the Bayes probability of correct classification, the performance of the network trained by Gorman and Sejnowski (1988) is said to have approached that of a Bayes decision rule. Benchmarking the backpropagation algorithm's performance is necessary in pattern classification problems where class distributions intersect. Yet few investigators (Kohonen et al. 1988) have compared the performance of a backpropagation trained network in a statistical
pattern recognition or estimation task to that of Bayesian or other statistical estimators. Bayesian estimators require a priori knowledge regarding the underlying statistical nature of the classification problem, and simplifying assumptions must be made to apply such estimators in a sparse data environment. A comparison of the neural network and Bayesian techniques would therefore be valuable, since neural networks have the advantage of requiring fewer assumptions in representing an unknown system.

In this paper we compare the performance of a backpropagation trained neural network developed to solve a nonlinear estimation problem to the performance of two traditional statistical estimation approaches: maximum likelihood estimation and Bayesian estimation. The particular problem considered arises in the field of clinical pharmacology, where it is often necessary to individualize a critically ill patient's drug regimen to produce the desired therapeutic response. One approach to this dosage control problem involves using measurements of the drug's response in the patient to estimate parameters of a dynamic model describing the pharmacokinetics of the drug (i.e., its absorption, distribution, and elimination from the body). From this patient-specific model, an individualized therapeutic drug regimen can be calculated. A variety of techniques have been proposed for such feedback control of drug therapy, some of which are applied on a routine basis in many hospitals [see Vozeh and Steimer (1985) for a general discussion of this problem]. In the clinical patient care setting, unfortunately, only a very limited number of noisy measurements are available from which to estimate model parameters. To solve this sparse data, nonlinear estimation problem, both maximum likelihood and Bayesian estimation methods have been employed (e.g., Sheiner et al. 1975; Sawchuk et al. 1977).
The a priori information required to implement the latter is generally available from clinical trials involving the drug in target patient populations.

2 The Pharmacotherapeutic Example
The example considered involves the drug theophylline, which is a potent bronchodilator that is often administered as a continuous intravenous infusion in acutely ill patients for treatment of airway obstruction. Since both the therapeutic and toxic effects of theophylline parallel its concentration in the blood, the administration of the drug is generally controlled so as to achieve a specified target plasma drug concentration. In a population study involving critically ill hospitalized patients receiving intravenous theophylline for relief of asthma or chronic bronchitis, Powell et al. (1978) found that the plasma concentration of theophylline, $y(t)$, could be related to its infusion rate, $r(t)$, by a simple one-compartment, two-parameter dynamic model [i.e., $dy(t)/dt = -(CL/V)\,y(t) + r(t)/V$]. In the patients studied (nonsmokers with no other
organ dysfunction), significant variability was observed in the two kinetic model parameters: distribution volume $V$ (liters/kg body weight) = 0.50 ± 0.16 (mean ± SD); elimination clearance $CL$ (liters/kg/hr) = 0.0386 ± 0.0187. In what follows, it will be assumed that the population distribution of $V$ and $CL$ can be described by a bivariate log-normal density with the above moments and a correlation between parameters of 0.5. For notational convenience, $\alpha$ will be used to denote the vector of model parameters ($\alpha = [V\ CL]^T$) and $\mu$ and $\Omega$ used to represent the prior mean parameter vector and covariance matrix, respectively.

Given this a priori population information, a typical initial infusion regimen would consist of a constant loading infusion, $r_1$, equal to 10.0 mg/kg/hr for 0.5 hr, followed by a maintenance infusion, $r_2$, of 0.39 mg/kg/hr. This dosage regimen is designed to produce plasma concentrations of approximately 10 μg/ml for the patient representing the population mean (such a blood level is generally effective yet nontoxic). Because of the significant intersubject variability in the pharmacokinetics of theophylline, however, it is often necessary to adjust the maintenance infusion based on plasma concentration measurements obtained from the patient to achieve the selected target concentration. Toward this end, plasma concentration measurements are obtained at several times during the initial dosage regimen to estimate the patient's drug clearance and volume.

We assume that the plasma measurements, $z(t)$, can be related to the dynamic model's prediction of plasma concentration, $y(t, \alpha)$, as follows: $z(t) = y(t, \alpha) + e(t)$. The measurement error, $e(t)$, is assumed to be an independent, Gaussian random variable with mean zero and standard deviation of $\sigma(\alpha) = 0.15 \times y(t, \alpha)$. A typical clinical scenario might involve only two measurements, $z(t_1)$ and $z(t_2)$, where $t_1 = 1.5$ hr and $t_2 = 10.0$ hr.
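As a rough illustration of the setup just described, the sketch below evaluates the one-compartment model in closed form for the two-step infusion, draws one patient from a bivariate log-normal population, and simulates the two noisy measurements. The numerical values are those quoted in the text; the function and variable names are hypothetical, the initial concentration is assumed to be zero, and the log-normal parameters are obtained by standard moment matching, which is an assumption about how the population density is parameterized.

```python
import math
import random

# Population values quoted in the text.
MEAN = (0.50, 0.0386)               # V (liters/kg), CL (liters/kg/hr)
SD = (0.16, 0.0187)
CORR = 0.5                          # correlation between V and CL
R1, R2, T_LOAD = 10.0, 0.39, 0.5    # loading rate, maintenance rate (mg/kg/hr), loading time (hr)
T_OBS = (1.5, 10.0)                 # observation times (hr)
CV_NOISE = 0.15                     # measurement SD as a fraction of y(t)

def concentration(t, v, cl):
    """Closed-form solution of dy/dt = -(CL/V) y + r(t)/V, assuming y(0) = 0,
    for infusion rate R1 up to T_LOAD and R2 afterwards."""
    k = cl / v
    if t <= T_LOAD:
        return (R1 / cl) * (1.0 - math.exp(-k * t))
    y_load = (R1 / cl) * (1.0 - math.exp(-k * T_LOAD))
    decay = math.exp(-k * (t - T_LOAD))
    return y_load * decay + (R2 / cl) * (1.0 - decay)

def lognormal_parameters():
    """Mean vector and covariance of (ln V, ln CL) matching the population
    mean, SD, and correlation (moment matching, an assumption)."""
    cov = [[SD[i] * SD[j] * (1.0 if i == j else CORR) for j in range(2)]
           for i in range(2)]
    log_cov = [[math.log(cov[i][j] / (MEAN[i] * MEAN[j]) + 1.0) for j in range(2)]
               for i in range(2)]
    log_mean = [math.log(MEAN[i]) - log_cov[i][i] / 2.0 for i in range(2)]
    return log_mean, log_cov

def sample_patient(rng):
    """Draw one (V, CL) pair from the bivariate log-normal population."""
    log_mean, log_cov = lognormal_parameters()
    # Cholesky factor of the 2x2 log-space covariance.
    l11 = math.sqrt(log_cov[0][0])
    l21 = log_cov[1][0] / l11
    l22 = math.sqrt(log_cov[1][1] - l21 ** 2)
    e1, e2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    v = math.exp(log_mean[0] + l11 * e1)
    cl = math.exp(log_mean[1] + l21 * e1 + l22 * e2)
    return v, cl

def simulate_measurements(v, cl, rng):
    """Noisy plasma concentrations z(t1), z(t2) with SD = 0.15 * y(t)."""
    measurements = []
    for t in T_OBS:
        y = concentration(t, v, cl)
        measurements.append(y + rng.gauss(0.0, CV_NOISE * y))
    return measurements

rng = random.Random(0)
v, cl = sample_patient(rng)
z = simulate_measurements(v, cl, rng)
y_mean_patient = [concentration(t, MEAN[0], MEAN[1]) for t in T_OBS]
```

For the population-mean patient, `y_mean_patient` stays close to the 10 μg/ml target at both observation times, consistent with the stated design of the regimen (mg/liter and μg/ml are the same unit).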
The problem then involves estimating $V$ and $CL$ using the measurements made in the patient, the kinetic model, knowledge of the measurement error, as well as the prior distribution of model parameters.

3 Estimation Procedures
Two traditional statistical approaches have been used to solve this sparse data system estimation problem: maximum likelihood (ML) estimation and a Bayesian procedure that calculates the maximum a posteriori probability (MAP). Given the estimation problem defined above, the ML estimate, $\alpha_{ML}$, of the model parameters, $\alpha$, is defined as follows: (3.1)
where $z = [z(t_1)\ z(t_2)]^T$, $y(\alpha) = [y(t_1, \alpha)\ y(t_2, \alpha)]^T$, and $\Sigma(\alpha) = \mathrm{diag}\{\sigma_{t_1}(\alpha)\ \sigma_{t_2}(\alpha)\}$. The MAP estimator is defined as follows:
where $v = \{v_i\}$, $i = 1, 2$, $\Phi = \{\phi_{ij}\}$, $i, j = 1, 2$, with $v_i = \ln \mu_i - \phi_{ii}/2$, $i = 1, 2$, and $\phi_{ij} = \ln(\omega_{ij}/\mu_i \mu_j + 1)$, $i, j = 1, 2$. The mean and covariance of the prior parameter distribution, $\mu$ and $\Omega$ (see above), define the quantities $\mu_i$ and $\omega_{ij}$. Also, $A(\alpha) = \mathrm{diag}\{\ln \alpha_1\ \ln \alpha_2\}$. The corresponding estimates of the drug's concentration in the plasma can also be obtained using the above parameter estimates together with the kinetic model.

To obtain the ML and MAP estimates a general purpose pharmacokinetic modeling and data analysis software package was employed, which uses the Nelder-Mead simplex algorithm to perform the required minimizations and a robust stiff/nonstiff differential equation solver to obtain the output of the kinetic model (D'Argenio and Schumitzky 1988).

As an alternate approach, a feedforward, three-layer neural network was designed and trained to function as a nonlinear estimator. The architecture of this network consisted of two input units, seven hidden units, and four output units. The number of hidden units was arrived at empirically. The inputs to this network were the patient's noisy plasma samples $z(t_1)$ and $z(t_2)$, and the outputs were the network's estimates for the patient's distribution volume and elimination clearance ($\alpha^{NN}$) as well as for the theophylline plasma concentration at the two observation times [$y(t_1)$, $y(t_2)$]. To determine the weights of the network, a training set was simulated using the kinetic model defined above. Model parameters (1000 pairs) were randomly selected according to the log-normal prior distribution defining the population ($\alpha_i$, $i = 1, \ldots, 1000$), and the resulting model outputs determined at the two observation times [$y(t_1, \alpha_i)$, $y(t_2, \alpha_i)$, $i = 1, \ldots, 1000$]. Noisy plasma concentration measurements were then simulated [$z(t_1)_i$, $z(t_2)_i$, $i = 1, \ldots, 1000$] according to the output error model defined previously.

From this set of inputs and outputs, the backpropagation algorithm (Rumelhart et al. 1986) was used to train the network as follows. A set of 50 vectors was selected from the full training set, which included the vectors containing the five smallest and five largest values of $V$ and $CL$. After the vectors had been learned, the performance of the network was evaluated on the full training set. Next, 20 more vectors were added to the original 50 vectors and the network was retrained. This procedure was repeated until addition of 20 new training vectors did not produce appreciable improvement in the ability of the network to estimate parameters in the full training set. The final network was the result of training on a set of 170 vectors, each vector being presented to the network approximately 32,000 times. As trained, the network approximates the minimum expected (over the space of parameters and observations) mean squared error estimate for $\alpha$, $y(t_1)$, and $y(t_2)$. [See Asoh and Otsu (1989) for discussion of the relation between nonlinear data analysis problems and neural networks.]

4 Results
The performance of the three estimators (ML, MAP, NN) was evaluated using a test set (1000 elements) simulated in the same manner as the training set. Figures 1 and 2 show plots of the estimates of $V$ and $CL$, respectively, versus their true values from the test set data, using each of the three estimators. Also shown in each graph are the lines of regression (solid line) and identity (dashed line).

To better quantify the performance of each estimator, the mean and root mean squared prediction error (Mpe and RMSpe, respectively) were determined for each of the two parameters and each of the two plasma concentrations. For example, the prediction error (percent) for the NN volume estimate was calculated as $pe_i = (V_i^{NN} - V_i)\,100/V_i$, where $V_i$ is the true value of volume for the $i$th sample from the test set and $V_i^{NN}$ is the corresponding NN estimate. Table 1 summarizes the resulting values of the Mpe for each of the three estimators. From inspection of Table 1 we conclude that the biases associated with each estimator, as measured by the Mpe for each quantity, are relatively small and comparable. As a single measure of both the bias and variability of the estimators, the RMSpe values given in Table 2 indicate that, with respect to the parameters $V$ and $CL$, the precision of the NN and MAP estimators is similar and significantly better than that of the ML estimator in the example considered here.

For both the nonlinear maximum likelihood and Bayesian estimators, an asymptotic error analysis could be employed to provide approximate errors for given parameter estimates. In an effort to supply some type of error analysis for the NN estimator, Figure 3 was constructed from the test set data and estimation results. The upper panel shows the mean and standard deviation of the prediction error associated with the NN estimates of $V$ in each of the indicated intervals. The corresponding results for $CL$ are shown in the lower panel of Figure 3. These results could then be used to provide approximate error information corresponding to a particular point estimate ($V^{NN}$ and $CL^{NN}$) from the neural network.

Table 1: Mean Prediction Errors (Mpe) for the Parameters ($V$ and $CL$) and Plasma Concentrations [$y(t_1)$ and $y(t_2)$] as Calculated, for Each of the Three Estimators, from the Simulated Test Set.

                 Mpe (%)
Estimator    V      CL     y(t1)   y(t2)
ML           2.5    3.4    -1.1    -3.0
MAP          1.0    6.1     0.8     1.5
NN           4.7    3.8     0.6     7.3

Table 2: Root Mean Square Prediction Errors (RMSpe) for Each Estimator.

                 RMSpe (%)
Estimator    V      CL     y(t1)   y(t2)
ML           21.    44.    16.     16.
MAP          14.    30.    12.     13.
NN           16.    31.    13.     14.

Figure 1: Estimates of $V$ for the ML, MAP, and NN procedures (top to bottom), plotted versus the true value of $V$ for each of the 1000 elements of the test set. The corresponding regression lines are as follows: $V^{ML} = 1.0V + 0.004$, $r^2 = 0.74$; $V^{MAP} = 0.80V + 0.094$, $r^2 = 0.81$; $V^{NN} = 0.95V + 0.044$, $r^2 = 0.80$.

Figure 2: Estimates of $CL$ (L/kg/hr) for the ML, MAP, and NN procedures (top to bottom), versus their true values as obtained from the test set data. The corresponding regression lines are as follows: $CL^{ML} = 0.96CL + 0.002$, $r^2 = 0.61$; $CL^{MAP} = 0.73CL + 0.010$, $r^2 = 0.72$; $CL^{NN} = 0.69CL + 0.010$, $r^2 = 0.69$.

5 Discussion
These results demonstrate the feasibility of using a backpropagation trained neural network to perform nonlinear estimation from sparse data. In the example presented herein, the estimation performance of the network was shown to be similar to a Bayesian estimator (maximum a posteriori probability estimator). The performance of the trained network in this example is especially noteworthy in light of the considerable difficulty in resolving parameters due to the uncertainty in the mapping model inherent in this estimation problem, which is analogous to intersection of class distributions in classification problems. While the particular example examined in this paper represents a realistic scenario involving the drug theophylline, to have practical utility the resulting network would need to be generalized to accommodate different dose infusion rates, dose times, observation times, and number of observations. Using an appropriately constructed training set, simulated to reflect the above, it may be possible to produce such a sufficiently generalized neural network estimator that could be applied to drug therapy problems in the clinical environment. It is of further interest to note that the network can be trained on simulations from a more complete model for the underlying process (e.g., physiologically based model as opposed to the compartment type model used herein), while still producing estimates of parameters that will be of primary clinical interest (e.g., systemic drug clearance, volume of distribution). Such an approach has the important advantage over traditional statistical estimators of building into the estimation procedure robustness to model simplification errors.
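The two summary statistics used in Section 4 to compare the estimators, the mean and root mean squared percent prediction error, can be sketched as follows. The percent prediction error follows the definition given in the text; the function names and the three sample values are hypothetical and are not taken from the paper's test set.

```python
import math

def prediction_errors(estimates, truths):
    """Percent prediction error pe_i = (estimate_i - true_i) * 100 / true_i."""
    return [100.0 * (est - true) / true for est, true in zip(estimates, truths)]

def mean_pe(pe):
    """Mean prediction error (Mpe): a measure of bias."""
    return sum(pe) / len(pe)

def rms_pe(pe):
    """Root mean squared prediction error (RMSpe): bias and variability combined."""
    return math.sqrt(sum(e * e for e in pe) / len(pe))

# Hypothetical true and estimated volumes (liters/kg), for illustration only.
true_v = [0.40, 0.50, 0.80]
est_v = [0.44, 0.45, 0.80]
pe = prediction_errors(est_v, true_v)
mpe, rmspe = mean_pe(pe), rms_pe(pe)
```

In this made-up sample the errors of +10% and -10% cancel in the Mpe but not in the RMSpe, which is why the text treats the RMSpe as the single measure of both bias and variability.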
Figure 3: Distribution of prediction errors of volume (upper) and clearance (lower) for the NN estimator as obtained from the test set data. Prediction errors are displayed as mean (●) plus one standard deviation above the mean.
Acknowledgments

This work was supported in part by NIH Grant P41-RR01861. R.S. was supported by an IBM fellowship in Computer Science.
References

Asoh, H., and Otsu, N. 1989. Nonlinear data analysis and multilayer perceptrons. IEEE Int. Joint Conf. Neural Networks II, 411-415.
Burr, D. J. 1988. Experiments on neural net recognition of spoken and written text. IEEE Trans. Acoustics, Speech, Signal Processing 36, 1162-1168.
D'Argenio, D. Z., and Schumitzky, A. 1988. ADAPT II User's Guide. Biomedical Simulations Resource, University of Southern California, Los Angeles.
Gorman, R. P., and Sejnowski, T. J. 1988. Analysis of hidden units in a layered network trained to classify sonar targets. Neural Networks 1, 75-89.
Kohonen, T., Barna, G., and Chrisley, R. 1988. Statistical pattern recognition with neural networks: Benchmarking studies. IEEE Int. Conf. Neural Networks 1, 61-68.
Powell, J. R., Vozeh, S., Hopewell, P., Costello, J., Sheiner, L. B., and Riegelman, S. 1978. Theophylline disposition in acutely ill hospitalized patients: The effect of smoking, heart failure, severe airway obstruction, and pneumonia. Am. Rev. Resp. Dis. 118, 229-238.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. 1986. Learning representations by back-propagating errors. Nature 323, 533-536.
Sawchuk, R. J., Zaske, D. E., Cipolle, R. J., Wargin, W. A., and Strate, R. G. 1977. Kinetic model for gentamicin dosing with the use of individual patient parameters. Clin. Pharmacol. Ther. 21, 362-369.
Sheiner, L. B., Halkin, H., Peck, C., Rosenberg, B., and Melmon, K. L. 1975. Improved computer-assisted digoxin therapy: A method using feedback of measured serum digoxin concentrations. Ann. Intern. Med. 82, 619-627.
Vozeh, S., and Steimer, J.-L. 1985. Feedback control methods for drug dosage optimisation: Concepts, classification and clinical application. Clin. Pharmacokinet. 10, 457-476.
Weideman, W. E., Manry, M. T., and Yau, H. C. 1989. A comparison of a nearest neighbor classifier and a neural network for numeric hand print character recognition. IEEE Int. Joint Conf. Neural Networks 1, 117-120.
Received 16 November 1989; accepted 6 February 1990.
Communicated by Jack Cowan
Analysis of Neural Networks with Redundancy

Yoshio Izui*
Alex Pentland
Vision Science Group, E15-387, The Media Laboratory, Massachusetts Institute of Technology, 20 Ames Street, Cambridge, MA 02139 USA
Biological systems have a large degree of redundancy, a fact that is usually thought to have little effect beyond providing reliable function despite the death of individual neurons. We have discovered, however, that redundancy can qualitatively change the computations carried out by a network. We prove that for both feedforward and feedback networks the simple duplication of nodes and connections results in more accurate, faster, and more stable computation.

1 Introduction
It has been estimated that the human brain has more than $10^{11}$ neurons in all of its many functional subdivisions, and that each neuron is connected to around $10^4$ other neurons (Amari 1978; DARPA 1988). Furthermore, these neurons and connections are redundant. Artificial systems, in contrast, have been very much smaller and have normally had no redundancy at all. This lack of redundancy in artificial systems is due both to cost and to the generally held notion that redundancy in biological systems serves primarily to overcome the problems caused by the death of individual neurons. While it is true that redundant neural networks are more resistant to such damage (Tanaka et al. 1988), we will show that there are other, perhaps more important computational effects associated with network redundancy.

In this paper we mathematically analyze the functional effects of neuron duplication, the simplest form of redundancy, and prove that duplicated neural networks converge faster than unduplicated networks, require less accuracy in interneuron communication, and converge to more accurate solutions. These results are obtained by showing that each duplicated network is equivalent to an unduplicated one with sharpened nonlinearities and initial values that are normally distributed with a smaller variance.

*Current address: Industrial Systems Lab., Mitsubishi Electric Corp., 8-1-1, Tsukaguchi, Amagasaki, Hyogo 661 Japan.

Neural Computation 2, 226-238 (1990) © 1990 Massachusetts Institute of Technology
Further, we prove that the asynchronous operation of such networks produces faster and more stable convergence than the synchronous operation of the same network. These results are obtained by showing that each asynchronous network is approximately equivalent to a synchronous network that uses the Hessian of the energy function to update the network weights.

2 Feedforward Neural Networks
2.1 Network Duplication. For simplicity we start by considering three-layer feedforward neural networks (Rumelhart et al. 1986), which are duplicated $L$ times at the input layer and $M$ times at the hidden layer. A duplication factor of $M$ means that $M$ neurons have exactly the same inputs, that is, each input forks into $M$ identical signals that are fed into the corresponding neurons. To produce a duplicated network from an unduplicated one, we start by copying each input layer neuron and its input-output connections $L$ times, setting the initial weights between input and hidden layers to uniformly distributed random values. We then duplicate the hidden layer $M$ times by simply copying each neuron and its associated weights and connections $M$ times. Although we will not mathematically analyze the case of hidden-layer weights that are randomly distributed rather than simply copied, it is known experimentally that randomly distributed weights produce better convergence. The energy function (or error function) of these neural networks is defined as:
where the two data terms are the $i$th input and $k$th output of the $p$th training datum, $W^{(1)}_{jm,il}$ is the weight between the $i$th neuron with $l$th duplication at the input layer and the $j$th neuron with $m$th duplication at the hidden layer, $W^{(2)}_{k,jm}$ is the weight between the $j$th neuron with $m$th duplication at the hidden layer and the $k$th neuron at the output layer, and $g(x) = 1/(1 + e^{-x})$ is the sigmoid function. For $n = 1, 2$ and $r = ml, m$, learning can be conducted by either the simple gradient method,
or by the momentum method,
where $\eta$ and $\beta$ may be thought of as the network gain, and $(1 - \alpha)$ as the network damping.
2.2 Equivalent Neural Networks. By employing the average of the duplicated weights
which is normally distributed we can derive an unduplicated network that is equivalent to the above duplicated network. The energy function of this unduplicated network is
where we now assume a single network duplication factor D = L = M for simplicity of exposition. By rewriting equations 2.2 and 2.3 using the averaged weights we can derive learning equations for this equivalent, unduplicated network. For the gradient update rule we have (2.6)
for the momentum update rule we obtain
Comparison of the original unduplicated network's update equations with the above update rules for the duplicated network's equivalent shows that duplicated networks have their sigmoid function sharpened by the network duplication factor $D$, but that the "force" causing the weights to evolve is reduced by $1/D$. As a consequence both the duplicated and unduplicated networks follow the same path in weight space as they evolve. The factor of $D$ will, however, cause the duplicated network to converge much faster than the unduplicated network, as will be shown in the next section.

2.3 Convergence Speed.
2.3.1 The Gradient Descent Update Method. We first consider the gradient descent method of updating the weights. First, let us define $\tilde{W}^{(1)}_{ji} = D\bar{W}^{(1)}_{ji}$, $\tilde{W}^{(2)}_{kj} = D\bar{W}^{(2)}_{kj}$, and $\dot{\tilde{W}}^{(1)}_{ji} = d\tilde{W}^{(1)}_{ji}/dt$, $\dot{\tilde{W}}^{(2)}_{kj} = d\tilde{W}^{(2)}_{kj}/dt$. The network convergence time $T_{gradient}$ can be obtained by first rewriting equation 2.6 to obtain an expression for $dt$, and then by integrating $dt$:
(2.8)
where
(2.9)
For given initial values the integral part of equation 2.8 is a constant. Thus we obtain the result that a network's convergence time $T_{gradient}$ is proportional to $1/D$, where $D$ is the network duplication factor.¹ The dramatic speed-up in convergence caused by duplication may be a major reason biological systems employ such a high level of redundancy. Note that the same $D$-fold speed-up can be achieved in an unduplicated network by simply sharpening the sigmoid function by a factor of $D$. Readers should be cautioned, however, that these equations describe a continuous system, whereas computer simulations employ a finite difference scheme that uses discrete time steps. As $D$ becomes large the finite difference approximation can break down, at which point use of a $D$-sharpened sigmoid will no longer result in faster convergence. The value of $D$ at which breakdown occurs is a function of the network's maximum weight velocities.
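The equivalence that underlies this result can be illustrated for a single hidden neuron. The sketch below is an illustration under stated assumptions, not the paper's derivation: it assumes that each of the $D$ copies of an input carries the same signal and has its own weight drawn uniformly from $\pm W_a$, and it checks that the neuron's activation equals a sigmoid sharpened by $D$ applied to the averaged weights. It also checks numerically that the averaged weights have variance $W_a^2/3D$, the value quoted in the footnote on the distribution of initial values. All names and numbers are hypothetical.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def duplicated_activation(weights, inputs):
    """Activation of one hidden neuron fed by D copies of every input.
    weights[l][i] is the weight from the l-th copy of input i."""
    net = sum(w * x for copy in weights for w, x in zip(copy, inputs))
    return sigmoid(net)

def sharpened_activation(weights, inputs):
    """Same neuron written with averaged weights and a sigmoid sharpened by D."""
    d = len(weights)
    averaged = [sum(copy[i] for copy in weights) / d for i in range(len(inputs))]
    return sigmoid(d * sum(w * x for w, x in zip(averaged, inputs)))

rng = random.Random(1)
D, W_A = 10, 0.5                      # duplication factor and initial weight range
inputs = [0.3, -0.7, 1.0]
weights = [[rng.uniform(-W_A, W_A) for _ in inputs] for _ in range(D)]
a_dup = duplicated_activation(weights, inputs)
a_sharp = sharpened_activation(weights, inputs)

# Empirical variance of an averaged weight versus W_a^2 / (3 D).
samples = [sum(rng.uniform(-W_A, W_A) for _ in range(D)) / D for _ in range(20000)]
mean = sum(samples) / len(samples)
variance = sum((s - mean) ** 2 for s in samples) / len(samples)
expected_variance = W_A ** 2 / (3 * D)
```

The two activations agree to rounding error, and the averaged weights concentrate around zero as $D$ grows.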
2.3.2 The Momentum Update Method. The more complex momentum update method may be similarly treated. The convergence time $T_{momentum}$ is
(2.10)
(2.11)
and for $n = 1, 2$,
(2.12)
(2.13)
¹ where the averaged weights $\bar{W}_{ji}$ have a normal distribution $p_D = N(0, W_a^2/3D)$. For small $W_a$ and large $D$, $p_D$ is approximately a delta function located at the average zero, so that the convergence time $T_{gradient}$ is not much affected by the distribution of initial values.

If we assume that $\beta D \gg (1 - \alpha)$, that is, that the gain of the system times the sensitivity of the sigmoid function is much larger than the amount of damping in the system, then the damping may be ignored to achieve the following approximation: (2.14) The solution of (2.14) is (2.15) where $C_{1ji}$ is a constant. When $D$ is large equation 2.10 may be simplified as follows:
(2.16)
where $ds$ is as in equation 2.9. We may simplify equation 2.16 still further by noting that
(2.17) We first use this relation to obtain $E_{initial}$, the energy at the initial state, (2.18) assuming the standard initial values $\dot{\tilde{W}}^{(1)}_{ji} = \dot{\tilde{W}}^{(2)}_{kj} = 0$. We can then use equations 2.18 and 2.17 to reduce our expression for $T_{momentum}$ to the following:

$$T_{momentum} = \frac{1}{\sqrt{D}} \int \frac{ds}{\sqrt{2\beta(E_{initial} - E)}} \tag{2.19}$$
Thus we obtain the result that for the momentum method the network convergence time is proportional to $1/\sqrt{D}$, where $D$ is the network duplication factor. As with the gradient descent method, dramatic speed-ups are therefore available by employing network redundancy. Again, the same effect may be obtained for unduplicated networks by simply using a $D$-sharpened sigmoid function; with the momentum method, however, great care must be taken to keep $D$ small enough that the finite difference approximation still holds.
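The two scaling laws can be checked numerically on a toy problem. The sketch below is not the network analyzed above: it assumes a single weight on a quadratic energy $E(w) = w^2/2$, a first-order flow $dw/dt = -\eta D\, dE/dw$ for the gradient method, and an undamped second-order flow $d^2w/dt^2 = -\beta D\, dE/dw$ for the momentum method with damping neglected. The function names, the step size, and the stopping rules are assumptions made for the illustration.

```python
import math

def first_order_time(d, gain=1.0, dt=1e-4, w0=1.0, eps=0.01):
    """Time for dw/dt = -gain * d * w to shrink w from w0 to eps * w0 (Euler steps)."""
    w, t = w0, 0.0
    while w > eps * w0:
        w -= gain * d * w * dt
        t += dt
    return t

def second_order_time(d, gain=1.0, dt=1e-4, w0=1.0):
    """Time for d2w/dt2 = -gain * d * w (no damping, zero initial velocity)
    to first reach the energy minimum at w = 0."""
    w, v, t = w0, 0.0, 0.0
    while w > 0.0:
        v -= gain * d * w * dt   # semi-implicit Euler: update velocity first
        w += v * dt
        t += dt
    return t

def loglog_slope(xs, ys):
    """Least-squares slope of log10(y) against log10(x)."""
    lx = [math.log10(x) for x in xs]
    ly = [math.log10(y) for y in ys]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    num = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    den = sum((a - mx) ** 2 for a in lx)
    return num / den

factors = [1, 4, 16, 64]             # duplication factors D
slope_gradient = loglog_slope(factors, [first_order_time(d) for d in factors])
slope_momentum = loglog_slope(factors, [second_order_time(d) for d in factors])
```

On this toy problem the fitted log-log slopes come out close to -1 for the first-order flow and close to -0.5 for the undamped second-order flow, matching the $1/D$ and $1/\sqrt{D}$ laws. Increasing `dt` or `d` far enough makes the discrete updates inaccurate, which is the finite difference breakdown the text warns about.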
Figure 1: The relationship between convergence epochs and $D$ (horizontal axis: $\log_{10} D$).

2.4 Experimental Results. Figure 1 shows experimental results illustrating how convergence speed is increased by network redundancy. This example shows the number of training set presentations ("epochs") required to learn an XOR problem as a function of the network duplication factor $D$. Learning was conducted using a momentum update method with $\alpha = 0.9$, $\beta = 0.1$, $W_a = 0.5$, and a convergence criterion of 10% error; each data point is based on the average of 10 to 100 different trials. The above theoretical result predicts a slope of -0.5 for this graph; the best fit to the data has a slope of -0.43, which is within experimental sampling error.

3 Feedback Neural Networks
3.1 Network Duplication and Equivalence. The energy function for feedback neural networks (Hopfield and Tank 1985) with D duplication is defined in a manner similar to that for feedforward networks:
where $T_{ij} = T_{ji}$ is the weight between neurons $i$ and $j$ in their original index, $I_i$ is the forced signal from the environment to neuron $i$ in its original index, and $y_i^{(l)}$ is the output signal at the $l$th duplicate of the $i$th neuron, as defined in the following two equations: (3.2) (3.3) where $u_i^{(l)}$ is the internal signal and $\tau$ is the decay factor. Given large $\tau$ and random, zero-average initial values of $u_i^{(l)}$, the duplicated network governing equations 3.1, 3.2, and 3.3 can be rewritten to obtain an equivalent unduplicated network as follows:
(3.5) (3.6)
(3.7) (3.8) The equivalent unduplicated energy function is defined by equation 3.4, the dynamic behavior of the neurons is described by equation 3.5, and the transfer function at each neuron is given by equation 3.6. Examination of equation 3.5 reveals that this equivalent network has an updating "force" that is $D$ times larger than that of the original unduplicated network, so we may expect that the duplicated network will have a faster rate of convergence.

3.2 Convergence Speed. The convergence time $T_{\text{feedback}}$ can be obtained as in the feedforward case:

(3.9)

Given large $\tau$ and $D$, $T_{\text{feedback}}$ can be approximated by

(3.10)

As with feedforward neural networks, the time integral is a constant given initial values, so that we obtain the result that the network convergence time $T_{\text{feedback}}$ is proportional to $1/D$.
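The effect of a $D$-times larger updating force can be sketched with a one-variable gradient flow. This is a toy illustration under stated assumptions (a linear restoring force $-D k u$, forward Euler integration, and a relative tolerance as the convergence criterion), not a simulation of the feedback network itself; the names are hypothetical.

```python
def steps_to_converge(d, k=1.0, u0=1.0, dt=1e-3, tol=1e-3):
    """Count Euler steps of du/dt = -d * k * u until |u| <= tol * |u0|."""
    u, steps = u0, 0
    while abs(u) > tol * abs(u0):
        u -= dt * d * k * u  # first-order update with the force scaled by d
        steps += 1
    return steps


if __name__ == "__main__":
    for d in (1, 10, 100):
        print(d, steps_to_converge(d))
```

Each tenfold increase in $D$ cuts the number of steps by roughly a factor of ten, which corresponds to a slope of -1.0 on a log-log plot.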
[Figure 2: The relationship between convergence iterations and D.]

3.3 Experimental Results. Figure 2 illustrates how network convergence is speeded up by redundancy. In this figure the number of iterations required to obtain convergence is plotted for a traveling salesman problem with 10 cities as a function of the network duplication factor $D$. In these examples $\tau = 10.0$, $dt = 0.01$, the $T_{ij}$ and $I_i$ are randomly distributed over $\pm 1.0$, and the $u_i^{(l)}$ are randomly distributed over $\pm 1/m$. Each data point is the average of 100 trials. The above theoretical result predicts a slope of -1.0; the best fit to the data has a slope of -0.8, which is within experimental sampling error.
3.4 Solution Accuracy. As the duplication factor $D$ increases, the distribution of initial values $\bar{u}_i$ in the equivalent unduplicated network becomes progressively narrower, as $\bar{u}_i$ is normally distributed with mean zero and variance $u_0^2/3D$, where $-u_0 < u_i^{(l)} < u_0$. Experimentally, it is known (Uesaka 1988; Abe 1989) that if initial values are concentrated near the center of the $u_i$'s range then better solutions will be obtained. Thus the fact that increasing $D$ narrows the distribution of $\bar{u}_i$ indicates that we may expect increasing $D$ to produce increased solution accuracy as well. We have experimentally verified this expectation in the case of the traveling salesman problem.

3.5 Communication Accuracy. One major problem for analog implementations of neural networks is that great accuracy in interneuron communication (i.e., accuracy in specifying the weights) is required to reach an accurate solution. Network duplication reduces this problem by allowing statistical averaging of signals to achieve great overall accuracy with only low-accuracy communication links. For example, if the $u_i^{(l)}$ in a feedback network have a range of $\pm 128$ and uniform noise with a range of $\pm 4$, giving a communication accuracy of five bits, then for the averaged $\bar{u}_i$ the noise will have a standard deviation of only $\pm 2.3/\sqrt{D}$, giving roughly 10 bits of communication accuracy when $D = 100$.

4 Operation Mode
Given the advantages of redundancy demonstrated above, it seems very desirable to employ large numbers of neurons. To accomplish this in a practical and timely manner requires a large degree of hardware parallelism, and the difficulty of synchronizing large numbers of processors makes asynchronous operation very attractive. It seems, therefore, that one consequence of a large degree of redundancy is the sort of asynchronous operation seen in biological systems. The computational effects of choosing a synchronous or asynchronous mode of operation have generally been regarded as negligible, although there are experimental reports of better performance using asynchronous update rules. In the following we mathematically analyze the performance of synchronous and asynchronous operation networks by proving that the operation of an asynchronous network is equivalent to that of a particular type of synchronous network whose update rule considers the Hessian of the energy function. We can therefore show that asynchronous operation will generally result in faster and more stable network convergence.

4.1 Equivalent Operation Mode. In the preceding discussion we assumed synchronous operation where the network state $W(T_j)$ is updated at each time $T_{j+1} = T_j + \Delta T$:
(4.1)
where
and $E$ is the energy or error function. To describe asynchronous operation (and for simplicity we will consider only the gradient descent update rule) we further divide the time interval $\Delta T$ into $K$ smaller steps $t_l$, such that $t_{l+1} = t_l + \Delta t$, where $\Delta t = \Delta T/K$, $t_0 = 0$, and $K$ is large, thus obtaining the following update equations:
is now the time averaged update equation, which is related to the detailed behavior of the network by the relations
(4.4)
and
Equations 4.3 to 4.6 describe "microstate" updates that are conducted throughout each interval $T_j$ whenever the gradient at subinterval $t_l$ is available, that is, whenever one of the network's neurons fires.
4.2 Synchronous Equivalent to Asynchronous Operation. We first define the gradient and Laplacian of $E$ at asynchronous times $t_l$ to be

and define that all subscripts of $A$, $B$, and $t$ are taken to be modulo $K$. We will next note that the gradient at time $t_{l+1}$ can be obtained by using the gradient at time $t_1$ and the Laplacian at times $t_1, \ldots, t_l$ as below:

(4.8) (4.9) (4.10)
Assuming that $\Delta T$ is small, and thus that $\Delta t$ is also small, then at time $T_j$ the $K$-time-step time-averaged gradient is (4.11) (4.12) (4.13) (4.14) Thus the time-averaged state update equation for an asynchronous network is (4.15) This update function is reminiscent of second-order update functions, which take into account the curvature of the energy surface by employing the Hessian of the energy function (Becker and Cun 1988):
$$\frac{dw}{dT} = -\eta\,(I + \mu \nabla_w^2 E)^{-1} \nabla_w E \qquad (4.16)$$

where the identity matrix $I$ is a "stabilizer" that improves performance when the Laplacian is small. Taking the first-order Taylor expansion of equation 4.16 about $\nabla_w^2 E = 0$, we obtain

$$\frac{dw}{dT} = -\eta\,(I - \mu \nabla_w^2 E) \nabla_w E \qquad (4.17)$$
and setting $\mu = \eta \Delta T/2$ we see that equations 4.15 and 4.16 are equivalent (given small $\Delta T$ so that the Taylor expansion is accurate). Thus equation 4.17 is a synchronous second-order update rule that is identical to the time-averaged asynchronous update rule of equation 4.15. The only assumption required to obtain this equivalence is that the time step $\Delta T$ is small enough that the approximations of equations 4.13 and 4.17 are valid. Investigating equation 4.17 reveals the source of the advantages enjoyed by asynchronous update rules. In the first stages of the convergence process (where the energy surface is normally concave upward) $\nabla_w^2 E$ is negative and thus larger updating steps are taken, speeding up the overall convergence rate. On the other hand, during the last stages of convergence (where the energy surface is concave downward) $\nabla_w^2 E$ is positive and thus smaller updating steps are taken, preventing undesired oscillations.
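The relation between the second-order rule of equation 4.16 and its first-order expansion, equation 4.17, can be sketched in the scalar case, where the Hessian reduces to a single curvature value. The function names and the numerical values below are illustrative assumptions, not taken from the paper.

```python
def second_order_step(grad, hess, eta, mu):
    """Scalar form of equation 4.16: dw/dT = -eta * grad / (1 + mu * hess)."""
    return -eta * grad / (1.0 + mu * hess)


def linearized_step(grad, hess, eta, mu):
    """Scalar form of equation 4.17: dw/dT = -eta * (1 - mu * hess) * grad."""
    return -eta * (1.0 - mu * hess) * grad


if __name__ == "__main__":
    eta, mu, grad = 0.1, 0.1, 1.0
    for hess in (-1.0, -0.1, 0.0, 0.1, 1.0):
        print(hess,
              second_order_step(grad, hess, eta, mu),
              linearized_step(grad, hess, eta, mu))
```

For small $\mu \nabla_w^2 E$ the two steps nearly coincide. A negative curvature value enlarges the step relative to plain gradient descent and a positive one shrinks it, which is the behavior described above.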
5 Conclusion
We have analyzed the effects of duplication, the simplest form of redundancy, on the performance of feedforward and feedback neural networks. We have been able to prove that $D$-duplicated networks are equivalent to unduplicated networks that have (1) a $D$-sharpened sigmoid as the transfer function, and (2) normally distributed initial weights. Further, we have been able to prove that the asynchronous operation of such networks using a gradient descent update rule is equivalent to using a synchronous second-order update rule that considers the Hessian of the energy function. By considering the properties of these equivalent unduplicated networks we have shown that the effects of increasing network redundancy are increased speed of convergence, increased solution accuracy, and the ability to use limited-accuracy interneuron communication. We have also shown that the effects of asynchronous operation are faster and more stable network convergence. In light of these results it now appears that the asynchronous, highly redundant nature of biological systems is computationally important and not merely a side-effect of limited neuronal transmission speed and lifetime. One practical consequence of these results is that in computer simulations one can obtain most of the computational advantages of a duplicated network by simply sharpening the transfer function, initializing the weights with normally distributed values, and employing a second-order update rule.
References

Abe, S. 1989. Theories on the Hopfield neural networks. Int. Joint Conf. Neural Networks, Washington, D.C., 1557-1564.
Amari, S. 1978. Mathematics of Neural Networks, 1. Sangyo Tosyo, Tokyo, Japan (in Japanese).
Becker, S., and Cun, Y. L. 1988. Improving the convergence of back-propagation learning with second order methods. Proc. Connectionist Models Summer School, CMU, Pittsburgh, 29-37.
DARPA. 1988. Neural Network Study 31. AFCEA International Press, Fairfax, VA.
Hopfield, J. J., and Tank, D. W. 1985. Neural computation of decisions in optimization problems. Biol. Cybernet. 52, 141-152.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. 1986. Learning internal representations by error propagation. In D. E. Rumelhart, J. L. McClelland, and the PDP Research Group, Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1: Foundations. MIT Press, Cambridge, MA.
Tanaka, H., Matsuda, S., Ogi, H., Izui, Y., Taoka, H., and Sakaguchi, T. 1988. Redundant coding for fault tolerant computing on Hopfield network. Abstr. First Annu. INNS Meeting, Boston, 141.
Uesaka, Y. 1988. On the Stability of Neural Network With the Energy Function Induced from a Real-Valued Function of Binary Variables. IEICE of Japan, Tech. Rep. PRU-88-6 7-14 (in Japanese).
Received 26 December 1989; accepted 9 February 1990.
Communicated by Jack Cowan
Stability of the Random Neural Network Model

Erol Gelenbe
Ecole des Hautes Etudes en Informatique, Université René Descartes (Paris V), 45 rue des Saints-Pères, 75006 Paris, France
In a recent paper (Gelenbe 1989) we introduced a new neural network model, called the Random Network, in which "negative" or "positive" signals circulate, modeling inhibitory and excitatory signals. These signals can arrive either from other neurons or from the outside world: they are summed at the input of each neuron and constitute its signal potential. The state of each neuron in this model is its signal potential, while the network state is the vector of signal potentials at each neuron. If its potential is positive, a neuron fires, and sends out signals to the other neurons of the network or to the outside world. As it does so its signal potential is depleted. We have shown (Gelenbe 1989) that in the Markovian case, this model has product form, that is, the steady-state probability distribution of its potential vector is the product of the marginal probabilities of the potential at each neuron. The signal flow equations of the network, which describe the rate at which positive or negative signals arrive at each neuron, are nonlinear, so that the existence and uniqueness of their solutions are not easily established except for the case of feedforward (or backpropagation) networks (Gelenbe 1989). In this paper we show that whenever the solution to these signal flow equations exists, it is unique. We then examine two subclasses of networks, balanced and damped networks, and obtain stability conditions in each case. In practical terms, these stability conditions guarantee that the unique solution can be found to the signal flow equations and therefore that the network has a well-defined steady-state behavior.

1 Introduction
We consider a network of $n$ neurons in which positive and negative signals circulate. Each neuron accumulates signals as they arrive, and can fire if its total signal count at a given instant of time is positive. Firing then occurs at random according to an exponential distribution of constant rate, and it sends signals out to other neurons or to the outside of the network. In this model, each neuron $i$ of the network is represented at any time $t$ by its input signal potential $k_i(t)$, which we shall simply call the potential.

Neural Computation 2, 239-247 (1990) © 1990 Massachusetts Institute of Technology
Positive and negative signals have different roles in the network; positive signals represent excitation, while negative signals represent inhibition. A negative signal reduces by 1 the potential of the neuron at which it arrives (i.e., it "cancels" an existing signal) or has no effect on the signal potential if it is already zero, while an arriving positive signal adds 1 to the neuron potential. The potential at a neuron is constituted only by positive signals that have accumulated, have not yet been cancelled by negative signals, and have not yet been sent out by the neuron as it fires. Signals can either arrive at a neuron from the outside of the network (exogenous signals) or from other neurons. Each time a neuron fires, a signal leaves it, depleting the total input potential of the neuron. A signal that leaves neuron $i$ heads for neuron $j$ with probability $p^+(i,j)$ as a positive (or normal) signal, or as a negative signal with probability $p^-(i,j)$, or it departs from the network with probability $d(i)$. Let $p(i,j) = p^+(i,j) + p^-(i,j)$; it is the transition probability of a Markov chain representing the movement of signals between neurons. We shall assume that $p^+(i,i) = 0$ and $p^-(i,i) = 0$; though the former assumption is not essential, we insist that the latter is essential to our model. This assumption excludes the possibility of a neuron sending a signal directly to itself. Clearly we shall have

$$\sum_j p(i,j) + d(i) = 1 \quad \text{for } 1 \le i \le n$$
A neuron is capable of firing and emitting signals if its potential is strictly positive. We assume that exogenous signals arrive at each neuron in Poisson streams of positive or negative signals. In Gelenbe (1989) we show that the purely Markovian version of this network, with positive signals that arrive at the $i$th neuron according to a Poisson process of rate $\Lambda(i)$, negative signals that arrive at the $i$th neuron according to a Poisson process of rate $\lambda(i)$, iid exponential neuron firing times with rates $r(1), \ldots, r(n)$, and Markovian movements of signals between neurons, has a product form solution. That is, the network's stationary probability distribution can be written as the product of the marginal probabilities of the state of each neuron. Thus in steady state the network's neurons are seemingly independent, though they are in fact coupled via the signals that move from one neuron to the other in the network. The model we propose has associative memory capabilities, as we shall see in Example 1. It also has a certain number of interesting features:

1. It appears to represent more closely the manner in which signals are transmitted in a biophysical neural network, where they travel as voltage spikes rather than as fixed signal levels.
2. It is computationally efficient in the feedforward case, and whenever network stability can be shown, as for balanced and damped networks.
3. It is closely related to the connexionist model (Gelenbe 1989) and it is possible to go from one model to the other.
4. It represents neuron potential, and therefore the level of excitation, as an integer rather than as a binary variable, which leads to more detailed information on system state.
As one may expect from previous models of neural networks (Kandel and Schwartz 1985), the signal flow equations that yield the rate of signal arrival and hence the rate of firing of each neuron in steady state are nonlinear. Thus in Gelenbe (1989) we were able to establish the existence of their solutions (and also a method for computing them) only in the case of feedforward networks, that is, in networks where a signal cannot return eventually to a neuron that it has already visited either in negative or positive form. This of course covers the case of backpropagation networks (Kandel and Schwartz 1985). In this paper we deal with networks with feedback. We are able to establish uniqueness of solutions whenever existence can be shown. Then we show existence for two classes of networks: "balanced" and "damped" networks.

2 General Properties of the Random Network Model
The following property, proved in Gelenbe (1989), states that the steady-state probability distribution of network state can always be expressed as the product of the probabilities of the states of each neuron. Thus the network in steady state is seemingly composed of independent neurons, though this is obviously not the case. Let $k(t)$ be the vector of signal potentials at time $t$, and $k = (k_1, \ldots, k_n)$ be a particular value of the vector; we are obviously interested in the quantity $p(k, t) = \text{Prob}[k(t) = k]$. Let $p(k)$ denote the stationary probability distribution

$$p(k) = \lim_{t \to \infty} \text{Prob}[k(t) = k]$$

if it exists.
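Theorem 1. (Gelenbe 1989) Let

$$q_i = \lambda^+(i)/[r(i) + \lambda^-(i)] \qquad (2.1)$$

where the $\lambda^+(i)$, $\lambda^-(i)$ for $i = 1, \ldots, n$ satisfy the following system of nonlinear simultaneous equations:

$$\lambda^+(i) = \sum_j q_j r(j) p^+(j,i) + \Lambda(i), \qquad \lambda^-(i) = \sum_j q_j r(j) p^-(j,i) + \lambda(i) \qquad (2.2)$$

If a unique nonnegative solution $\{\lambda^+(i), \lambda^-(i)\}$ exists to equations 2.1 and 2.2 such that each $q_i < 1$, then

$$p(k) = \prod_{i=1}^{n} [1 - q_i]\, q_i^{k_i} \qquad (2.3)$$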
Corollary 1.1. The probability that neuron $i$ is firing in steady state is simply given by $q_i$, and the average neuron potential in steady state is simply $A_i = q_i/[1 - q_i]$.
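Corollary 1.1 can be checked numerically for a single neuron. The sketch below assumes the geometric marginal $(1 - q_i) q_i^{k_i}$ implied by the product form of Theorem 1 and sums it over a truncated range of potentials; the function names and the truncation point `kmax` are illustrative choices.

```python
def marginal(q, k):
    """Stationary probability that a neuron with parameter q has potential k."""
    return (1.0 - q) * q ** k


def firing_probability(q, kmax=500):
    """Probability that the potential is strictly positive (truncated sum)."""
    return sum(marginal(q, k) for k in range(1, kmax))


def mean_potential(q, kmax=500):
    """Average potential under the geometric marginal (truncated sum)."""
    return sum(k * marginal(q, k) for k in range(kmax))


if __name__ == "__main__":
    for q in (0.25, 0.5, 0.75):
        print(q, firing_probability(q), mean_potential(q), q / (1.0 - q))
```

For each $q$ the summed firing probability agrees with $q$ and the summed mean agrees with $q/(1 - q)$, up to truncation error.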
By Theorem 1 we are guaranteed a stationary solution of product form provided the nonlinear signal flow equations have a nonnegative solution. The following result guarantees uniqueness in general.

Theorem 2. If the solutions to equations 2.1 and 2.2 exist with $q_i < 1$, then they are unique.
Proof. Since $\{k(t) : t \ge 0\}$ is an irreducible and aperiodic Markov chain (Gelenbe and Pujolle 1986), if a positive stationary solution $p(k)$ exists, then it is unique. By Theorem 1, if solutions $0 < q_i < 1$ to equations 2.1 and 2.2 exist for $i = 1, \ldots, n$, then $p(k)$ is given by equation 2.3 and is clearly positive for all $k$. Suppose now that for some $i$ there are two different values $q_i$, $q_i'$ satisfying equations 2.1 and 2.2. But this implies that $\lim_{t \to \infty} P[k_i(t) = 0]$ has two different values $[1 - q_i]$ and $[1 - q_i']$, which contradicts the uniqueness of $p(k)$; hence the result.
We say that a network is feedforward if for any sequence $i_1, \ldots, i_s, \ldots, i_r, \ldots, i_m$ of neurons, $i_s = i_r$ for $r > s$ implies

$$\prod_{v=1}^{m-1} p(i_v, i_{v+1}) = 0$$
Theorem 3. (Gelenbe 1989) If the network is feedforward, then the solutions $\lambda^+(i)$, $\lambda^-(i)$ to equations 2.1 and 2.2 exist and are unique.

The main purpose of this paper is to extend the class of networks for which existence of solutions to equations 2.1 and 2.2 is established. We shall deal with balanced networks and with damped networks.
Example 1. [A feedback network with associative memory for (0,0), (1,0), (0,1)] The system is composed of two neurons shown in Figure 1. Neurons 1 and 2 receive flows of positive signals of rates $\Lambda(1)$ and $\Lambda(2)$, respectively. A signal leaving neuron 1 enters neuron 2 as a negative signal and a signal leaving neuron 2 enters neuron 1 as a negative signal: $p^-(1,2) = p^-(2,1) = 1$. The network is an example of a "damped" network, discussed in Section 4.
[Figure 1: The neural network with positive and negative signals examined in Example 1.]
where vectors $k$ with negative elements are to be ignored and $1[X]$ takes the value 1 if $X$ is true and 0 otherwise. According to Theorems 1 and 2 the unique solution to these equations, if it exists, is

$$p(k) = (1 - u)(1 - v)\, u^{k_1} v^{k_2}$$

if $u < 1$, $v < 1$, where $u = \Lambda(1)/[r(1) + \lambda^-(1)]$, $v = \Lambda(2)/[r(2) + \lambda^-(2)]$, with $\lambda^-(2) = u\,r(1)$, $\lambda^-(1) = v\,r(2)$. Since $u$, $v$ are solutions to two simultaneous second degree equations, existence problems for this example are simple. For instance when $\Lambda(1) = \Lambda(2) = r(1) = r(2) = 1$ we obtain $\lambda^-(1) = \lambda^-(2) = 0.5[5^{1/2} - 1]$, so that $u$, $v$ in the expression for $p(k)$ become $u = v = 2/[1 + 5^{1/2}] = 0.618$, so that the average potential at each neuron is $A_1 = A_2 = 1.618$. If we set $\Lambda(1) = 1$, $\Lambda(2) = 0$, with the same values of $r(1) = r(2) = 1$, we see that neuron 1 saturates (its average potential becomes infinite), while the second neuron's input potential is zero, and vice versa if we set $\Lambda(1) = 0$, $\Lambda(2) = 1$. Thus this network recognizes the inputs (0,1) and (1,0).

3 Balanced Neural Networks
We now consider a class of networks whose signal flow equations have a particularly simple solution. We shall say that a network with negative and positive signals is balanced if the ratio
$$S_i = \Big[\sum_j q_j r(j) p^+(j,i) + \Lambda(i)\Big] \Big/ \Big[\sum_j q_j r(j) p^-(j,i) + \lambda(i) + r(i)\Big]$$

is identical for any $i = 1, \ldots, n$. This in effect means that all the $q_i$ are identical.

Theorem 4. The signal flow equations 2.1 and 2.2 have a (unique) solution if the network is balanced.

Proof. From equations 2.1 and 2.2 we write

$$q_i = \Big[\sum_j q_j r(j) p^+(j,i) + \Lambda(i)\Big] \Big/ \Big[\sum_j q_j r(j) p^-(j,i) + \lambda(i) + r(i)\Big] \qquad (3.1)$$
If the system is balanced, $q_i = q_j$ for all $i, j$. From equation 3.1 we then have that the common $q = q_i$ satisfies the quadratic equation:

$$q^2 R^-(i) + q[\lambda(i) + r(i) - R^+(i)] - \Lambda(i) = 0 \qquad (3.2)$$

where $R^-(i) = \sum_j r(j) p^-(j,i)$, $R^+(i) = \sum_j r(j) p^+(j,i)$. The positive root of this quadratic equation, which will be independent of $i$, is the solution of interest:

$$q = \big\{ (R^+(i) - \lambda(i) - r(i)) + [(R^+(i) - \lambda(i) - r(i))^2 + 4 R^-(i) \Lambda(i)]^{1/2} \big\} \big/ 2R^-(i)$$
4 Damped Networks
We shall say that a random neural network is damped if $p^+(j,i) \ge 0$, $p^-(j,i) \ge 0$ with the following property:

$$r(i) + \lambda(i) > \Lambda(i) + \sum_j r(j) p^+(j,i), \quad \text{for all } i = 1, \ldots, n$$
Though this may seem to be a strong condition, Example 1 shows that it is of interest. A special class of damped networks that seems to crop up in various examples is those in which all internal signals are inhibitive (such as Example 1) so that $p^+(j,i) = 0$.
Theorem 5. If the network is damped then the signal flow equations 2.1 and 2.2 always have a solution with $q_i < 1$, which is unique by Theorem 2.

Proof. The proof uses a method developed for nonlinear equilibrium equations. It is based on the construction of an $n$-dimensional vector homotopy function $H(q, x)$ for a real number $0 \le x < 1$. Let us define the following $n$-vectors:

$$q = (q_1, \ldots, q_n), \qquad F(q) = [F_1(q), \ldots, F_n(q)]$$

where

$$F_i(q) = \Big[\sum_j q_j r(j) p^+(j,i) + \Lambda(i)\Big] \Big/ \Big[\sum_j q_j r(j) p^-(j,i) + \lambda(i) + r(i)\Big]$$

Clearly, the equation we are interested in is $q = F(q)$, which, when it has a solution in $D = [0,1]^n$, yields the values $q_i$ of Theorem 1. Notice that $F(q): R^n \to R^n$. Notice also that $F(q) \in C^2$. Consider the mappings $F(q): D \to R^n$. We are interested in the interior points of $D$ since we seek solutions $0 < q_i < 1$. Write $D = D^0 \cup \delta D$ where $\delta D$ is the boundary of $D$, and $D^0$ is the set of interior points. Let $y = (y_1, \ldots, y_n)$ where

$$y_i = \Big[\sum_j r(j) p^+(j,i) + \Lambda(i)\Big] \Big/ [\lambda(i) + r(i)]$$

By assumption $y_i < 1$ for all $i = 1, \ldots, n$. Now define

$$H(q, x) = (1 - x)(q - y) + x[q - F(q)], \qquad 0 \le x < 1.$$

Clearly $H(q, 0) = q - y$ and $H(q, 1) = q - F(q)$. Consider

$$H^{-1} = \{q : q \in D,\ H(q, x) = 0 \text{ and } 0 \le x < 1\}$$

We can show that $H^{-1}$ and $\delta D$ have an empty intersection, that is, as $x$ varies from 0 to 1 the solution of $H(q, x) = 0$, if it exists, does not touch the boundary of $D$. To prove this assume the contrary; this implies that for some $x = x^*$ there exists some $q = q^*$ for which $H(q^*, x^*) = 0$ and such that $q_i^* = 0$ or 1 for some $i$. If $q_i^* = 0$ we can write

$$-(1 - x^*)\,y_i - x^* F_i(q^*) = 0 \quad \text{or} \quad x^*/(1 - x^*) = -y_i/F_i(q^*) < 0 \ \Rightarrow\ x^* < 0,$$

which contradicts the assumption about $x$. If on the other hand $q_i^* = 1$, then

$$(1 - x^*)(1 - y_i) + x^*[1 - F_i(q^*)] = 0 \quad \text{or} \quad x^*/(1 - x^*) = -(1 - y_i)/[1 - F_i(q^*)] < 0 \ \Rightarrow\ x^* < 0,$$

because $(1 - y_i) > 0$ and $0 < F_i(q^*) < y_i$ so that $[1 - F_i(q^*)] > 0$, contradicting again the assumption about $x$. Thus $H(q, x) = 0$ cannot have a solution on the boundary $\delta D$ for any $0 \le x < 1$. As a consequence, applying Theorem 3.2.1 of Garcia and Zangwill (1981) (which is a Leray-Schauder form of the fixed-point theorem), it follows that $F(q) = q$ has at least one solution in $D^0$; it is unique as a consequence of Theorem 2.
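The fixed point $q = F(q)$ can be approximated numerically. The sketch below uses plain fixed-point iteration, which is a substitute for illustration and not the homotopy method of the proof; its convergence is not guaranteed by the theorem. The function names and the two-neuron example (mutual inhibition as in Example 1, with hypothetical rates $\Lambda(1) = \Lambda(2) = 0.5$ chosen so that the damping condition holds strictly) are assumptions.

```python
def signal_flow_map(q, r, p_plus, p_minus, big_lambda, small_lambda):
    """Evaluate F(q) componentwise for a network with n neurons."""
    n = len(q)
    result = []
    for i in range(n):
        num = big_lambda[i] + sum(q[j] * r[j] * p_plus[j][i] for j in range(n))
        den = r[i] + small_lambda[i] + sum(
            q[j] * r[j] * p_minus[j][i] for j in range(n))
        result.append(num / den)
    return result


def solve_signal_flow(r, p_plus, p_minus, big_lambda, small_lambda,
                      tol=1e-12, max_iter=10000):
    """Iterate q <- F(q) from q = 0 until successive iterates agree."""
    q = [0.0] * len(r)
    for _ in range(max_iter):
        new_q = signal_flow_map(q, r, p_plus, p_minus, big_lambda, small_lambda)
        if max(abs(a - b) for a, b in zip(q, new_q)) < tol:
            return new_q
        q = new_q
    raise RuntimeError("fixed-point iteration did not converge")


if __name__ == "__main__":
    r = [1.0, 1.0]
    p_plus = [[0.0, 0.0], [0.0, 0.0]]   # no excitatory internal signals
    p_minus = [[0.0, 1.0], [1.0, 0.0]]  # each neuron inhibits the other
    print(solve_signal_flow(r, p_plus, p_minus, [0.5, 0.5], [0.0, 0.0]))
```

For these rates both components settle near 0.366, the positive root of $q^2 + q - 0.5 = 0$, and lie strictly inside $(0, 1)$ as the theorem requires.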
5 Conclusions

We pursue the study of a new type of neural network model, which we had introduced previously, and which had been shown to have a product form steady-state solution (Gelenbe 1989). The nonlinear signal flow equations of the model concerning the negative and positive signals that circulate in the network have been shown to have a unique solution when the network has a feedforward structure (Gelenbe 1989) that is equivalent to backpropagation networks (Rumelhart et al. 1986). In this paper we present new results for networks with feedback, and begin with a simple example that exhibits associative memory capabilities. The key theoretical issue that has to be resolved each time the random neural network with feedback is used is the existence of a solution to the signal flow equations; we therefore first show that these equations have a unique solution whenever the solution exists. We then study two classes of networks that have feedback, balanced networks and damped networks, and obtain conditions for the existence of the solution in each case.

Acknowledgments

The author gratefully acknowledges the hospitality of the Operations Research Department at Stanford University, where this work was carried out during July and August of 1989, and the friendly atmosphere of Dave Rumelhart's seminar in the Psychology Department at Stanford where this research was presented and discussed during the summer of 1989. The work was sponsored by C3-CNRS.
References

Garcia, C. D., and Zangwill, W. I. 1981. Pathways to Solutions, Fixed Points, and Equilibria. Prentice-Hall, Englewood Cliffs, NJ.
Gelenbe, E. 1989. Random neural networks with negative and positive signals and product form solution. Neural Comp. 1, 502-510.
Gelenbe, E., and Pujolle, G. 1986. Introduction to Networks of Queues. Wiley, Chichester and New York.
Kandel, E. C., and Schwartz, J. H. 1985. Principles of Neural Science. Elsevier, Amsterdam.
Rumelhart, D. E., McClelland, J. L., and the PDP Research Group. 1986. Parallel Distributed Processing, Vols. I and II. Bradford Books and MIT Press, Cambridge, MA.
Received 2 November 1989; accepted 9 February 1990.
Communicated by Les Valiant
The Perceptron Algorithm is Fast for Nonmalicious Distributions Eric B. Baum NEC Research Institute, 4 Independence Way, Princeton, NJ 08540 USA
Within the context of Valiant's protocol for learning, the perceptron algorithm is shown to learn an arbitrary half-space in time $O(n^2/\epsilon^3)$ if $D$, the probability distribution of examples, is taken uniform over the unit sphere $S^n$. Here $\epsilon$ is the accuracy parameter. This is surprisingly fast, as "standard" approaches involve solution of a linear programming problem involving $O(n/\epsilon)$ constraints in $n$ dimensions. A modification of Valiant's distribution-independent protocol for learning is proposed in which the distribution and the function to be learned may be chosen by adversaries; however, these adversaries may not communicate. It is argued that this definition is more reasonable and applicable to real world learning than Valiant's. Under this definition, the perceptron algorithm is shown to be a distribution-independent learning algorithm. In an appendix we show that, for uniform distributions, some classes of infinite V-C dimension including convex sets and a class of nested differences of convex sets are learnable.

1 Introduction
The perceptron algorithm was proved in the early 1960s (Rosenblatt 1962) to converge and yield a half space separating any set of linearly separable classified examples. Interest in this algorithm waned in the 1970s after it was emphasized (Minsky and Papert 1969) (1) that the class of problems solvable by a single half space was limited, and (2) that the perceptron algorithm, although converging in finite time, did not converge in polynomial time. In the 1980s, however, it has become evident that there is no hope of providing a learning algorithm that can learn arbitrary functions in polynomial time and much research has thus been restricted to algorithms that learn a function drawn from a particular class of functions. Moreover, learning theory has focused on protocols like that of Valiant (1984), where we seek to classify, not a fixed set of examples, but examples drawn from a probability distribution. This allows a natural notion of "generalization." There are very few classes that have yet been proven learnable in polynomial time, and one of these is the class of half spaces. Thus, there is considerable theoretical interest now in studying the problem of learning a single half space, and so it is natural to reexamine the perceptron algorithm within the formalism of Valiant.

In Valiant's protocol, a class of functions is called learnable if there is a learning algorithm that works in polynomial time independent of the distribution $D$ generating the examples. Under this definition the perceptron learning algorithm is not a polynomial time learning algorithm. However we will argue in Section 2 that this definition is too restrictive. We will consider in Section 3 the behavior of the perceptron algorithm if $D$ is taken to be the uniform distribution on the unit sphere $S^n$. In this case, we will see that the perceptron algorithm converges remarkably rapidly. Indeed we will give a time bound that is faster than any bound known to us for any algorithm solving this problem. Then, in Section 4, we will present what we believe to be a more natural definition of distribution-independent learning in this context, which we will call nonmalicious distribution-independent learning. We will see that the perceptron algorithm is indeed a polynomial time nonmalicious distribution-independent learning algorithm. In Appendix A, we sketch proofs that, if one restricts attention to the uniform distribution, some classes with infinite Vapnik-Chervonenkis dimension such as the class of convex sets and the class of nested differences of convex sets (which we define) are learnable. These results support our assertion that distribution independence is too much to ask for, and may also be of independent interest.

Neural Computation 2, 248-260 (1990) © 1990 Massachusetts Institute of Technology
2 Distribution-Independent Learning
In Valiant's protocol (Valiant 1984), a class $F$ of Boolean functions on $\Re^n$ is called learnable if a learning algorithm $A$ exists that satisfies the following conditions. Pick some probability distribution $D$ on $\Re^n$. $A$ is allowed to call examples, which are pairs $[x, f(x)]$, where $x$ is drawn according to the distribution $D$. $A$ is a valid learning algorithm for $F$ if for any probability distribution $D$ on $\Re^n$, for any $0 < \delta$,

$\theta_t$ and $w_t \cdot x_-^\mu < \theta_t$ for all $\mu$, the algorithm converges in finite time to output a $(w_H, \theta_H)$ such that $w_H \cdot x_+^\mu \ge \theta_H$ and $w_H \cdot x_-^\mu < \theta_H$. We will normalize so that $w_t \cdot w_t = 1$. Note that $|w_t \cdot x - \theta_t|$ is the Euclidean distance from $x$ to the separating hyperplane $\{y : w_t \cdot y = \theta_t\}$. The algorithm is the following. Start with some initial candidate $(w_0, \theta_0)$, which we will take to be $(0, 0)$. Cycle through the examples. For each example, test whether that example is correctly classified. If so, proceed to the next example. If not, modify the candidate by

$$(w_{k+1} = w_k \pm x^\mu,\ \theta_{k+1} = \theta_k \mp 1) \qquad (3.1)$$
where the sign of the modification is determined by the classification of the misclassified example. In this section we will apply the perceptron algorithm to the problem of learning in the probabilistic context described in Section 2, where, however, the distribution D generating examples is uniform on the unit sphere $S^n$. Rather than have a fixed set of examples, we apply the algorithm in a slightly novel way: we call an example, perform a perceptron update step, discard the example, and iterate until we converge to accuracy $\epsilon$.² If we applied the perceptron algorithm in the standard way, it seemingly would not converge as rapidly. We will return to this point at the end of this section.

² We say that our candidate half space has accuracy $\epsilon$ when the probability that it misclassifies an example drawn from D is no greater than $\epsilon$.

Now the number of updates the perceptron algorithm must make to learn a given set of examples is well known to be $O(1/l^2)$, where $l$ is the minimum distance from an example to the classifying hyperplane (see, e.g., Minsky and Papert 1969). In order to learn to $\epsilon$ accuracy in the sense of Valiant, we will observe that for the uniform distribution we do not need to correctly classify examples closer to the target separating hyperplane than $\Omega(\epsilon/\sqrt{n})$. Thus we will prove that the perceptron algorithm will converge (with probability $1 - \delta$) after $O(n/\epsilon^2)$ updates, which will occur after $O(n/\epsilon^3)$ presentations of examples. Indeed take $\theta_t = 0$ so the target hyperplane passes through the origin. Parallel hyperplanes a distance $\kappa/2$ above and below the target hyperplane bound a band B of probability measure (3.2)
(for $n \ge 2$), where $A_n = 2\pi^{(n+1)/2}/\Gamma((n+1)/2)$ is the area of $S^n$ (see Fig. 1). Using the readily obtainable (e.g., by Stirling's formula) bound that $A_{n-1}/A_n < \sqrt{n}$, and the fact that the integrand is nowhere greater than 1, we find that for $\kappa = \epsilon/(2\sqrt{n})$, the band has measure less than $\epsilon/2$. If $\theta_t \ne 0$, a band of width $\kappa$ will have less measure than it would for $\theta_t = 0$. We will thus continue to argue (without loss of generality) by assuming the worst case condition that $\theta_t = 0$. Since B has measure less than $\epsilon/2$, if we have not yet converged to accuracy $\epsilon$, there is no more than probability 1/2 that the next example on which we update will be in B. We will show that once we have made

$$m_0 = \max\left(144 \ln\frac{2}{\delta},\ \frac{48}{\kappa^2}\right)$$
updates, we have converged unless more than 7/12 of the updates are in B. The probability of making this fraction of the updates in B, however, is less than $\delta/2$ if the probability of each update lying in B is not more than 1/2. We conclude with confidence $1 - \delta/2$ that the probability our next update will be in B is greater than 1/2 and thus that we have converged to $\epsilon$-accuracy. Indeed, consider the change in the quantity

$$N(\alpha) = \|\alpha w_t - w_k\|^2 + \|\alpha\theta_t - \theta_k\|^2 \qquad (3.3)$$

when we update.

$$\Delta N = \|\alpha w_t - w_{k+1}\|^2 + \|\alpha\theta_t - \theta_{k+1}\|^2 - \|\alpha w_t - w_k\|^2 - \|\alpha\theta_t - \theta_k\|^2 = \mp 2\alpha w_t \cdot x_\pm \pm 2\alpha\theta_t + \|x\|^2 \pm 2 w_k \cdot x_\pm \mp 2\theta_k + 1 \qquad (3.4)$$

Now note that $\pm(w_k \cdot x_\pm - \theta_k) < 0$ since x was misclassified by $(w_k, \theta_k)$ (else we would not update). Let $A = [\mp(w_t \cdot x_\pm - \theta_t)]$. If $x \in B$, then $A \le 0$. If $x \notin B$, then $A \le -\kappa/2$. Recalling $x^2 = 1$, we see that $\Delta N < 2$ for $x \in B$ and $\Delta N < -\alpha\kappa + 2$ for $x \notin B$. If we choose $\alpha = 8/\kappa$, we find that
Figure 1: The target hyperplane intersects the sphere $S^n$ along its equator (if $\theta_t = 0$) shown as the central line. Points in (say) the upper hemisphere are classified as positive examples and those in the lower as negative examples. The band B is formed by intersecting the sphere with two planes parallel to the target hyperplane and a distance $\kappa/2$ above and below it.
$\Delta N \le -6$ for $x \notin B$. Recall that, for k = 0, with $(w_0, \theta_0) = (0, 0)$, we have $N = \alpha^2 = 64/\kappa^2$. Thus we see that if we have made O updates on points outside B, and I updates on points in B, N < 0 if $6O - 2I > 64/\kappa^2$. But N is positive semidefinite. Once we have made $48/\kappa^2$ total updates, at least 7/12 of the updates must thus have been on examples in B (with m = O + I total updates, $N \ge 0$ requires $I \ge 3m/4 - 8/\kappa^2$, which equals 7m/12 when $m = 48/\kappa^2$). If we assume that the probability of updates falling in B is less than 1/2 (and thus that our hypothesis half space is not yet at $\epsilon$-accuracy), then the probability that more than 7/12 of

$$m_0 = \max\left(144 \ln\frac{2}{\delta},\ \frac{48}{\kappa^2}\right)$$

updates fall in B is less than $\delta/2$. To see this define LE(p, m, r) as the probability of having at most r successes in m independent Bernoulli trials with probability of success p and recall (Angluin and Valiant 1979), for $0 \le \beta \le 1$, that

$$LE[p, m, (1 - \beta)mp] \le e^{-\beta^2 mp/2} \qquad (3.5)$$
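The bound in equation 3.5 can be checked numerically for the values used in this section. The following is a minimal sketch and not part of the original argument: the function names `le` and `av_bound` are hypothetical, `le` sums the exact binomial lower tail, and the choice $\delta = 0.1$ is an arbitrary illustration.

```python
import math

def le(p, m, r):
    """Exact probability of at most r successes in m independent Bernoulli(p) trials."""
    r = math.floor(r)
    return sum(math.comb(m, k) * p**k * (1 - p)**(m - k) for k in range(r + 1))

def av_bound(p, m, beta):
    """Right-hand side of equation 3.5."""
    return math.exp(-beta**2 * m * p / 2)

delta = 0.1                                   # illustrative confidence parameter (assumption)
m = math.ceil(144 * math.log(2 / delta))      # first argument of the max defining m_0
p, beta = 0.5, 1 / 6                          # values used in the text
exact = le(p, m, (1 - beta) * m * p)          # at most 5/12 of the updates fall outside B
bound = av_bound(p, m, beta)
print(m, exact, bound, delta / 2)
```

With these values the exponent $\beta^2 mp/2$ equals m/144, which is why the factor 144 appears in $m_0$: it makes the bound no larger than $\delta/2$.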
Applying this formula with $m = m_0$, $p = 1/2$, $\beta = 1/6$ shows the desired result. We conclude that the probability of making $m_0$ updates without converging to $\epsilon$-accuracy is less than $\delta/2$. However, as it approaches $1 - \epsilon$ accuracy, the algorithm will update only on a fraction $\epsilon$ of the examples. To get, with confidence $1 - \delta/2$, $m_0$ updates, it suffices to call $M = 2m_0/\epsilon$ examples. Thus, we see that the perceptron algorithm converges, with confidence $1 - \delta$, after we have called

$$M = \frac{2}{\epsilon} \max\left(144 \ln\frac{2}{\delta},\ \frac{48n}{\epsilon^2}\right) \qquad (3.6)$$
examples. Each example could be processed in time of order 1 on a "neuron," which computes $w_k \cdot x$ in time 1 and updates each of its "synaptic weights" in parallel. On a serial computer, however, processing each example will take time of order n, so that we have a time of order $O(n^2/\epsilon^3)$ for convergence on a serial computer. This is remarkably fast. The general learning procedure, described in Section 2, is to call $M_0(\epsilon, \delta, n + 1)$ examples and find a separating halfspace, by some polynomial time algorithm for linear programming such as Karmarkar's algorithm. This linear programming problem thus contains $\Omega(n/\epsilon)$ constraints in n dimensions. Even to write down the problem thus takes time $\Omega(n^2/\epsilon)$. The upper time bound to solve this given by Karmarkar (1984) is $O(n^{5.5}\epsilon^{-2})$. For large n the perceptron algorithm is faster by a factor of $n^{3.5}$. Of course it is likely that Karmarkar's algorithm could be proved to work faster than $\Omega(n^{5.5})$ for the particular distribution of examples of interest. If, however, Karmarkar's algorithm requires a number of iterations depending even logarithmically on n, it will scale worse (for large n) than the perceptron algorithm.³

Notice also that if we simply called $M_0(\epsilon, \delta, n + 1)$ examples and used the perceptron algorithm, in the traditional way, to find a linear separator for this set of examples, our time performance would not be nearly as good. In fact, equation 3.2 tells us that we would expect one of these examples to be a distance $O(\epsilon/n^{1.5})$ from the target hyperplane, since we are calling $\Omega(n/\epsilon)$ examples and a band of width $O(\epsilon/n^{1.5})$ has measure $\Omega(\epsilon/n)$. Thus, this approach would take time $\Omega(n^4/\epsilon^3)$, or a factor of $n^2$ worse than the one we have proposed. An alternative approach to learning using only $O(n/\epsilon)$ examples would be to call $M_0(\epsilon/4, \delta, n + 1)$ examples and apply the perceptron algorithm to these until a fraction $1 - \epsilon/2$ had been correctly classified. This would suffice to assure that the hypothesis half space so generated would (with confidence $1 - \delta$) have error less than $\epsilon$, as is seen from Blumer et al. (1987, Theorem A3.3). It is unclear to us what time performance this procedure would yield.

³ We thank P. Vaidya for a discussion on this point.
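The variant of the perceptron algorithm analyzed in this section (call an example, update on a mistake, discard the example) can be sketched as follows. This is an illustration under stated assumptions, not the author's code: the helper names, the target normal `w_t`, the dimension, and the number of examples are all hypothetical choices, and uniform points on the sphere are drawn by normalizing Gaussian vectors.

```python
import math
import random

def random_unit_vector(dim, rng):
    """Uniform point on the unit sphere in R^dim (a normalized Gaussian vector)."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def online_perceptron(w_t, n_examples, rng):
    """Call an example, update on a mistake (equation 3.1), discard it, repeat."""
    w, theta, updates = [0.0] * len(w_t), 0.0, 0
    for _ in range(n_examples):
        x = random_unit_vector(len(w_t), rng)
        is_positive = dot(w_t, x) > 0.0          # target half space with theta_t = 0
        says_positive = dot(w, x) >= theta       # current hypothesis (w_k, theta_k)
        if says_positive != is_positive:
            sign = 1.0 if is_positive else -1.0
            w = [wi + sign * xi for wi, xi in zip(w, x)]   # w_{k+1} = w_k +/- x
            theta -= sign                                   # theta_{k+1} = theta_k -/+ 1
            updates += 1
    return w, theta, updates

def estimate_error(w, theta, w_t, n_test, rng):
    """Fraction of fresh uniform examples misclassified by (w, theta)."""
    wrong = 0
    for _ in range(n_test):
        x = random_unit_vector(len(w_t), rng)
        wrong += (dot(w, x) >= theta) != (dot(w_t, x) > 0.0)
    return wrong / n_test

rng = random.Random(0)
w_t = [1.0, 0.0, 0.0, 0.0, 0.0]                  # hypothetical target normal with |w_t| = 1
w, theta, updates = online_perceptron(w_t, 5000, rng)
print(updates, estimate_error(w, theta, w_t, 2000, rng))
```

Because every update adds an example on its correct side of the target hyperplane, the inner product of the hypothesis with `w_t` never decreases; the printed error estimate gives an empirical feel for how quickly the hypothesis approaches the target.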
4 Nonmalicious Distribution-Independent Learning
Next we propose a modification of the distribution-independence assumption, which we have argued is too strong to apply to real world learning. We begin with an informal description. We allow an adversary (adversary 1) to choose the function f in the class F to present to the learning algorithm A. We allow a second adversary (adversary 2) to choose the distribution D arbitrarily. We demand that (with probability $1 - \delta$) A converge to produce an $\epsilon$-accurate hypothesis g. Thus far we have not changed Valiant's definition. Our restriction is simply that before their choice of distribution and function, adversaries 1 and 2 are not allowed to exchange information. Thus, they must work independently. This seems to us an entirely natural and reasonable restriction in the real world. Now if we pick any distribution and any hyperplane independently, it is highly unlikely that the probability measure will be concentrated close to the hyperplane. Thus, we expect to see that under our restriction, the perceptron algorithm is a distribution-independent learning algorithm for H and converges in time $O(n^2/\epsilon^3\delta^2)$ on a serial computer. If adversary 1 and adversary 2 do not exchange information, the least we can expect is that they have no notion of a preferred direction on the sphere. Thus, our informal demand that these two adversaries do not exchange information should imply, at least, that adversary 1 is equally likely to choose any $w_t$ (relative, e.g., to whatever direction adversary 2 takes as his z axis). This formalizes, sufficiently for our current purposes, the notion of nonmalicious distribution independence.
Theorem 1. Let U be the uniform probability measure on $S^n$ and D any other probability distribution on $S^n$. Let R be any region on $S^n$ of U-measure $\epsilon\delta$ and let x label some point in R. Choose a point y on $S^n$ randomly according to U. Consider the region R' formed by translating R rigidly so that x is mapped to y. Then the probability that the measure $D(R') > \epsilon$ is less than $\delta$.

Proof. Fix any point $z \in S^n$. Now choose y and thus R'. The probability that $z \in R'$ is $\epsilon\delta$. Thus, in particular, if we choose a point p according to D and then choose R', the probability that $p \in R'$ is $\epsilon\delta$. Now assume that there is probability greater than $\delta$ that $D(R') > \epsilon$. Then we arrive immediately at a contradiction, since we discover that the probability that $p \in R'$ is greater than $\epsilon\delta$. ∎
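The contradiction in the proof can also be read as an instance of Markov's inequality. The following restatement is a sketch; the expectation notation $\mathbb{E}_y$ (averaging over the uniformly chosen point y) is introduced here and does not appear in the original.

```latex
\begin{align*}
\mathbb{E}_y\,[D(R')] &= \int \Pr_y[\,p \in R'\,]\; dD(p) = \epsilon\delta, \\
\Pr_y[\,D(R') > \epsilon\,] &\le \frac{\mathbb{E}_y[D(R')]}{\epsilon} = \delta .
\end{align*}
```

The first line exchanges the order in which p and y are chosen and uses the fact that any fixed point lies in R' with probability $\epsilon\delta$. The second line is Markov's inequality for the nonnegative quantity D(R'); the inequality is strict whenever the left side is positive, since the event $D(R') > \epsilon$ then contributes strictly more than $\epsilon$ times its probability to the expectation.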
Corollary 2. The perceptron algorithm is a nonmalicious distribution-independent learning algorithm for half spaces on the unit sphere that converges, with confidence $1 - \delta$ to accuracy $1 - \epsilon$ in time of order $O(n^2/\epsilon^3\delta^2)$ on a serial computer.

Proof Sketch. Let $\kappa' = \epsilon\delta/(2\sqrt{n})$. Apply Theorem 1 to show that a band formed by hyperplanes a distance $\kappa'/2$ on either side of the target hyperplane has probability less than $\delta$ of having measure for examples greater than $\epsilon/2$. Then apply the arguments of the last section, with $\kappa'$ in place of $\kappa$. ∎
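The time bound in Corollary 2 follows from the arithmetic of Section 3 with $\kappa'$ substituted for $\kappa$. The steps below are a worked sketch; the symbols $m_0'$ and $M'$ are introduced here for the primed quantities, and the logarithmic term in the maximum defining $m_0$ is assumed not to dominate.

```latex
\begin{align*}
\kappa' &= \frac{\epsilon\delta}{2\sqrt{n}}
  \quad\Longrightarrow\quad
  m_0' = \frac{48}{\kappa'^2} = \frac{192\,n}{\epsilon^2\delta^2}
  && \text{(updates)} \\
M' &= \frac{2 m_0'}{\epsilon} = O\!\left(\frac{n}{\epsilon^3\delta^2}\right)
  && \text{(examples called)} \\
n \cdot M' &= O\!\left(\frac{n^2}{\epsilon^3\delta^2}\right)
  && \text{(serial time, order } n \text{ per example)}
\end{align*}
```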
5 Summary
We have argued that the distribution independence condition, although tempting theoretically because of elegant results that show one can rapidly gather enough information for learning, is too restrictive for practical investigations. Very few classes of functions are known to be distribution-independent learnable in polynomial time (and arguably these are trivial cases). Moreover, some classes of functions have been shown not to be learnable by construction of small, cryptographically secure subclasses. These results seem to tell us little about learning in the natural world, and we would thus prefer a less restrictive definition of learnable. Finally we argued that distribution-independent learning requires enormous and unreasonable knowledge of the function to be learned, namely that it come from some specific class of finite V-C dimension. We show in Appendix A that, for uniform distributions, at least some classes of infinite V-C dimension are learnable. Motivated by these arguments we computed the speed of convergence of the perceptron algorithm for a simple, natural distribution, uniform on $S^n$. We found it converges with high confidence to accuracy $\epsilon$ in time $O(n^2/\epsilon^3)$ on a serial computer. This is substantially faster, for large n, than the bounds known to us for any other learning algorithm. This speed is obtained, in part, because we used a variant of the perceptron algorithm that, rather than cycling through a fixed set of examples, called a new example for each update step. Finally we proposed what we feel is a more natural definition of learnability, nonmalicious distribution-independent learnability, where although the distribution of examples D and the target concept may both be chosen by adversaries, these adversaries may not collude. We showed that the perceptron learning algorithm is a nonmalicious, distribution-independent polynomial time learning algorithm on $S^n$.
Appendix A Convex Polyhedral Sets Are Learnable for Uniform Distribution
In this appendix we sketch proofs that two classes of functions with infinite V-C dimension are learnable. These classes are the class of convex sets and a class of nested differences of convex sets which we define. These results support our conjecture that full distribution independence is too restrictive a criterion to ask for if we want our results to have interesting applications. We believe these results are also of independent interest.
Theorem 3. The class C of convex sets is learnable in time polynomial in $\epsilon^{-1}$ and $\delta^{-1}$ if the distribution of examples is uniform on the unit square in d dimensions.
Remarks. (1) C is well known to have infinite V-C dimension. (2) So far as we know, C is not learnable in time polynomial in d as well.

Proof Sketch.⁴ We work, for simplicity, in two dimensions. Our arguments can readily be extended to d dimensions. The learning algorithm is to call M examples (where M will be specified). The positive examples are by definition within the convex set to be learned. Let $M_+$ be the set of positive examples. We classify examples as negative if they are linearly separable from $M_+$, i.e., outside of $c_+$, the convex hull of $M_+$. Clearly this approach will never misclassify a negative example, but may misclassify positive examples which are outside $c_+$ and inside $c_t$. To show $\epsilon$-accuracy, we must choose M large enough so that, with confidence $1 - \delta$, the symmetric difference of the target set $c_t$ and $c_+$ has area less than $\epsilon$. Divide the unit square into $k^2$ equal subsquares (see Fig. 2). Call the set of subsquares that the boundary of $c_t$ intersects $I_1$. It is easy to see that the cardinality of $I_1$ is no greater than 4k. The set $I_2$ of subsquares just inside $I_1$ also has cardinality no greater than 4k, and likewise for the set $I_3$ of subsquares just inside $I_2$. If we have an example in each of the squares in $I_2$, then $c_t$ and $c_+$ clearly have symmetric difference at most equal to the area of $I_1 \cup I_2 \cup I_3 \le 12k \times k^{-2} = 12/k$. Thus, take $k = 12/\epsilon$. Now choose M sufficiently large so that after M trials there is less than $\delta$ probability we have not got an example in each of the 4k squares in $I_2$. Thus, we need $LE(k^{-2}, M, 4k) < \delta$. Using equation 3.5, we see that $M = 500/\epsilon^2 \ln \delta$ will suffice. ∎

Actually, one can learn (for uniform distributions) a more complex class of functions formed out of nested convex regions. For any set $\{c_1, c_2, \ldots, c_l\}$ of l convex regions in $\Re^d$, let $R_1 = c_1$ and for $j = 2, \ldots, l$ let $R_j = R_{j-1} \cap c_j$. Then define a concept $f = R_1 - R_2 + R_3 - \ldots R_l$. The class C of concepts so formed we call nested convex sets (see Fig. 3).
This class can be learned by an iterative procedure that peels the onion. Call a sufficient number of examples. (One can easily see that a number polynomial in l, $\epsilon$, and $\delta$ but of course exponential in d will suffice.) Let the set of examples so obtained be called S. Those negative examples that are linearly separable from all positive examples are in the outermost layer. Class these in set $S_1$. Those positive examples that are linearly separable from all negative examples in $S - S_1$ lie in the next layer; call this set of positive examples $S_2$. Those negative examples in $S - S_1$ linearly separable from all positive examples in $S - S_2$ lie in the next layer, $S_3$. In this way one builds up l + 1 sets of examples. (Some of these sets may be empty.) One can then apply the methods of Theorem 3 to build a classifying function from the outside in. If the innermost layer $S_{l+1}$ is (say) negative examples, then any future example is called negative if it is not linearly separable from $S_{l+1}$, or is linearly separable from $S_l$ and not linearly separable from $S_{l-1}$, or is linearly separable from $S_{l-2}$ but not linearly separable from $S_{l-3}$, etc.

⁴ This proof is inspired by arguments presented in Pollard (1984, pp. 22-24). After this proof was completed, the author heard D. Haussler present related, unpublished results at the 1989 Snowbird meeting on Neural Networks for Computing.

Figure 2: The boundary of the target concept $c_t$ is shown. The set $I_1$ of little squares intersecting the boundary of $c_t$ is hatched vertically. The set $I_2$ of squares just inside $I_1$ is hatched horizontally. The set $I_3$ of squares just inside $I_2$ is hatched diagonally. If we have an example in each square in $I_2$, the convex hull of these examples contains all points inside $c_t$ except possibly those in $I_1$, $I_2$, or $I_3$.
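The learning rule in the proof sketch of Theorem 3 (classify a point as positive only if it lies in the convex hull of the positive examples) can be illustrated in two dimensions. This is a minimal sketch under stated assumptions: the function names are hypothetical, the hull is built with the standard monotone chain method rather than anything specified in the text, and fewer than three non-collinear positive examples are treated as an empty hull for simplicity.

```python
def cross(o, a, b):
    """z-component of (a - o) x (b - o); positive for a counterclockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Monotone chain convex hull; returns vertices in counterclockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def learn_convex(examples):
    """Hypothesis c_+: the convex hull of the positively labeled examples."""
    return convex_hull([point for point, label in examples if label])

def classify(hull, q):
    """True if q lies inside or on the hull; a degenerate hull accepts nothing."""
    if len(hull) < 3:
        return False
    n = len(hull)
    return all(cross(hull[i], hull[(i + 1) % n], q) >= 0 for i in range(n))

# Hypothetical labeled sample: the target set is the square [0.25, 0.75]^2.
sample = [((0.25, 0.25), True), ((0.75, 0.25), True), ((0.75, 0.75), True),
          ((0.25, 0.75), True), ((0.5, 0.5), True), ((0.9, 0.9), False)]
hull = learn_convex(sample)
print(hull, classify(hull, (0.5, 0.5)), classify(hull, (0.9, 0.9)))
```

As in the proof sketch, this hypothesis never labels a point outside the target set as positive; its only errors are positive points that fall between the hull and the true boundary.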
Figure 3: $c_1$ is the five-sided region, $c_2$ is the triangular region, and $c_3$ is the square. The positive region $c_1 - c_2 \cap c_1 + c_3 \cap c_2 \cap c_1$ is shaded.
Acknowledgments
I would like to thank L. E. Baum for conversations and L. G. Valiant for comments on a draft. Portions of the work reported here were performed while the author was an employee of Princeton University and of the Jet Propulsion Laboratory, California Institute of Technology, and were supported by NSF Grant DMR-8518163 and agencies of the U.S. Department of Defense including the Innovative Science and Technology Office of the Strategic Defense Initiative Organization.
References
Angluin, D. and Valiant, L. G. 1979. Fast probabilistic algorithms for Hamiltonian circuits and matchings. J. Comp. Systems Sci. 18, 155-193.
Baum, E. B. 1990. On learning a union of half spaces. J. Complex. 5(4).
Blumer, A., Ehrenfeucht, A., Haussler, D., and Warmuth, M. 1987. Learnability and the Vapnik-Chervonenkis Dimension. U.C.S.C. Tech. Rep. UCSC-CRL-8720, and J. ACM, to appear.
Karmarkar, N. 1984. A new polynomial time algorithm for linear programming. Combinatorica 4, 373-395.
Kearns, M. and Valiant, L. 1989. Cryptographic limitations on learning Boolean formulae and finite automata. Proc. 21st ACM Symp. Theory of Computing, 433-444.
Minsky, M. and Papert, S. 1969. Perceptrons: An Introduction to Computational Geometry. MIT Press, Cambridge, MA.
Pollard, D. 1984. Convergence of Stochastic Processes. Springer-Verlag, New York.
Rosenblatt, F. 1962. Principles of Neurodynamics. Spartan Books, New York.
Schapire, R. 1989. The strength of weak learnability. In Proceedings of the Thirtieth Annual Symposium on Foundations of Computer Science, Research Triangle Park, NC, 28-33.
Valiant, L. G. 1984. A theory of the learnable. Comm. ACM 27(11), 1134-1142.

Received 28 July 1989; accepted 7 December 1989.
REVIEW
Communicated by Scott Kirkpatrick
Parallel Distributed Approaches to Combinatorial Optimization: Benchmark Studies on Traveling Salesman Problem
Carsten Peterson
Department of Theoretical Physics, University of Lund, Solvegatan 14A, S-22362 Lund, Sweden

We present and summarize the results from 50-, 100-, and 200-city TSP benchmarks presented at the 1989 Neural Information Processing Systems (NIPS) postconference workshop using neural network, elastic net, genetic algorithm, and simulated annealing approaches. These results are also compared with a state-of-the-art hybrid approach consisting of greedy solutions, exhaustive search, and simulated annealing.

1 Background
Using neural networks to find approximate solutions to difficult optimization problems is a very attractive prospect. In the original paper (Hopfield and Tank 1985) 10- and 30-city traveling salesman problems (TSP) were studied with very good results for the N = 10 case. For N = 30 the authors report on difficulties in finding optimal parameters. In Wilson and Pawley (1988) further studies of the Tank-Hopfield approach were made with respect to refinements and extension to larger problem sizes. Wilson and Pawley (1988) find the results discouraging. These and other similar findings have created a negative opinion about the entire concept of using neural network algorithms for optimization problems in the community. Recently a novel scheme for mapping optimization problems onto neural networks was developed (Peterson and Söderberg 1989a). The key new ingredient in this method is the reduction of solution space by one dimension by using multistate neurons [Potts spin (Wu 1983)], thereby avoiding the destructive redundancy that plagues the approach of the original work by Tank and Hopfield (1985). The idea of using Potts glass for optimization problems was first introduced by Kanter and Sompolinsky (1987). This encoding was also employed by Van den Bout and Miller (1988). Very encouraging results were found when exploring this technique numerically (Peterson and Söderberg 1989a).

An alternative approach to solve TSP in brain-style computing was developed by Durbin and Willshaw (1987) and Durbin et al. (1989), where a feature map algorithm is used. Basically, an elastic "rubber band" is allowed to expand to touch all cities. The dynamic variables are the coordinates on the band, which vary with a gradient descent prescription on a cleverly chosen energy function. It has recently been demonstrated that there is a strong correspondence between this elastic net algorithm and the Potts approach (Simic 1990; Yuille 1990; Peterson and Söderberg 1989b).

Parallel to these developments genetic algorithms have been developed for solving this kind of problem (Holland 1975; Mühlenbein et al. 1988) with extremely high quality results. Given the above mentioned scepticism toward the neural network approach and the relatively unknown success of the genetic approach we found it worthwhile to test these three different parallel distributed approaches on a common set of problems and compare the results with "standard" simulated annealing. The simulations of the different algorithms were done completely independently at different geographic locations and presented at the 1989 NIPS postconference workshop (Keystone, Colorado). To further increase the value of this minireport we have also included comparisons with a hybrid approach consisting of greedy solutions, exhaustive search, and simulated annealing (Kirkpatrick and Toulouse 1985). The testbeds consisted of 50-, 100-, and 200-city TSP with randomly chosen city coordinates within a unit square. All approaches used an identical set of such city coordinates. The reason for choosing TSP is its wide acceptance as an NP-complete benchmark problem. The problem sizes were selected to be large enough to challenge the algorithms and at the same time feasible with limited CPU availability. Since the neural network approaches are known to have a harder time with random (due to the mean field theory averaging involved) than structured problems (Peterson and Anderson 1988) we chose the former.

Neural Computation 2, 261-269 (1990) © 1990 Massachusetts Institute of Technology

2 The Algorithms
Before comparing and discussing the results we briefly list the key ingredients and parameter choices for each of the algorithms.

2.1 The Potts Neural Network (Peterson and Söderberg 1989a). This algorithm is based on an energy function similar to the one used in the original work by Tank and Hopfield (1985).

(2.1)

In equation 2.1 the first term minimizes the tour length ($D_{ij}$ is the intercity distance matrix), and the second and third terms ensure that each city is visited exactly once. A major novel property is that the condition

$$\sum_a S_{ia} = 1 \qquad (2.2)$$

is always satisfied; the dynamics is confined to a hyperplane rather than a hypercube. Consequently the corresponding mean field equations read (2.3)

where $V_{ia} = \langle S_{ia} \rangle_T$ and the local fields $U_{ia}$ are given by

$$U_{ia} = -\frac{1}{T} \frac{\partial E}{\partial V_{ia}} \qquad (2.4)$$

The mean field equations (equation 2.3) minimize the free energy (F = E - TS) corresponding to E in equation 2.1. A crucial parameter when solving equations 2.3 and 2.4 is the temperature T. It should be chosen in the vicinity of the critical temperature $T_c$. Peterson and Söderberg (1989a) give a method for estimating $T_c$ in advance by estimating the eigenvalue distribution of the linearized version of equation 2.3. This turns out to be very important for obtaining good solutions. For the details of the annealing schedule, choice of $\alpha$, $\beta$, etc. used in this benchmark study we refer to the "black box" prescription in Peterson and Söderberg (1989a, Sect. 7).
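The way condition 2.2 is built into the mean field dynamics can be illustrated with a short sketch. It assumes that the mean field equations take the usual Potts (softmax) form $V_{ia} = e^{U_{ia}} / \sum_b e^{U_{ib}}$, with the local fields of equation 2.4; that form, the function name `potts_update`, and the gradient values are assumptions made for the illustration.

```python
import math

def potts_update(grad_row, temperature):
    """Mean field values V_ia for one city i, given dE/dV_ia for every tour position a."""
    fields = [-g / temperature for g in grad_row]    # U_ia = -(1/T) dE/dV_ia
    top = max(fields)                                 # subtract the max for numerical stability
    weights = [math.exp(u - top) for u in fields]
    total = sum(weights)
    return [w / total for w in weights]               # normalized, so sum_a V_ia = 1

grads = [0.3, 0.1, 0.2]                               # hypothetical energy gradients
hot = potts_update(grads, temperature=100.0)          # high T: nearly uniform
cold = potts_update(grads, temperature=0.01)          # low T: nearly one-hot
print(hot, cold)
```

Each row of V sums to one at every temperature, so the dynamics stays on the hyperplane of condition 2.2; lowering T sharpens the row toward a single position, which is why the choice of T relative to $T_c$ matters.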
2.2 The Elastic Net (Durbin and Willshaw 1987). This approach is more geometric. It is a mapping from a plane to a circle such that each city on the plane is mapped onto a point on the circle (path). The N city coordinates are denoted $x_i$. Points on the path are denoted $y_a$, where $a = 1, \ldots, M$. Note that M can in principle be larger than N. The algorithm works as follows: Start with a small radius circle containing the M $y_a$ coordinates with an origin slightly displaced from the center of gravity for the N cities. Let the y coordinates be the dynamic variables and change them such that the energy
(2.5)
is minimized. Gradient descent on equation 2.5 causes the initial circle to expand in order to minimize the distances between y and x coordinates in the first term of equation 2.5 at the same time as the total length is minimized by the second term. Good numerical results were obtained with this method with M > N (Durbin and Willshaw 1987). The parameter K in equation 2.5 has the role of a temperature and as in the case of the neural network approach above a critical value $K_0$ can be computed from a linear expansion (Durbin et al. 1989). The values of the parameters used in this benchmark (Durbin and Yuille, private communication) can be found in Table 1.

  N     M     α     β
 50   125   0.2   2.0   0.29   300
100   250   0.2   2.0   0.26   300
200   500   0.2   4.0   0.27   182

Table 1: Parameters Used for the Elastic Net Algorithm. The parameters $\alpha$ and $\beta$ are chosen to satisfy conditions for valid tours (Durbin et al. 1989).

This algorithm is closely related to the Potts neural network (Simic 1990; Yuille 1990; Peterson and Söderberg 1989b). Loosely speaking this connection goes as follows. Since the mean field variables $V_{ia}$ are probabilities (cf. equation 2.2) the average distances between tour positions $t_{ab} = |y_a - y_b|$ and average distances between cities and tour positions $d_{ia} = |x_i - y_a|$ can be expressed in terms of the distance matrix $D_{ij}$. The second term in equation 2.5 can then be identified with the tour length term in equation 2.1 if the metric is chosen to be $D_{ij}^2$ rather than $D_{ij}$. The first term in equation 2.5 corresponds to the entropy of the Potts system (equation 2.1); gradient descent on equation 2.5 corresponds to minimizing the free energy of the Potts system, which is exactly what the MFT equations are doing. There is a difference between the two approaches, which has consequences on the simulation level. Each decision element $S_{ia}$ in the Potts neural network approach consists of two binary variables, which in the mean field theory treatment become two analog variables; N cities require $N^2$ analog variables. In the elastic net case N cities only require 2M (M > N) analog variables; it is a more economical way of representing the problem.

2.3 The Genetic Algorithm (Mühlenbein et al. 1988; Gorges-Schleuter 1989). For details we refer the reader to Mühlenbein et al. (1988) and Gorges-Schleuter (1989). Here we briefly list the main steps and the parameters used.

1. Give the problem to M individuals.
2. Let each individual compute a local minimum (2-quick¹).
3. Let each individual choose a partner for mating. In contrast to earlier genetic algorithms (Holland 1975) global ranking of all individuals is not used. Rather, local neighborhoods of size D - 1 were used in which the selection is done with weights $(r_1, r_2, \ldots, r_D)$. The global best gets weight $r_1$ and the remaining local neighbors get $r_2, \ldots, r_D$, respectively.
4. Crossover and mutation. A random string of "genes" is copied from the parent to the offspring. The string size is randomly chosen in the interval $[c_1, c_2]$. Mutation rate = m.
5. If not converged return to point 2.

The parameters used for the benchmarks (Mühlenbein, private communication) are shown in Table 2.

¹ This is the 2-opt of Lin (1965) with no checkout.

  N    M    D   r_1 : r_2 : ... : r_D                  [c_1, c_2]    m      N_{...}
 50   64    8   0.25:0.20:0.15:0.10:0.10:0.10:0.05     [N/4, N/2]    0.01    30
100                                                                          23
200                                                                         562

Table 2: Parameters Used for the Genetic Algorithm. The choice of M = 64 was motivated by the available 64 T800 transputer configuration of Mühlenbein (private communication).

1. $T_0 = 10$; $L_0 = N$.
2. Until variance of cost function < 0.05, update according to T = T/0.8; L = N (heating up).
3. While percentage of accepted moves > 50%, update according to T = 0.95 x T; L = N (cooling).
4. Until number of uphill moves = 0, update according to T = 0.95 x T; L = 16N (slow cooling).

Figure 1: Annealing schedule.

2.4 Simulated Annealing (Kirkpatrick et al. 1983). The parameters of this algorithm are the initial temperature $T_0$, and the annealing schedule, which determines the next value of T and the length of time L spent at each T. The annealing schedule used is very generous (see Fig. 1) and is based on a requirement that the temperature be high enough such that the variance of the energy is less than T:
$(\langle E^2 \rangle - \langle E \rangle^2)/T$ [...] 30 problems." This algorithm has also been tested on larger problem sizes (C. Peterson, unpublished) than presented in this report with no sign of quality deterioration.³ It is somewhat surprising that the performance of the neural network algorithm is of less quality than that of the elastic net given the close connection discussed above. We believe this is due to the fact that the former is more sensitive to the position of $T_c$. Indeed, subsequent modifications of the annealing procedure and choice of $T_c$ (C. Peterson, unpublished) have led to better results.³ Also, the performance of the NN algorithm can be substantially improved when starting out from a greedy solution, heating the system up and letting it relax with the MFT equations.³

³ In order to strictly adhere to the NIPS presentations we did not include these extensions and improvements in this report.
Another important issue is computing time. It splits up into two parts: number of operations (in serial execution) per iteration for a given problem size N and the number of iterations needed for convergence. In Table 4 we compare these numbers for the different algorithms. The convergence times in this table are all empirical. The numbers in Table 4 have limited value since the real strength in the distributed parallel approaches is their inherent parallelism.
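The annealing schedule of Figure 1 can be sketched in code. This is a minimal illustration under stated assumptions, not the benchmark implementation: the move set (random segment reversal), the reading of "variance of cost function" as the variance of the tour lengths seen during one sweep divided by T, the tolerance `EPS`, the safety floor `T_MIN`, and the 12-city test instance are all choices made here for the sketch.

```python
import math
import random
import statistics

EPS = 1e-12      # tolerance so floating-point noise is not counted as an uphill move
T_MIN = 1e-9     # safety floor on T (an added guard, not part of Figure 1)

def tour_length(tour, pts):
    n = len(tour)
    return sum(math.dist(pts[tour[i]], pts[tour[(i + 1) % n]]) for i in range(n))

def sweep(tour, pts, temp, n_moves, rng):
    """Try n_moves random segment reversals at temperature temp (Metropolis rule)."""
    n, cur = len(tour), tour_length(tour, pts)
    accepted = uphill = 0
    lengths = []
    for _ in range(n_moves):
        i, j = sorted(rng.sample(range(n), 2))
        cand = tour[:i] + tour[i:j + 1][::-1] + tour[j + 1:]
        delta = tour_length(cand, pts) - cur
        if delta <= EPS or rng.random() < math.exp(-delta / temp):
            tour, cur = cand, cur + delta
            accepted += 1
            if delta > EPS:
                uphill += 1
        lengths.append(cur)
    return tour, accepted, uphill, lengths

def anneal(pts, rng):
    n = len(pts)
    tour = list(range(n))
    best = tour
    temp = 10.0                                        # step 1: T0 = 10, L0 = N

    def keep_best(candidate):
        nonlocal best
        if tour_length(candidate, pts) < tour_length(best, pts):
            best = candidate

    while True:                                        # step 2: heating up, L = N
        tour, _, _, lengths = sweep(tour, pts, temp, n, rng)
        keep_best(tour)
        if statistics.pvariance(lengths) / temp < 0.05:
            break
        temp /= 0.8
    while temp > T_MIN:                                # step 3: cooling, L = N
        tour, accepted, _, _ = sweep(tour, pts, temp, n, rng)
        keep_best(tour)
        if accepted <= 0.5 * n:
            break
        temp *= 0.95
    while temp > T_MIN:                                # step 4: slow cooling, L = 16N
        tour, _, uphill, _ = sweep(tour, pts, temp, 16 * n, rng)
        keep_best(tour)
        if uphill == 0:
            break
        temp *= 0.95
    return best                                        # shortest tour seen at any sweep end

rng = random.Random(1)
cities = [(rng.random(), rng.random()) for _ in range(12)]   # hypothetical small instance
best_tour = anneal(cities, rng)
print(tour_length(list(range(12)), cities), tour_length(best_tour, cities))
```

Recomputing the full tour length for every candidate keeps the sketch short at a cost of order N per move; a practical implementation would update the length from the two changed edges only.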
Acknowledgments
I would like to thank Richard Durbin, Alan Yuille, Heinz Mühlenbein, and Scott Kirkpatrick for providing simulation results from the elastic net (Durbin and Yuille), genetic algorithm (Mühlenbein), and a hybrid approach (Kirkpatrick) for these comparisons. Also the stimulating atmosphere provided by the organizers of the 1989 NIPS postconference workshop is very much appreciated.
References Durbin, R., and Willshaw, G. 1987. An analogue approach to the travelling salesman problem using an elastic net method. Nature (London) 326, 689. Durbin, R., Szeliski, R., and Yuille, A. 1989. An analysis of the elastic net approach to the traveling salesman problem. Neural Comp. 1,348. Gorges-Schleuter, M. 1989. ASPARAGOS - An asynchronous parallel genetic optimization strategy. Proceedings of the Third International Conference on Genetic Algorithms, D. Schaffer, ed., p. 422. Morgan Kaufmann, San Mateo. Holland, J. H. 1975. Adaption in Natural and Adaptive Systems. University of Michigan Press, Ann Arbor. Hopfield, J. J., and Tank, D. W. 1985. Neural computation of decisions in optimization problems. Biol. Cybmnet. 52, 141. Kanter, I., and Sompolinsky, H. 1987. Graph optimization problems and the Potts glass. J. Phys. A 20, L673. Kirkpatrick, S., and Toulouse, G. 1985. Configuration space analysis of the travelling salesman problem. J. Phys. 46, 1277. Kirkpatrick, S., Gelatt, C. D., and Vecchi, M. P. 1983. Optimization by simulated annealing. Science 220,671. Lin, S. 1965. Computer solutions of the traveling salesman problem. Bell Syst. Techn. J. 44, 2245. Miihlenbein, H., Gorges-Schleuter, M., and Kramer, 0. 1988. Evolution algorithms in combinatorial optimization. Parallel Comp. 7, 65. Peterson, C., and Anderson, J. R. 1988. Applicability of mean field theory neural network methods to graph partitioning. Tech. Rep. MCC-ACA-ST-064-88.
Peterson, C., and Söderberg, B. 1989a. A new method for mapping optimization problems onto neural networks. Int. J. Neural Syst. 1, 3.
Peterson, C., and Söderberg, B. 1989b. The elastic net as a mean field theory neural network. Tech. Rep. LU TP 80-18.
Simic, P. D. 1990. Statistical mechanics as the underlying theory of "elastic" and "neural" optimizations. Tech. Reps. CALT-68-1556 and C3P-787. Network: Comp. in Neural Syst. (in press).
Van den Bout, D. E., and Miller, T. K., III. 1988. A traveling salesman objective function that works. Proceedings of the IEEE International Conference on Neural Networks, p. 299.
Wilson, G. V., and Pawley, G. S. 1988. On the stability of the travelling salesman problem algorithm of Hopfield and Tank. Biol. Cybernet. 58, 63.
Wu, F. Y. 1983. The Potts model. Rev. Modern Phys. 54, 235.
Yuille, A. 1990. Generalized deformable models, statistical physics, and matching problems. Neural Comp. 2, 1-24.
Received 1 February 90; accepted 22 May 90.
Communicated by Fernando Pineda
NOTE
Faster Learning for Dynamic Recurrent Backpropagation
Yan Fang
Terrence J. Sejnowski
The Salk Institute, Computational Neurobiology Laboratory, 10010 N. Torrey Pines Road, La Jolla, CA 92037 USA
The backpropagation learning algorithm for feedforward networks (Rumelhart et al. 1986) has recently been generalized to recurrent networks (Pineda 1989). The algorithm has been further generalized by Pearlmutter (1989) to recurrent networks that produce time-dependent trajectories. The latter method requires much more training time than the feedforward or static recurrent algorithms. Furthermore, the learning can be unstable and the asymptotic accuracy unacceptable for some problems. In this note, we report a modification of the delta weight update rule that significantly improves both the performance and the speed of the original Pearlmutter learning algorithm. Our modified updating rule, a variation on that originally proposed by Jacobs (1988), allows adaptable independent learning rates for individual parameters in the algorithm. The update rule for the $i$th weight, $w_i$, is given by the delta-bar-delta rule:
$$\Delta w_i(t) = -\epsilon_i(t)\,\delta_i(t) \tag{1.1}$$
with the change in learning rate $\epsilon_i(t)$ on each epoch given by
$$\Delta\epsilon_i(t) = \begin{cases} \kappa_i & \text{if } \bar{\delta}_i(t-1)\,\delta_i(t) > 0 \\ -\phi_i\,\epsilon_i(t) & \text{if } \bar{\delta}_i(t-1)\,\delta_i(t) < 0 \\ 0 & \text{otherwise} \end{cases} \tag{1.2}$$
where $\kappa_i$ are parameters for an additive increase, and $\phi_i$ are parameters for a multiplicative decrease in the learning rates $\epsilon_i$, and
Neural Computation 2, 270-273 (1990) © 1990 Massachusetts Institute of Technology
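$$\delta_i(t) = \frac{\partial E(t)}{\partial w_i} \tag{1.3}$$
where $E(t)$ is the total error for epoch $t$, and
$$\bar{\delta}_i(t) = (1 - \theta_i)\,\delta_i(t) + \theta_i\,\bar{\delta}_i(t-1) \tag{1.4}$$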
where $\theta_i$ are momentum parameters. Unlike the traditional delta rule that performs steepest descent on the local error surface, the error gradient vector $\{\delta_i(t)\}$ and the weight update vector $\{\Delta w_i\}$ have different directions. This learning rule assures that the learning rate $\epsilon_i$ will be incremented by $\kappa_i$ if the error derivatives of consecutive epochs have the same sign, which generally means a smooth local error surface. On the other hand, if the error derivatives keep on changing sign, the algorithm decreases the learning rates. This scheme achieves fast parameter estimation while avoiding most cases of catastrophic divergences. In addition to learning the weights, the time constants in dynamic algorithms can also be learned by applying the same procedure. One problem with the above adaptational method is that the learning rate increments, $\kappa_i$, were too large during the late stages of learning when fine adjustments should be made. Scaling the increments to the squared error was found to give good performance:
This introduces a global parameter, $\lambda$, but one that could be broadcast to all weights in a parallel implementation. We simulated the figure "eight" presented in Pearlmutter (1989) using the modified delta-bar-delta updating rule, the result of which is shown in Figure 1a. This is a task for which hidden units are necessary because the trajectory crosses itself. According to the learning curve in Figure 1b, the error decreased rapidly and the trajectory converged within 2000 epochs to values that were better than those reported by Pearlmutter (1989) after 20,000 epochs.¹ We also solved the same problem using a standard conjugate gradient algorithm to update the weights (Press et al. 1988). The conjugate gradient method converged very quickly, but always to local minima (Figure 1c). It has the additional disadvantage in a parallel implementation of requiring global information for the weight updates. We have successfully applied the above adaptational algorithm to other problems for which the original method was unstable and did not produce acceptable solutions. In most of these cases both the speed of learning and the final convergence were significantly improved (Lockery et al. 1990a,b).
¹ We replicated this result, but the original algorithm was very sensitive to the choice of parameters and initial conditions.
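The adaptive-rate scheme described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the quadratic test error, and the initial values are hypothetical, the increment is assumed to be proportional to the current total error (`kappa = lam * error`), and the order in which the rate is adapted and the weight is stepped within an epoch is also an assumption.

```python
import numpy as np

def delta_bar_delta_step(w, eps, delta_bar_prev, grad, error,
                         phi=0.5, theta=0.1, lam=0.01):
    """One epoch of a delta-bar-delta style update (illustrative sketch)."""
    # ASSUMPTION: the additive increment is scaled by the current total error.
    kappa = lam * error
    # Sign agreement between the averaged past gradient and the current one.
    agree = delta_bar_prev * grad
    eps = np.where(agree > 0, eps + kappa,                 # additive increase
                   np.where(agree < 0, eps * (1.0 - phi),  # multiplicative decrease
                            eps))                          # unchanged otherwise
    # ASSUMPTION: the freshly adapted rates are used for this epoch's step.
    w = w - eps * grad
    # Exponentially averaged gradient carried to the next epoch.
    delta_bar = (1.0 - theta) * grad + theta * delta_bar_prev
    return w, eps, delta_bar

def run_demo(n_epochs=100):
    """Minimize a hypothetical quadratic error E = 0.5 * sum(a * w**2)."""
    a = np.array([1.0, 10.0])
    w = np.array([1.0, -2.0])
    eps = np.full(2, 0.01)
    delta_bar = np.zeros(2)
    first_error = 0.5 * np.sum(a * w ** 2)
    for _ in range(n_epochs):
        error = 0.5 * np.sum(a * w ** 2)
        grad = a * w
        w, eps, delta_bar = delta_bar_delta_step(w, eps, delta_bar, grad, error)
    return first_error, 0.5 * np.sum(a * w ** 2)
```

Because each weight carries its own rate, the steep direction of the toy error has its rate halved as soon as its gradient changes sign, while the shallow direction keeps accumulating small increments that shrink as the error falls.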
[Figure 1, panels (a)-(c); the learning curves in (b) and (c) are plotted against Epoch.]
Figure 1: (a) Output from a trained network (solid) plotted against the desired figure (markers) after 1672 learning epochs. Initial weights were randomly sampled from -1.0 to 1.0 and initial time constants from 1.0 to 3.0. An upper limit of 10 and a lower limit of 0.01 were put on the range of the time constants to reduce instabilities. About 75% of the simulation runs produced stable solutions and this example had better than average performance. (b) Learning curve for the same run as in (a). Parameters used: $\phi = 0.5$, $\theta = 0.1$, $\lambda = 0.01$, time step size $\Delta t = 0.25$. Final error $E = 0.005$. Average CPU time per epoch (on a MIPS M/120) was 0.07 sec. Notice the dramatic spiking after the first plateau. (c) Learning curve using a conjugate gradient method started with the same initial weights and time constants. Final error $E = 1.7$. Average CPU time per epoch was 2 sec.
References
Jacobs, R. A. 1988. Increased rates of convergence through learning rate adaptation. Neural Networks 1(4), 295-307.
Lockery, S. R., Fang, Y., and Sejnowski, T. J. 1990a. Neural network analysis of distributed representations of sensorimotor transformations in the leech. In Neural Information Processing Systems 1989, D. Touretzky, ed. Morgan Kaufmann, Los Altos.
Lockery, S. R., Fang, Y., and Sejnowski, T. J. 1990b. A dynamic neural network model of sensorimotor transformations in the leech. Neural Comp. 2, 274-282.
Pearlmutter, B. A. 1989. Learning state space trajectories in recurrent neural networks. Neural Comp. 1(2), 263-269.
Pineda, F. J. 1989. Generalization of back-propagation to recurrent neural networks. Phys. Rev. Lett. 59(19), 2229-2232.
Press, W. H., Flannery, B. P., Teukolsky, S. A., and Vetterling, W. T. 1988. Numerical Recipes in C, Chapter 10. Cambridge University Press, Cambridge.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. 1986. Learning representations by backpropagating errors. Nature (London) 323, 533-536.
Received 8 January 90; accepted 23 May 90.
Communicated by Richard Andersen
A Dynamic Neural Network Model of Sensorimotor Transformations in the Leech
Shawn R. Lockery
Yan Fang
Terrence J. Sejnowski
Computational Neurobiological Laboratory, Salk Institute for Biological Studies, Box 85800, San Diego, CA 92138 USA
Interneurons in leech ganglia receive multiple sensory inputs and make synaptic contacts with many motor neurons. These "hidden" units coordinate several different behaviors. We used physiological and anatomical constraints to construct a model of the local bending reflex. Dynamic networks were trained on experimentally derived input-output patterns using recurrent backpropagation. Units in the model were modified to include electrical synapses and multiple synaptic time constants. The properties of the hidden units that emerged in the simulations matched those in the leech. The model and data support distributed rather than localist representations in the local bending reflex. These results also explain counterintuitive aspects of the local bending circuitry.
1 Introduction
Neural network modeling techniques have recently been used to predict and analyze the connectivity of biological neural circuits (Zipser and Andersen 1988; Lehky and Sejnowski 1988; Anastasio and Robinson 1989). Neurons are represented as simplified processing units and arranged into model networks that are then trained to reproduce the input-output function of the reflex or brain region of interest. After training, the receptive and projective fields of hidden units in the network often bear striking similarities to actual neurons and can suggest functional roles of neurons with inputs and outputs that are hard to grasp intuitively. We applied this approach to the local bending reflex of the leech, a three-layered, feedforward network comprising a small number of identifiable neurons whose connectivity and input-output function have been determined physiologically.
We found that model local bending networks trained using recurrent backpropagation (Pineda 1987; Pearlmutter 1989) to reproduce a physiologically determined input-output function contained hidden units whose connectivity and temporal response properties closely resembled those of identified neurons in the biological network.
Neural Computation 2, 274-282 (1990) © 1990 Massachusetts Institute of Technology
The similarity between model and actual neurons suggested that local bending is produced by distributed representations of sensory and motor information.
2 The Local Bending Reflex
In response to a mechanical stimulus, the leech withdraws from the site of contact (Fig. 1a). This is accomplished by contraction of longitudinal muscles beneath the stimulus and relaxation of longitudinal muscles on the opposite side of the body, resulting in a U-shaped local bend (Kristan 1982). The form of the response is independent of the site of stimulation: dorsal, ventral, and lateral stimuli produce an appropriately oriented withdrawal. Major input to the local bending reflex is provided by four pressure sensitive mechanoreceptors called P cells, each with a receptive field confined to a single quadrant of the body wall (Fig. 1b). Output to the muscles is provided by eight types of longitudinal muscle motor neurons, one to four excitatory and inhibitory motor neurons for each body wall quadrant (Stuart 1970; Ort et al. 1974). Motor neurons are connected by chemical and electrical synapses that introduce the possibility of feedback among the motor neurons. Dorsal, ventral, and lateral stimuli each produce a pattern of P cell activation that results in a unique pattern of activation and inhibition of the motor neurons (Lockery and Kristan 1990a). Connections between sensory and motor neurons are mediated by a layer of interneurons (Kristan 1982). Nine types of local bending interneurons have been identified (Lockery and Kristan 1990b). These comprise the subset of the local bending interneurons that contributes to dorsal local bending because they are excited by the dorsal P cell and in turn excite the dorsal excitatory motor neuron. There appear to be no functional connections between interneurons. Other interneurons remain to be identified, such as those that inhibit the dorsal excitatory motor neurons. Interneuron input connections were determined by recording the amplitude of the postsynaptic potential in an interneuron while each of the P cells was stimulated with a standard train of impulses (Lockery and Kristan 1990b).
Output connections were determined by recording the amplitude of the postsynaptic potential in each motor neuron when an interneuron was stimulated with a standard current pulse. Most interneurons received substantial input from three or four P cells, indicating that the local bending network forms a distributed representation of sensory input (Fig. 1c).
3 Neural Network Model
Because sensory input is represented in a distributed fashion, most interneurons are active in all forms of local bending. Thus, in addition
[Figure 1, panels a-c; panel labels: Left, Right, Dorsal, Ventral, Lateral; legend: excitatory, inhibitory.]
Figure 1: (a) Local bending behavior. Partial view of a leech in response to dorsal, ventral, and lateral stimuli. (b) Local bending circuit. The main input to the reflex is provided by the dorsal and ventral P cells (PD and PV). Control of local bending is largely provided by motor neurons whose field of innervation is restricted to single left-right, dorsal-ventral quadrants of the body; dorsal and ventral quadrants are innervated by both excitatory (DE and VE) and inhibitory (DI and VI) motor neurons. Motor neurons are connected by electrical synapses (resistor symbol) and excitatory (triangle) and inhibitory (filled circle) chemical synapses. Sensory input to motor neurons is mediated by a layer of interneurons. Interneurons that were excited by PD and that in turn excite DE have been identified (hatched); other types of interneurons remain to be identified (open). (c) Input and output connections of the nine types of dorsal local bending interneurons. Within each gray box, the upper panel shows input connections from sensory neurons, the middle panel shows output connections to inhibitory motor neurons, and the lower panel shows output connections to excitatory motor neurons. Box area is proportional to the amplitude of the connection determined from intracellular recordings of interneurons or motor neurons. White boxes indicate excitatory connections and black boxes indicate inhibitory connections. Blank spaces denote connections whose strength has not been determined.
to contributing to dorsal local bending, most interneurons are also active during ventral and lateral bending when some or all of their output effects are inappropriate to the observed behavioral response. This suggests that the inappropriate effects of the dorsal bending interneurons must be offset by other as yet unidentified interneurons and raises the possibility that local bending is the result of simultaneous activation of a population of interneurons with multiple sensory inputs and both appropriate and inappropriate effects on many motor neurons. It was not obvious, however, that such a population was sufficient, given the constraints imposed by the input-output function and connections known to exist in the network. The possibility remained that interneurons specific for each form of the behavior were required to produce each output pattern. To address this issue, we used recurrent backpropagation (Pearlmutter 1989) to train a dynamic network of model neurons (Fig. 2a). The network had four input units representing the four P cells, and eight output units representing the eight motor neuron types. Between input and output units was a single layer of 10 hidden units representing the interneurons. Neurons were represented as single electrical compartments with an input resistance and time constant. The membrane potential ($V_i$) of each neuron was given by
$$T_i \frac{dV_i}{dt} = -V_i + R_i\,(I_e + I_c)$$
where $T_i$ and $R_i$ are the time constant and input resistance of the neuron and $I_e$ and $I_c$ are the sum of the electrical and chemical synaptic currents from presynaptic neurons. Current due to electrical synapses was given by
$$I_e = \sum_j g_{ij}\,(V_j - V_i)$$
where $g_{ij}$ is the coupling conductance between neuron $i$ and $j$. To implement the delay associated with chemical synapses, synapse units (s-units) were inserted between pairs of neurons connected by chemical synapses. The activation of each s-unit was given by
$$T_{ij} \frac{dS_{ij}}{dt} = -S_{ij} + f(V_j)$$
where $T_{ij}$ is the synaptic time constant and $f(V_j)$ was a physiologically determined sigmoidal function ($0 \le f \le 1$) relating pre- and postsynaptic membrane potential at an identified monosynaptic connection in the leech (Granzow et al. 1985). Current due to chemical synapses was given by
$$I_c = \sum_j w_{ij}\,S_{ij}$$
where $w_{ij}$ is the strength of the chemical synapse between units $i$ and $j$. Thus synaptic current is a graded function of presynaptic voltage, a common feature of neurons in the leech (Friesen 1985; Granzow et al. 1985;
[Figure 2, panels (a) and (b); labels include Right and Sensory neurons.]
Figure 2: (a) The local bending network model. Four sensory neurons were connected to eight motor neurons via a layer of 10 interneurons. Neurons were represented as single electrical compartments whose voltage varied as a function of time (see text). Known electrical and chemical connections among motor neurons were assigned fixed connection strengths (g's and w's in the motor layer) determined from intracellular recordings. Interneuron input and output connections were adjusted by recurrent backpropagation. Chemical synaptic delays were implemented by inserting s-units between chemically connected pairs of neurons. S-units with different time constants were inserted between sensory and interneurons to account for fast and slow components of synaptic potentials recorded in interneurons (see Fig. 4). (b) Output of the model network in response to simultaneous activation of both PDs (stim). The response of each motor neuron (rows) is shown before and after training. The desired responses from the training set are shown on the right for comparison (target).
Thompson and Stent 1976) and other invertebrates (Katz and Miledi 1967; Burrows and Siegler 1978; Nagayama and Hisada 1987). Chemical and electrical synaptic strengths between motor neurons were determined by recording from pairs of motor neurons and were not adjusted by the training algorithm. Interneuron input and output connections were given small initial values that were randomly assigned and subsequently adjusted during training. During training, input connections were constrained to be positive to reflect the fact that only excitatory interneuron input connections were seen (Fig. 1c), but no constraints were placed on the number of input or output connections. Synaptic time constants were assigned fixed values. These were adjusted by hand to fit
the time course of motor neuron synaptic potentials (Lockery and Kristan 1990a), or determined from pairwise motor neuron recordings (Granzow et al. 1985).
4 Results
Model networks were trained to produce the amplitude and time course of synaptic potentials recorded in all eight motor neurons in response to trains of P cell impulses (Lockery and Kristan 1990a). The training set included the response of all eight motor neurons when each P cell was stimulated alone and when P cells were stimulated in pairs. After 6,000-10,000 training epochs, the output of the model closely matched the desired output for all patterns in the training set (Fig. 2b). To compare interneurons in the model network to actual interneurons, simulated physiological experiments were performed. Interneuron input connections were determined by recording the amplitude of the postsynaptic potential in a model interneuron while each of the P cells was stimulated with a standard current pulse. Output connections were determined by recording the amplitude of the postsynaptic potential in each motor neuron when an interneuron was stimulated with a standard current pulse. Model and actual interneurons were compared by counting the number of input and output connections that were sufficiently strong to produce a postsynaptic potential of 0.5 mV or more in response to a standard stimulus. Model interneurons (Fig. 3a), like those in the real network (Fig. 1c), received three or four substantial connections from P cells and had significant effects on most of the motor neurons (Fig. 3b and c). Most model interneurons were therefore active during each form of the behavior and the output connections of the interneurons were only partially consistent with each form of the local bending response. Thus the appropriate motor neuron responses were produced by the summation of many appropriate and inappropriate interneuron effects. This result demonstrates that the apparently appropriate and inappropriate effects of dorsal local bending interneurons reflect the contribution these interneurons make to other forms of the local bending reflex.
There was also agreement between the time course of the response of model and actual interneurons to P cell stimulation (Fig. 4). In the actual network, interneuron synaptic potentials in response to trains of P cell impulses had a fast and slow component. Some interneurons showed only the fast component, some only the slow, and some showed both components (mixed). Although no constraints were placed on the temporal response properties of interneurons, the same three types of interneuron were found in the model network. The three different types of interneuron temporal response were due to different relative connection strengths of fast and slow s-units impinging on a given interneuron (Fig. 2a).
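To make the origin of the fast, slow, and mixed responses concrete, the following minimal sketch integrates one model interneuron driven by a single presynaptic cell through one fast and one slow s-unit, using forward Euler steps of the first-order membrane and s-unit dynamics described in Section 3. All numerical values, the logistic form of the sigmoid, and the function names are hypothetical choices for illustration; the paper used a physiologically determined sigmoid and fitted time constants, and electrical coupling is omitted here.

```python
import numpy as np

def sigmoid(v, v_half=10.0, slope=2.0):
    """Hypothetical logistic stand-in for the measured transfer function f."""
    return 1.0 / (1.0 + np.exp(-(v - v_half) / slope))

def simulate_interneuron(w_fast, w_slow, t_fast=10.0, t_slow=500.0,
                         tau_m=10.0, r_in=1.0, dt=0.5, t_end=3000.0,
                         stim=(500.0, 1500.0), v_stim=20.0):
    """Forward-Euler response of one interneuron (times in ms, assumed units)."""
    n = int(round(t_end / dt))
    t = np.arange(n) * dt
    # Presynaptic potential: a step during the stimulus window, zero otherwise.
    v_pre = np.where((t >= stim[0]) & (t < stim[1]), v_stim, 0.0)
    s_fast = s_slow = v = 0.0
    trace = np.empty(n)
    for k in range(n):
        drive = sigmoid(v_pre[k])
        # s-units: first-order low-pass filters of the presynaptic drive.
        s_fast += dt / t_fast * (-s_fast + drive)
        s_slow += dt / t_slow * (-s_slow + drive)
        # Chemical synaptic current: weighted sum of s-unit activations.
        i_c = w_fast * s_fast + w_slow * s_slow
        # Membrane equation without electrical coupling.
        v += dt / tau_m * (-v + r_in * i_c)
        trace[k] = v
    return t, trace
```

Calling `simulate_interneuron(1.0, 0.0)`, `simulate_interneuron(0.0, 1.0)`, and `simulate_interneuron(0.5, 0.5)` gives a rapidly saturating response, a slowly rising response, and a mixture of the two, so in this sketch the response type is set only by the relative weights of the two s-units.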
[Figure 3, panels a-c; legend: excitatory, inhibitory; horizontal axes of (b) and (c): Number of input connections, Number of output connections.]
Figure 3: (a) Input and output connections of model local bending interneurons. Model interneurons, like the actual interneurons, received substantial inputs from three or four sensory neurons and had significant effects on most of the motor neurons. Symbols as in Figure 1c. (b,c) Number of interneurons in model (hatched) and actual (solid) networks with the indicated number of significant input and output connections. Input connections (b) and output connections (c) were considered significant if the synaptic potential in the postsynaptic neuron was greater than 0.5 mV. Counts are given in percent of all interneurons because model and actual networks had different numbers of interneurons.
5 Discussion
Our results show that the network modeling approach can be adapted to models with more realistic neurons and synaptic connections, including electrical connections, which occur in both invertebrates and vertebrates. The qualitative similarity between model and actual interneurons demonstrates that a population of interneurons resembling the identified dorsal local bending interneurons could mediate local bending in a distributed
[Figure 4: columns Data and Model; traces labeled Slow and Fast.]
Figure 4: Actual (data) and simulated (model) synaptic potentials recorded from three types of interneuron. Actual synaptic potentials were recorded in response to a train of P cell impulses (stim). Simulated synaptic potentials were recorded in response to a pulse of current in the P cell, which approximates a step change in P cell firing frequency.
processing system without additional interneurons specific for different forms of local bending. Interneurons in the model also displayed the diversity in temporal responses seen in interneurons in the leech. This represents an advance over our previous model in which temporal dynamics were not represented (Lockery et al. 1989). Clearly, the training algorithm did not produce exact matches between model and actual interneurons, but this was not surprising since the identified local bending interneurons represent only a subset of the interneurons in the reflex. More exact matches could be obtained by using two pools of model interneurons, one to represent identified neurons, the other to represent unidentified neurons. Model neurons in the latter pool would constitute testable physiological predictions of the connectivity of unidentified local bending interneurons.
Acknowledgments
Supported by the Bank of America-Giannini Foundation, the Drown Foundation, and the Mathers Foundation.
References
Anastasio, T., and Robinson, D. A. 1989. Distributed parallel processing in the vestibulo-oculomotor system. Neural Comp. 1, 230-241.
Burrows, M., and Siegler, M. V. S. 1978. Graded synaptic transmission between local interneurones and motor neurones in the metathoracic ganglion of the locust. J. Physiol. 285, 231-255.
Friesen, W. O. 1985. Neuronal control of leech swimming movements: Interactions between cell 60 and previously described oscillator neurons. J. Comp. Physiol. 156, 231-242.
Granzow, B., Friesen, W. O., and Kristan, W. B., Jr. 1985. Physiological and morphological analysis of synaptic transmission between leech motor neurons. J. Neurosci. 5, 2035-2050.
Katz, B., and Miledi, R. 1967. Synaptic transmission in the absence of nerve impulses. J. Physiol. 192, 407-436.
Kristan, W. B., Jr. 1982. Sensory and motor neurons responsible for local bending in the leech. J. Exp. Biol. 96, 161-180.
Lehky, S. R., and Sejnowski, T. J. 1988. Network model of shape-from-shading: Neural function arises from both receptive and projective fields. Nature (London) 333, 452-454.
Lockery, S. R., and Kristan, W. B., Jr. 1990a. Distributed processing of sensory information in the leech. I. Input-output relations of the local bending reflex. J. Neurosci. 10, 1811-1815.
Lockery, S. R., and Kristan, W. B., Jr. 1990b. Distributed processing of sensory information in the leech. II. Identification of interneurons contributing to the local bending reflex. J. Neurosci. 10, 1815-1829.
Lockery, S. R., Wittenberg, G., Kristan, W. B., Jr., and Cottrell, G. W. 1989. Function of identified interneurons in the leech elucidated using neural networks trained by back-propagation. Nature (London) 340, 468-471.
Nagayama, T., and Hisada, M. 1987. Opposing parallel connections through crayfish local nonspiking interneurons. J. Comp. Neurol. 257, 347-358.
Ort, C. A., Kristan, W. B., Jr., and Stent, G. S. 1974. Neuronal control of swimming in the medicinal leech. II. Identification and connections of motor neurones. J. Comp. Physiol. 94, 121-154.
Pearlmutter, B. A. 1989. Learning state space trajectories in recurrent neural networks. Neural Comp. 1, 263-269.
Pineda, F. 1987. Generalization of backpropagation to recurrent neural networks. Phys. Rev. Lett. 59, 2229-2232.
Stuart, A. E. 1970. Physiological and morphological properties of motoneurones in the central nervous system of the leech. J. Physiol. 209, 627-646.
Thompson, W. J., and Stent, G. S. 1976. Neuronal control of heartbeat in the medicinal leech. J. Comp. Physiol. 111, 309-333.
Zipser, D., and Andersen, R. A. 1988. A back-propagation programmed network that simulates response properties of a subset of posterior parietal neurons. Nature (London) 331, 679-684.
Received 25 January 90; accepted 4 April 90.
Communicated by Idan Segev
Control of Neuronal Output by Inhibition at the Axon Initial Segment
Rodney J. Douglas
Department of Physiology, UCT Medical School, Observatory 7925, South Africa
Kevan A. C. Martin
MRC Anatomical Neuropharmacology Unit, Department of Pharmacology, Oxford University, South Parks Road, Oxford OX1 3QT, UK
We examine the effect of inhibition on the axon initial segment (AIS) by the chandelier ("axoaxonic") cells, using a simplified compartmental model of actual pyramidal neurons from cat visual cortex. We show that within generally accepted ranges, inhibition at the AIS cannot completely prevent action potential discharge: only small amounts of excitatory synaptic current can be inhibited. Moderate amounts of excitatory current always result in action potential discharge, despite AIS inhibition. Inhibition of the somadendrite by basket cells enhances the effect of AIS inhibition and vice versa. Thus the axoaxonic cells may act synergistically with basket cells: the AIS inhibition increases the threshold for action potential discharge, the basket cells then control the suprathreshold discharge.
1 Introduction
The action potential output from neurons is generated at a specialized region of the axon called the axon initial segment (AIS). The frequency of the action potential discharge is directly proportional to the amount of inward synaptic current that arrives at the AIS (Fig. 1). Thus inhibition located at the AIS would seem to offer a very effective means of controlling the output of a neuron. In the cerebral cortex, as elsewhere in the brain (e.g., hippocampus, amygdala), a specialized GABAergic neuron (Freund et al. 1983) called the chandelier cell (Szentágothai and Arbib 1974), or axoaxonic cell, makes its synapses exclusively on the AIS of pyramidal neurons (Fig. 1; Somogyi et al. 1982). Another type of putative inhibitory neuron, the basket cell, targets exclusively the soma and dendrites (Fig. 1; see Somogyi et al. 1983; Martin 1988). Many have speculated as to the chandelier cell's exact function, but all agree it probably acts to inhibit the output of pyramidal neurons (see review Peters 1984). The basket cells, which are also GABAergic (Freund et al. 1983), are thought to act
Neural Computation 2, 283-292 (1990) © 1990 Massachusetts Institute of Technology
in a more graded manner to produce the stimulus selective properties of cortical neurons (see Martin 1988). Although chandelier cells have been recorded in the visual cortex and appear to have normal receptive fields (Martin and Whitteridge, unpublished), nothing is known of their synaptic action. Since experimental investigation of their action is not yet possible, we have used computer simulations of cortical neurons to study the inhibitory effects of AIS inhibition and its interaction with somadendritic inhibition by basket cells.
2 Model
We took two pyramidal neurons that had been completely filled by intracellular injection of horseradish peroxidase and reconstructed them in 3-D (TRAKA, CeZTek, UK). The detailed structure of the dendritic arbor and soma were transformed into a simple equivalent neuron that consisted of an ellipsoidal somatic compartment, and three to four cylindrical compartments (Fig. 1) (Douglas and Martin 1990). The dimensions of a typical AIS (approximately 50 × 1 µm) were obtained from electron microscopic studies (Fairén and Valverde 1980; Sloper and Powell 1978). The effects of inhibition of these model cells were investigated using a general neuronal network simulating program (CANON, written by R. J. D.). The program permits neurons to be specified as sets of interconnected compartments, each of which can contain a variety of conductance types. The surfaces of the compartments represented the neuronal membrane. The leak resistance of this membrane was 10 kΩ cm² (a leak conductance of 0.1 mS cm⁻²), and the capacitance $C_m$ was 1 µF cm⁻². The specific intracellular resistivity of the compartments was 0.1 kΩ cm. With these values the somatic input resistances of the two model pyramids were 20.9 and 95.7 MΩ, respectively. In addition to the leak conductances the membranes of the soma, AIS, and nodes of Ranvier (first two nodes) contained active sodium and potassium conductances that mediated Hodgkin-Huxley-like action potentials. The behavior of these compartments and their interaction were computed in the usual way (see, e.g., Getting 1989; Segev et al. 1989; Traub and Llinás 1979). In this study we were concerned only with the effect on the peak action potential discharge and so we did not incorporate the conductances associated with spike adaptation. Action potentials depended only on Hodgkin-Huxley-like sodium spike conductances and delayed potassium conductances.
We found that a maximum sodium spike conductance of 100 mS and a maximum delayed potassium conductance of 60 mS cm⁻² were required to generate normal looking action potentials in the layer 5 pyramidal cell. These values were also then used for the layer 2 pyramid. Similar values were reported by Traub and Llinás (1979) in their simulation of hippocampal pyramidal cells.
[Figure 1, panels a-c; label: L5 Pyramid.]
Figure 1: (a) Schematic that summarizes anatomical data concerning the synaptic input to pyramidal neurons (filled shape) of putative inhibitory synapses (open triangles) derived from basket and chandelier (axoaxonic) cells. Excitatory synapses (filled triangles) are shown making contact with dendritic spines. IS, initial segment. (b) Montage of actual cortical pyramidal neurons from layers 2 and 5, and the idealized simplified model cells used in simulations. Full axon collateral network not shown. Each idealized cell consists of a cylindrical axon initial segment, an ellipsoidal soma (shown here as sphere), and a number of cylinders that represent the dendritic tree. Dimensions of ellipsoid and cylinders were obtained from detailed measurements of actual neurons shown; 100 µm scale bar refers to actual neurons and vertical axis of model neurons; 50 µm scale bar refers to horizontal axis of model neuron only. (c) Current-discharge relationships of model cells shown in (b). In this and the following figures, current-discharge plots are fitted with a power function. For clarity, only the fitted line is shown in the following figures.
Rodney J. Douglas and Kevan A. C. Martin
Average inhibitory conductances ranged between 0.1 and 1 mS cm⁻² on the soma and proximal dendritic cylinder, and between 1 and 10 mS cm⁻² on the AIS. In our simulations the AIS was represented by two cylindrical compartments (each 25 μm long × 1 μm diameter) in series, which were interposed between the soma and the onset of the myelinated axon. Since this AIS of a superficial pyramidal neuron has an area of about 150 × 10⁻⁸ cm², and receives about 50 synapses (Fairén and Valverde 1980; Sloper and Powell 1978; Somogyi et al. 1982), individual inhibitory synapses applied to the AIS compartments had maximum conductances of about 0.3 nS. The average inhibitory conductance was held constant over the period of excitatory current injection in order to model the sustained inhibition that may prevail in the visual cortex during receptive field stimulation (Douglas et al. 1988; Koch et al. 1990). Program CANON was written in TurboPascal. It was executed on a 25-MHz 80386/80387 RM Nimbus VX, and simulated the pyramidal neuron's response to a 200 msec current injection in 10 min.
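The per-synapse figure quoted above follows from simple geometry. The sketch below redoes the arithmetic, assuming the AIS is a plain cylinder (lateral surface only) and using the upper end of the AIS conductance range; the variable names are illustrative and are not taken from CANON.

```python
import math

# Geometry from the text: an AIS about 50 um long and 1 um in diameter.
length_um = 50.0
diameter_um = 1.0

# Lateral surface of a cylinder; 1 um^2 = 1e-8 cm^2.
area_um2 = math.pi * diameter_um * length_um
area_cm2 = area_um2 * 1e-8

# Upper end of the AIS inhibitory conductance range (10 mS/cm^2), in S/cm^2.
density_S_per_cm2 = 10e-3
n_synapses = 50

total_conductance_nS = density_S_per_cm2 * area_cm2 * 1e9
per_synapse_nS = total_conductance_nS / n_synapses

print(f"AIS area: {area_um2:.0f} um^2 = {area_cm2:.2e} cm^2")
print(f"total: {total_conductance_nS:.1f} nS, per synapse: {per_synapse_nS:.2f} nS")
```

Under these assumptions the area comes out near 1.5 × 10⁻⁶ cm² and the per-synapse conductance near 0.3 nS, in line with the values given in the text.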
3 Results and Discussion
The response of the two pyramidal neurons to different values of current injected in the soma is shown in Figure 1. These current-discharge curves are similar to those seen in actual cortical neurons in vitro (Connors et al. 1988). In the model they could best be fitted with a power function (solid lines). The steeper response and higher peak rate for the layer 2 pyramidal cell is due to the decreased current load offered by the smaller dendritic arbor. When inhibition is applied to the AIS (Fig. 2), the threshold current increases, that is, more current is now required before the neuron begins to discharge. The reduction in discharge achieved by AIS inhibition is shown by the difference (dotted line) in the discharge rate between control (solid line) and the inhibited case (dashed line) for the layer 2 pyramid. Surprisingly, the maximum discharge rate is hardly altered at all. The relative effectiveness of the inhibition is conveniently expressed as a percent (difference/control), shown here for the layer 5 pyramid (dotted line). This clearly shows the dramatic fall-off in the effectiveness of the chandelier cell inhibition as more excitatory current reaches the AIS. Thus the chandelier inhibition acts to increase the current threshold and is effective only for small excitatory currents that produce low discharge rates; it is relatively ineffective for currents that produce high discharge rates (75+ spikes/sec). In the model, AIS inhibition achieves its effect by sinking the current that flows into the AIS, and preventing activation of the local sodium spike conductance. The effectiveness of the AIS inhibition could be
Figure 2: (a) Current-discharge relationship of the model layer 2 pyramidal neuron, and (b) of the model layer 5 neuron, before (solid) and during (dashed) inhibition of the AIS. Difference between control and inhibited case shown as dotted line for layer 2 pyramid; expressed as percent inhibition for layer 5 pyramid. Inhibitory conductance, 5 mS cm⁻²; inhibitory reversal potential, −80 mV. Horizontal axes: current (nA), 1 to 4; vertical axis ticks: 0 to 200.
Figure 3: (a-d) Effect of AIS inhibitory conductance and inhibitory reversal potential on current-discharge relationship of the model layer 2 pyramidal neuron, dashed; percent inhibition, dotted. Inhibitory conductance and reversal potential are given above each of the four cases shown. Recoverable panel labels: 10 mS cm⁻², −60 mV; 10 mS cm⁻², −80 mV. Horizontal axis: current (nA).
increased by having a larger inhibitory conductance or a reversal potential for the inhibitory synapse that is much more negative than the resting membrane potential (Fig. 3). But even when these two factors were combined (Fig. 3d) action potential discharge could not be prevented. In this case (Fig. 3d) the current threshold was about 0.6 nA, which is well within the operating range of these neurons. The maximum inhibitory conductance in the AIS is a small fraction (≤ 10%) of the sodium spike conductance. So, although the inhibitory conductance may prevent activation of the spike current, it will not have much effect on the trajectory of the spike once it has been initiated. The AIS spike depolarizes the adjacent active soma to its threshold. The somatic spike, in turn, drives the depolarization of the relatively passive dendritic arbor. The somatic and dendritic trajectories lag behind the AIS
spike, and so are able to contribute excitatory current to the AIS during its redepolarization phase. The rather surprising result that postsynaptic inhibition is relatively ineffective for strong excitatory inputs arises as a consequence of the saturation of the current discharge curve: if sufficient current can be delivered to the AIS, the neuron will always be able to achieve its peak discharge rate, which is determined by the kinetics of the spike conductances, the membrane time constant, and the neuronal input resistance. In the case where the inhibitory synaptic conductances are small and act to hyperpolarize the neuron, the membrane time constant and neuronal input resistance are hardly altered. The current-discharge curve then shifts to the right on the X-axis (Fig. 3a), that is, the threshold increases, but the shape of the curve remains the same as the control. In the case of large inhibitory conductances, the neuronal input resistance is reduced and the membrane time constant shortens. The inhibitory conductance shunts the excitatory current and this has the effect of raising the threshold. However, if the remaining excitatory current is sufficient to drive the membrane to threshold, the shorter time constant permits the membrane to recharge more quickly after each action potential. This means that the neuron can fire faster for a given excitatory current, offsetting some of the effects of inhibition. Hence the current-discharge curve becomes steeper, but achieves approximately the same maximum discharge for the same current input as the control (Fig. 3b). If the axoaxonic inputs alone were required to inhibit the neuron completely via the mechanism modeled here, they could approximate complete blockade only by forcing the threshold to the upper end of the operational range of excitatory current (say 1.5-2 nA). To produce this threshold would require an inhibitory conductance of the same order as the sodium spike conductance. 
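The argument above (a raised threshold, but a nearly unchanged peak rate) can be illustrated with a much simpler cell than the compartmental model. The sketch below uses a single-compartment leaky integrate-and-fire neuron with a constant inhibitory conductance; it is not the CANON model, and every parameter value and function name here is an illustrative assumption.

```python
import math

def lif_rate(i_nA, g_inh_uS=0.0, e_inh_mV=-80.0):
    """Steady firing rate (spikes/sec) of a leaky integrate-and-fire cell.

    Units: nA, uS, mV, nF, ms. All constants are assumed values chosen
    for illustration, not parameters from the paper.
    """
    g_leak, e_leak = 0.05, -65.0       # leak conductance (uS) and reversal (mV)
    c_m = 0.5                          # capacitance (nF); tau = c_m / g_total (ms)
    v_thresh, v_reset = -50.0, -65.0   # spike threshold and reset (mV)
    t_refractory = 5.0                 # ms; caps the rate at 200 spikes/sec

    g_total = g_leak + g_inh_uS
    # Voltage the membrane would settle at if it never spiked.
    v_inf = (g_leak * e_leak + g_inh_uS * e_inh_mV + i_nA) / g_total
    if v_inf <= v_thresh:
        return 0.0
    tau = c_m / g_total                # inhibition shortens the time constant
    isi = t_refractory + tau * math.log((v_inf - v_reset) / (v_inf - v_thresh))
    return 1000.0 / isi

def percent_inhibition(i_nA, g_inh_uS):
    """Percent reduction in rate, (control - inhibited) / control."""
    control = lif_rate(i_nA)
    if control == 0.0:
        return 0.0
    return 100.0 * (control - lif_rate(i_nA, g_inh_uS)) / control

for current in (1.0, 2.0, 3.0, 5.0, 20.0):
    print(current, round(lif_rate(current), 1),
          round(percent_inhibition(current, 0.05), 1))
```

In this toy cell the inhibitory conductance silences the neuron for small currents, while for large currents the shorter time constant lets the inhibited rate approach the control rate, so percent inhibition falls toward zero as current grows.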
If excitatory input did then exceed the threshold, the neuron would immediately respond with a high discharge rate: a catastrophic failure of inhibition. Such high conductances have not been seen experimentally. A volley of excitation, which would activate nonspecifically all inhibitory neurons, elicits inhibitory conductances in cortical pyramidal cells in vivo that are less than 20% of their input conductance (see Martin 1988). Basket cell axons terminate on the soma and proximal dendrites of pyramidal cells (Somogyi et al. 1983; see Martin 1988). We modeled their inputs by applying a uniform conductance change to the soma and the proximal dendritic cylinder. The effect of basket cell inhibition was similar to axoaxonic inhibition (Fig. 4). The increase in threshold and steepened slope of the current-discharge curve seen in the model have also been observed in vitro when the GABA_B agonist baclofen was applied to cortical neurons (Connors et al. 1988). We simulated the combined action of AIS and basket inhibition. When the basket inhibition is relatively hyperpolarizing, the AIS and basket mechanisms sum together (Fig. 4c). When the basket inhibition is relatively shunting, the two mechanisms
[Figure residue: panel labels in the plot include conductances of 0.5 and 0.2 mS cm⁻² and reversal potentials of −60 and −80 mV; no caption survives.]
$C(y)$, lowering $C(y)$ necessarily lowers the "power" expenditure at all stages up to $y$, which is why we feel equation 1.9 could be biologically more significant. Our hypothesis is similar to Barlow's redundancy reduction hypothesis (Barlow 1961), with the two becoming identical when the system is free of noise $\nu$. In this limit, redundancy is reduced by diagonalizing the correlation matrix $R_0$, by choosing the transfer matrix $A$ such that $R_{yy} = A R_0 A^T$ is diagonal. With $R_{yy}$ diagonal, the relationship $\det(R_{yy}) \le \prod_i (R_{yy})_{ii}$ becomes an equality giving $C(y) = I(y, s)$, so the redundancy (1.9) is eliminated. [In reality, the redundancy (1.9) is a lower bound reflecting the fact that we chose probability distributions which take into account only second-order correlators. More complete knowledge of $P(s)$ would lower $I(z, s)$ and $I(y, s)$ and therefore increase $\mathcal{R}$.] Where reducing $\mathcal{R}$ in equation 1.9 differs considerably from Barlow's hypothesis is in the manner of redundancy reduction when noise is significant. Under those circumstances, $\mathcal{R}$ in equation 1.9 is sizable, not because of correlations in the signal, but because much of the channel capacity is wasted carrying noise. Reducing equation 1.9 when the noise is large has the effect of increasing the signal-to-noise ratio. To do this the system actually increases correlations (more precisely, increases the amplitude of the correlated signal relative to the noise amplitude), since correlations are what distinguish signal from noise. For large enough noise, more is gained by lowering the noise in this way than is lost by increasing correlations. For an intermediate regime, where signal and noise are comparable, our principle leads to a compromise solution, which locally accentuates correlations, but on a larger scale reduces them. All these facts can be seen by examining the properties of the explicit solution given below.
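The determinant inequality used in the noise-free argument is easy to check numerically. The sketch below takes a hypothetical 2 × 2 correlation matrix (the value 0.8 is an arbitrary choice), measures the gap between the log of the product of the diagonal entries and the log of the determinant, and shows that the gap vanishes after a rotation that diagonalizes the matrix. The helper names are illustrative.

```python
import math

def matmul(a, b):
    """Product of two 2x2 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def transpose(m):
    """Transpose of a 2x2 matrix."""
    return [[m[j][i] for j in range(2)] for i in range(2)]

def hadamard_gap(r):
    """log(product of diagonal entries) - log(det r).

    Non-negative for a positive definite matrix, and zero when r is diagonal.
    """
    det = r[0][0] * r[1][1] - r[0][1] * r[1][0]
    return math.log(r[0][0] * r[1][1]) - math.log(det)

# Hypothetical correlation matrix for two inputs with correlation 0.8.
R = [[1.0, 0.8], [0.8, 1.0]]

# A 45-degree rotation diagonalizes this particular R.
c = 1.0 / math.sqrt(2.0)
A = [[c, c], [-c, c]]
R_yy = matmul(matmul(A, R), transpose(A))

print("gap before:", hadamard_gap(R))
print("gap after: ", hadamard_gap(R_yy))
```

Because the rotation is orthogonal the determinant is unchanged, so all of the reduction comes from shrinking the product of the diagonal entries.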
Before we proceed, it should also be noted that Linsker (1986) has hypothesized that the purpose of the encoding A should be to maximize the mutual information I(y, s), subject to some constraints. This differs from the principle in this paper, which focuses on lowering the output channel capacity while maintaining the minimum information needed by the organism. While both principles may be useful to gain insight into the purposes of neural processing in various portions of the brain, in early visual processing we believe that the primary evolutionary pressure has been to reduce output channel capacity. For example, due to much lower resolution in peripheral vision, the amount of information arriving at the retina is far greater than the information kept. It is difficult to believe that this design is a consequence of inherent local biological hardware constraints, since higher resolution hardware is clearly feasible, as seen in the fovea.
Theory of Early Visual Processing
2 Explicit Solution
To actually minimize $\mathcal{R}$ we use a Lagrange multiplier $\lambda$ to implement the constraint (equation 1.10) and minimize

$$E\{A\} = C(y) - \lambda\,[I(y, s) - I(z + \delta, s)] \qquad (2.1)$$
with respect to the transfer function $A$, where $C(y)$, $I(z, s)$, and $I(y, s)$ are given in equations 1.8, 1.4, and 1.5, respectively. One important property of $R[n, m]$ (in equation 1.2) that we shall assume is translation invariance, $R[n, m] = R[n - m]$, which is a consequence of the homogeneity of the ensemble of all visual scenes. We can take advantage of this symmetry to simplify our formulas by assuming $A[n, m] = A[n - m]$. With this assumption, the diagonal elements $(R_{yy})_{ii}$ in equation 1.7 are all equal and hence minimizing $C(y)$ is equivalent to minimizing the simpler expression $\mathrm{Tr}(A R A^T)$. Using the identity $\log(\det B) = \mathrm{Tr}(\log B)$ for any positive definite matrix $B$, and replacing $C(y)$ by $\mathrm{Tr}(A R A^T)$, equation 2.1 becomes

$$E\{A\} = \frac{1}{N_\delta^2}\int d\omega\, A(\omega) R(\omega) A(-\omega) - \lambda \int d\omega \left\{ \log\frac{A(\omega) R(\omega) A(-\omega) + N_\delta^2}{A(\omega) N^2 A(-\omega) + N_\delta^2} - \log\frac{R(\omega) + N_\delta^2}{N^2 + N_\delta^2} \right\} \qquad (2.2)$$
where all variables are defined in momentum space through the standard discrete two-dimensional Fourier transform, for example,

$$A(\omega) = A(\omega_1, \omega_2) = \sum_m e^{-i m \cdot \omega} A[m]$$
It is straightforward to see from equation 2.2 that the optimal transfer function $A(\omega)$ satisfies the following quadratic equation:
where we have defined $F(\omega) = A(\omega) \cdot A(-\omega)/N_\delta^2$. The fact that $A$ appears only through $F$ is a manifestation of the original invariances of $I$ and $C$ under orthogonal transformations on the transfer function $A[m]$, that is, under $A \to U A$ with $U^T U = 1$. Equation 2.3 has only one positive solution for $F$, which is given explicitly by (2.4)
Joseph J. Atick and A. Norman Redlich
where $\lambda$ is determined by solving $I(y, s) = I(z + \delta, s)$. After eliminating $F$ the latter equation becomes
In general, equation 2.5 must be solved for $\lambda$ numerically. The fact that the transfer function $A$ appears only through $F$ leads to a multitude of degenerate solutions for $A$, related to each other by orthogonal transformations. What chooses among them has to be some principle of minimum effort in implementing such a transfer function. For example, some of the solutions are nonlocal (by local we mean a neighborhood of a point n on the input grid is mapped to the neighborhood of the corresponding point n on the output grid), so they require more elaborate hardware to implement; hence we examine local solutions. Among these is a unique solution satisfying $A(\omega) = A(-\omega)$, which implies that it is rotationally invariant in coordinate space. We compare it to the observed retinal transfer function (ganglion kernel), known to be rotationally symmetric. Since rotation symmetry is known to be broken at the simple cell level, it is significant that this formalism is also capable of producing solutions that are not rotationally invariant even when the correlation function is. It may be that the new features of the class of transfer functions at that level (for example, divergence factor) will lift the degeneracy in favor of the nonsymmetric solutions. (In fact, in one dimension we find solutions that break parity and look like one-dimensional simple cell kernels.) The rotationally invariant solution is obtained by taking the square root of $F$ in equation 2.4 (we take the positive square root, corresponding to on-center cells). In what follows, we examine some of its most important properties. To be specific, we parameterize the correlation function by a decaying exponential
with $D$ the correlation length measured in acuity units and $S$ the signal amplitude. We have done numerical integration of equations 2.4 and 2.5 and determined $A[m]$ for several values of the parameters. In Figure 2, we display one typical solution, which was obtained with $S/N = 2.0$, $D = 50$, and $N_\delta = 0.025$. In that figure, empty disks represent positive (excitatory), while solid disks represent negative (inhibitory) components of $A[m]$. Also, the logarithm of the area of a disk is directly related to the amplitude of the component of $A[m]$ at that location. As one can see, the solution has a strong and rather broad excitatory center with a weaker and more diffuse surround. A very significant feature of the theoretical profiles is their insensitivity to $D$ (and to $N_\delta$), which is necessary to account for the fact that the observed profiles measured in acuity units are similar in different species and at different eccentricities.
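The per-frequency part of such a numerical solution can be sketched as follows. This is an illustration under stated assumptions, not the printed equations 2.3 and 2.4: it assumes that, up to terms independent of $A$, the quantity minimized at each frequency is $F R - \lambda \log[(F R + 1)/(F N^2 + 1)]$ with $F = A(\omega)A(-\omega)/N_\delta^2$ and $R = R_0 + N^2$, where $R_0$ (the signal part of the correlation) is a symbol introduced here. Setting the derivative to zero gives a quadratic in $F$ with a single positive root. The multiplier `lam` is simply fixed by hand, whereas in the text it is determined by the constraint.

```python
import math

def optimal_F(r0, n2, lam):
    """Positive root F of (F*R + 1)*(F*n2 + 1) = lam*r0/R, with R = r0 + n2.

    r0: signal power, n2: input noise power, lam: Lagrange multiplier
    (all assumed, illustrative quantities). Returns 0.0 when no positive
    root exists.
    """
    r = r0 + n2
    target = lam * r0 / r
    if target <= 1.0:
        return 0.0
    a, b, c = r * n2, r + n2, 1.0 - target
    if a == 0.0:                       # noise-free case: the equation is linear
        return -c / b
    return (-b + math.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)

def energy_density(f, r0, n2, lam):
    """Assumed per-frequency cost: F*R - lam*log((F*R + 1)/(F*n2 + 1))."""
    r = r0 + n2
    return f * r - lam * math.log((f * r + 1.0) / (f * n2 + 1.0))

f_star = optimal_F(1.0, 0.25, 5.0)
print("F* =", f_star)
print("cost at F*:", energy_density(f_star, 1.0, 0.25, 5.0))
# Noise-free limit: F = (lam - 1)/r0, i.e. A proportional to 1/sqrt(r0).
print("noise-free F:", optimal_F(2.0, 0.0, 3.0))
```

Under this assumed form, the noise-free root scales as $1/R_0$, so $A \propto R_0^{-1/2}$, which matches the statement in the text that the low-noise solution is the square root of the prediction (whitening) solution.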
Figure 2: Optimal transfer function, $A[m]$, for nondivergent linear codes, with $D = 50$, $S/N = 2$, and $N_\delta = 0.025$. Open disks denote positive (excitatory) components of $A[m]$ while solid disks denote negative (inhibitory) components. The area of a disk is directly related to the logarithm of $A[m]$ at that location.

To get more insight into this solution, let us qualitatively examine its behavior as we change $S/N$ (for a detailed quantitative comparison with physiological data see Atick and Redlich 1990). For that, we find it more convenient to integrate out one of the dimensions (note this is not the same as solving the problem in one dimension). The resulting profile, corresponding to Figure 2, is shown in Figure 3b. In Figure 3, we have also plotted the result for two other values of $S/N$, namely for low and high noise regimes (Fig. 3a and c, respectively). These show that an interpolation is happening as $S/N$ changes between the two extremes. Analytically, we can also see this from equation 2.4 for any $R_0$ by taking the limit $N/S \to 0$, where $A(\omega)$ becomes equal to
One recognizes that this is the square root of the solution one gets by carrying out prediction on the inputs, a signal processing technique which we advocated for this regime of noise (see also Srinivasan et al. 1982)
Figure 3: (a-c) Optimal solution at three different values of $S/N$. These profiles have been produced from the two-dimensional solution by summing over one direction and normalizing the resulting profile such that the height of the central point is equal to the center height in the two-dimensional solution.

as a redundancy reduction technique in our previous paper (Atick and Redlich 1989). The spatial profiles for the square root solution are very similar to the prediction profiles, albeit a bit more spread out in the surround region. This type of profile reduces redundancy by reducing the amount of correlations present in the signal. In the other regime, where noise is very large compared to the signal, the solution behaves as $A(\omega) \sim (R_0/N^2)^{1/4}$ and has the same qualitative features as the smoothing solution (Atick and Redlich 1989), which in that limit is $A_{\text{smoothing}} = R_0/N^2$. Smoothing increases the signal to noise of the output and, in our earlier work, we argued that it is a good redundancy reducing technique in that noise regime. Moreover, in that work, we
argued that to maintain redundancy reduction at all signal-to-noise levels a process that interpolates between prediction and smoothing has to take place. We proposed a convolution of the prediction and the smoothing profiles as a possible interpolation (SPI-coding), which was shown to be better than either prediction or smoothing. In the present analysis, the optimal redundancy reducing transfer function is derived, and, although it is not identical to SPI-coding, it does have many of the same qualitative properties, such as the interpolation just mentioned and the overall center-surround organization. The profiles in Figures 2 and 3 are very similar to the kernels of ganglions measured in experiments on cats and monkeys. We have been able to fit these to the phenomenological difference-of-Gaussians kernel for ganglions (Enroth-Cugell and Robson 1966). The fits are very good with parameters that fall within the range that has been recorded. Another significant way in which the theory agrees with experiment is in the behavior of the kernels as $S/N$ is decreased. In the theoretical profiles, one finds that the size of the center increases, the surround spreads out until it disappears, and finally the overall scale of the profile diminishes as the noise becomes very large. In experiment, these changes have been noted as the luminosity of the incoming light (and hence the signal to noise) is decreased and the retina adapts to the lower intensity (see, for example, Enroth-Cugell and Robson 1966). This active process, in the language of the current theory, is an adjustment of the optimal redundancy reducing processing to the $S/N$ level. In closing, we should mention that many of the techniques used to derive optimal encoding for the spatial properties of visual signals can be directly applied to temporal properties.
In that case, for low noise the theory would lead to a reduction of temporal correlations, which would have the effect of taking the time derivative, while in the high noise case, the theory would lead to integration. Both types of processing play a significant role in visual perception, and it will be interesting to see how well they can be accounted for by the theory. Another issue that should be addressed is the question of how biological organisms evolved over time to have optimal redundancy reducing neural systems. In our previous paper, we discovered an anti-Hebbian unsupervised learning routine which converges to the prediction configuration and a Hebbian routine which converges to the smoothing profiles. We expect that there exist reasonably local learning algorithms that converge to the optimal solutions described here.
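For readers unfamiliar with the difference-of-Gaussians kernel mentioned in the comparison with ganglion cell data, the sketch below shows its general shape: a narrow positive center minus a broader, weaker surround. The parameter names and default values are hypothetical and are not fitted values from the paper or from Enroth-Cugell and Robson.

```python
import math

def dog_kernel(r, k_center=1.0, sigma_center=1.0,
               k_surround=0.3, sigma_surround=3.0):
    """Difference-of-Gaussians profile at radial distance r.

    All amplitudes and widths are illustrative assumptions.
    """
    center = k_center * math.exp(-(r / sigma_center) ** 2)
    surround = k_surround * math.exp(-(r / sigma_surround) ** 2)
    return center - surround

for r in (0.0, 1.0, 2.0, 3.0, 6.0):
    print(r, round(dog_kernel(r), 4))
```

With these values the profile is positive at the center, changes sign a little beyond one center width, and decays to zero at large distances, which is the center-surround organization described in the text.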
Acknowledgments Work supported by the National Science Foundation, Grant PHYS8620266.
References

Atick, J. J., and Redlich, A. N. 1989. Predicting the ganglion and simple cell receptive field organizations from information theory. Preprint no. IASSNS HEP-89/55 and NW-NN-89/1.
Atick, J. J., and Redlich, A. N. 1990. Quantitative tests of a theory of early visual processing: I. Spatial contrast sensitivity profiles. Preprint no. IASSNSHEP90/51.
Barlow, H. B. 1961. Possible principles underlying the transformation of sensory messages. In Sensory Communication, W. A. Rosenblith, ed. M.I.T. Press, Cambridge, MA.
Barlow, H. B. 1989. Unsupervised learning. Neural Comp. 1, 295-311.
Bodewig, E. 1956. Matrix Calculus. North-Holland, Amsterdam.
Enroth-Cugell, C., and Robson, J. G. 1966. The contrast sensitivity of retinal ganglion cells of the cat. J. Physiol. 187, 517-552.
Linsker, R. 1986. Self-organization in a perceptual network. Computer (March), 105-117.
Linsker, R. 1989. An application of the principle of maximum information preservation to linear systems. In Advances in Neural Information Processing Systems, D. S. Touretzky, ed., Vol. 1, pp. 186-194. Morgan Kaufmann, San Mateo.
Orban, G. A. 1984. Neuronal Operations in the Visual Cortex. Springer-Verlag, Berlin.
Shannon, C. E., and Weaver, W. 1949. The Mathematical Theory of Communication. The University of Illinois Press, Urbana.
Srinivasan, M. V., Laughlin, S. B., and Dubs, A. 1982. Predictive coding: A fresh view of inhibition in the retina. Proc. R. Soc. London Ser. B 216, 427-459.
Uttley, A. M. 1979. Information Transmission in the Nervous System. Academic Press, London.
Received 9 February 90; accepted 10 June 90.
Communicated by Richard Durbin
Derivation of Linear Hebbian Equations from a Nonlinear Hebbian Model of Synaptic Plasticity
Kenneth D. Miller
Department of Physiology, University of California, San Francisco, CA 94143-0444 USA
A linear Hebbian equation for synaptic plasticity is derived from a more complex, nonlinear model by considering the initial development of the difference between two equivalent excitatory projections. This provides a justification for the use of such a simple equation to model activity-dependent neural development and plasticity, and allows analysis of the biological origins of the terms in the equation. Connections to previously published models are discussed.
Recently, a number of authors (e.g., Linsker 1986; Miller et al. 1986, 1989) have studied linear equations modeling Hebbian or similar correlation-based mechanisms of synaptic plasticity, subject to nonlinear saturation conditions limiting the strengths of individual synapses to some bounded range. Such studies have intrinsic interest for understanding the dynamics of simple feedforward models. However, the biological rules for both neuronal activation and synaptic modification are likely to depend nonlinearly on neuronal activities and synaptic strengths in many ways. When are such simple equations likely to be useful as models of development and plasticity in biological systems?

One critical nonlinearity for biological modeling is rectification. Biologically, a synaptic strength cannot change its sign, because a given cell's synapses are either all excitatory or all inhibitory. Saturating or similar nonlinearities that bound the range of synaptic strengths may be ignored if one is concerned with the early development of a pattern of synaptic strengths, and if the initial distribution of synaptic strengths is well on the interior of the allowed region in weight space. However, if a model's outcome depends on a synaptic variable taking both positive and negative values, then the bound on synaptic strengths at zero must be considered.

Previous models make two proposals that avoid this rectification nonlinearity. One proposal is to study the difference between the strengths of two separate, initially equivalent excitatory projections innervating a single target structure (Miller et al. 1986, 1989; Miller 1989a). This difference in strengths is a synaptic variable that may take both positive and negative values. An alternative proposal is to study the sum of the strengths of two input projections, one excitatory, one inhibitory, that are statistically indistinguishable from one another in their connectivities and activities (Linsker 1986).

The proposal to study the difference between the strengths of two equivalent excitatory projections is motivated by study of the visual system of higher mammals. Examples in that system include the projections from the lateral geniculate nucleus to the visual cortex of inputs serving the left and right eyes (Miller et al. 1989) (reviewed in Miller and Stryker 1990) or of inputs with on-center and off-center receptive fields (Miller 1989a). Examples exist in many other systems (briefly reviewed in Miller 1990).

Assuming that the difference between the two projections is initially small, the early development of the difference can be described by equations linearized about the uniform condition of complete equality of the two projections. This can allow linear equations to be used to study aspects of early development in the presence of more general nonlinearities.

This paper presents the derivation of previously studied simple, linear Hebbian equations, beginning from a nonlinear Hebbian model in the presence of two equivalent excitatory input projections. The outcome of this derivation is contrasted with that resulting from equivalent excitatory and inhibitory projections. Applications to other models are then discussed.

Neural Computation 2, 321-333 (1990) © 1990 Massachusetts Institute of Technology

1 Assumptions
The derivation depends on the following assumptions:
A1. There are two modifiable input projections to a single output layer. The two input projections are equivalent in the following sense:
   • There is a topographic mapping that is identical for the two input layers: Each of the two input projections represents the same topographic coordinates, and the two project in an overlapping, continuous manner to the output layer.
   • The statistics of neuronal activation are identical within each projection (N.B. the correlations between the two projections may be quite different from those within each projection);

A2. Synaptic modification occurs via a Hebb rule in which the roles of output cell activity and that of input activity are mathematically separable;

A3. The activity of an output cell depends (nonlinearly) only on the summed input to the cell.
In addition, the following assumptions are made for simplicity. For the most part, they can be relaxed in a straightforward manner, at the cost of more complicated equations:
A4. The Hebb rule and the output activation rule are taken to be instantaneous, ignoring time delays. [Instantaneous rules follow from more complicated rules in the limit in which input patterns are sustained for long times compared to dynamic relaxation times. This limit appears likely to be applicable to visual cortex, where geniculate inputs typically fire in bursts sustained over many tens or hundreds of milliseconds (Creutzfeldt and Ito 1968)];

A5. The statistics of neuronal activation are time invariant;

A6. There are lateral interconnections in the output layer that are time invariant;

A7. The input and output layers are two-dimensional and spatially homogeneous;

A8. The topographic mapping from input to output layers is linear and isotropic.
2 Notation
We let Roman letters ($x, y, z, \ldots$) label topographic location in the output layer, and Greek letters ($\alpha, \beta, \gamma, \ldots$) label topographic location in each of the input layers. We use the labels 1 and 2 to refer to the two input projections. We define the following functions:

1. $o(x, t)$: activity (e.g., firing rate, or membrane potential) of output cell at location $x$ at time $t$;

2. $i^1(\alpha, t)$, $i^2(\alpha, t)$: activity of input of type 1 or 2, respectively, from location $\alpha$ at time $t$;

3. $A(x - \alpha)$: synaptic density or "arbor" function, describing connectivity from the input layer to the output layer. This tells the number of synapses from an input with topographic location $\alpha$ onto the output cell with topographic location $x$. This is assumed time independent and independent of projection type;

4. $s_k^1(x, \alpha, t)$, $s_k^2(x, \alpha, t)$: strength of the $k$th synapse of type 1 or 2, respectively, from the input at $\alpha$ to the output cell at $x$ at time $t$. There are $A(x - \alpha)$ such synapses of each type;

5. $S^1(x, \alpha, t)$, $S^2(x, \alpha, t)$: total synaptic strength at time $t$ from the input of type 1 or 2, respectively, at location $\alpha$, to the output cell at $x$. $S^1(x, \alpha, t) = \sum_k s_k^1(x, \alpha, t)$ [and similarly for $S^2(x, \alpha, t)$];
6. $B(x - y)$: intracortical connectivity function, describing total (time-invariant) synaptic strength from the output cell at $y$ to the output cell at $x$. $B$ depends only on $x - y$ by assumption A7 of spatial homogeneity.

3 Derivation of Linear Hebbian Equations from a Nonlinear Hebbian Rule

The Hebbian equation for the development of a single type 1 synapse $s_k^1$ from $\alpha$ to $x$ can, by assumptions A2 and A4, be written

$$\frac{d s_k^1(x, \alpha, t)}{dt} = \lambda\, h_o[o(x, t)]\, h_i[i^1(\alpha, t)] - \epsilon - \gamma\, s_k^1(x, \alpha, t), \quad \text{subject to } 0 \le s_k^1 \le s_{\max} \qquad (3.1)$$

We assume that $h_o$ is a differentiable function, but $h_o$ and $h_i$ are otherwise arbitrary functions incorporating nonlinearities in the plasticity rule. $\lambda$, $\epsilon$, and $\gamma$ are constants. Summing over all type 1 synapses from $\alpha$ to $x$ yields

$$\frac{d S^1(x, \alpha, t)}{dt} = \lambda A(x - \alpha)\, h_o[o(x, t)]\, h_i[i^1(\alpha, t)] - \epsilon A(x - \alpha) - \gamma\, S^1(x, \alpha, t), \quad \text{subject to } 0 \le S^1(x, \alpha, t) \le s_{\max} A(x - \alpha) \qquad (3.2)$$
(and similarly for $S^2$). There are small differences between equations 3.1 and 3.2 when some but not all synapses $s_k$ have reached saturation. We will be concerned with the early development of a pattern, before synapses saturate, and so ignore these differences. We will omit explicit mention of the saturation limits hereafter. Define the direct input to a cell as $\theta(x, t) \equiv \sum_\beta \{S^1(x, \beta, t) f_i[i^1(\beta, t)] + S^2(x, \beta, t) f_i[i^2(\beta, t)]\}$. The nonlinear activation rule is, by assumptions A3 and A4,

$$o(x, t) = g\Big\{\theta(x, t) + \sum_y B(x - y) f_o[o(y, t)]\Big\} \qquad (3.3)$$

$f_o$ and $g$ are assumed to be differentiable functions, but they and $f_i$ are otherwise arbitrary functions incorporating the nonlinearities in the activation rules. We make the following nontrivial assumption:

A9. For each input vector $\theta(t)$, equation 3.3 defines a unique output vector $o(t)$.
Biologically, this is the assumption that the inputs determine the state of the outputs. Mathematically, this can be motivated by studies of the
Hartline-Ratliff equation (Hadeler and Kuhn 1987).¹ With this assumption, $o(x, t)$ can be regarded as a function of the variables $\theta(y, t)$ for varying $y$. We now transform from the variables $S^1$ and $S^2$ to sum and difference variables. Define the following:

Note that $\theta(x, t) = \theta^S(x, t) + \theta^D(x, t)$. The Hebb rule for the difference, $S^D = S^1 - S^2$, is, from equation 3.2,

$$\frac{d S^D(x, \alpha, t)}{dt} = \lambda A(x - \alpha)\, h_o[o(x, t)]\, h^D(\alpha, t) - \gamma\, S^D(x, \alpha, t) \qquad (3.5)$$
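$S^D$ is a synaptic variable that can take on both positive and negative values, and whose initial values are near zero. We will develop a linear equation for $S^D$ by linearizing equation 3.5 about the uniform condition $S^D = 0$. We will accomplish this by expanding equation 3.5 about $\theta^D = 0$ to first order in $\theta^D$. Let $o^S(x, t)$ be the solution of

$$o^S(x, t) = g\Big\{\theta^S(x, t) + \sum_y B(x - y) f_o[o^S(y, t)]\Big\} \qquad (3.6)$$

Then, letting a prime signify the derivative of a function,

$$\frac{d S^D(x, \alpha, t)}{dt} = \lambda A(x - \alpha)\, h^D(\alpha, t) \Big\{ h_o[o^S(x, t)] + h_o'[o^S(x, t)] \sum_y \frac{\partial o(x, t)}{\partial \theta(y, t)}\Big|_{\theta^D = 0} \theta^D(y, t) \Big\} - \gamma\, S^D(x, \alpha, t) + O[(\theta^D)^2] \qquad (3.7)$$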
¹The Hartline-Ratliff equation is equation 3.3 for $g(x) = \{x,\ x \ge 0;\ 0,\ x < 0\}$ and $f_o(x) = x$. That equation has a unique output for every input, for symmetric $B$, iff $1 - B$ is positive definite; a more general condition for $B$ nonsymmetric can also be derived (Hadeler and Kuhn 1987).
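Letting $g'^S(x,t) = g'\{\theta^S(x,t) + \sum_y B(x-y)f_o[o^S(y,t)]\}$, the derivative of $o(x,t)$ is
$$\frac{\partial o(x,t)}{\partial\theta(y,t)}\Big|_{\theta^S} = g'^S(x,t)\,\big[(\mathbf{1} - \mathbf{B})^{-1}\big]_{xy} \tag{3.8}$$
where $\mathbf{1}$ is the identity matrix, $\mathbf{B}$ is the matrix with elements $B_{xy} = B(x-y)\,f_o'[o^S(y,t)]\,g'^S(y,t)$, and $[\ldots]_{xy}$ means the $xy$ element of the matrix in brackets. Letting
$$I(x,y,t) = \big[\mathbf{1} + \mathbf{B} + \mathbf{B}^2 + \ldots\big]_{xy} \tag{3.9}$$
$$M(x,t) = h_o'[o^S(x,t)]\,g'^S(x,t) \tag{3.10}$$
$$C^D(\alpha,\beta,t) = \tfrac{1}{2}\,h^D(\alpha,t)\,f^D(\beta,t) \tag{3.11}$$
we find that equation 3.7 becomes, to first order in $\theta^D$,
$$\frac{dS^D(x,\alpha,t)}{dt} = \lambda A(x-\alpha)\,h^D(\alpha,t)\,h_o[o^S(x,t)] - \gamma S^D(x,\alpha,t) + \lambda A(x-\alpha)\,M(x,t)\sum_{y,\beta} I(x,y,t)\,C^D(\alpha,\beta,t)\,S^D(y,\beta,t) \tag{3.12}$$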
This equation can be interpreted intuitively. The first term is the Hebbian term of equation 3.5 in which the output cell's activity has been replaced by the activity it would have if $\theta^D = 0$, that is, if $S^D = 0$. The last term is the Hebbian term with the output cell's activity replaced by the first order change in that activity due to the fact that $\theta^D \ne 0$. In this term, $M(x,t)$ measures the degree to which, near $\theta^D = 0$, the activity of the output cell at $x$ can be significantly modified, for purposes of the Hebb rule, by changes in the total input it receives. $I(x,y,t)$ measures the change in the total input to the cell at $x$ due to changes in the direct input to the cell at $y$. $C^D(\alpha,\beta,t)S^D(y,\beta,t)$ incorporates both the change in the direct input to the cell at $y$ due to the fact that $\theta^D \ne 0$, and the difference in the activities of the inputs from $\alpha$ that are being modified.
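The sensitivity of the output to the direct input, the output-side gain times a geometric series in the effective lateral matrix, can be verified numerically on a small network. The sketch below is an illustration only: it assumes $g = \tanh$, a hypothetical output nonlinearity `f_o`, a three-cell lateral matrix weak enough for the fixed-point iteration and the series to converge, and it compares the series prediction with a central finite difference. None of these specific choices comes from the text.

```python
import math


def f_o(o):
    """Assumed output nonlinearity (hypothetical choice)."""
    return o + 0.5 * o * o


def f_o_prime(o):
    return 1.0 + o


def solve_output(theta, B, iters=500):
    """Fixed-point iteration for o = g(theta + B f_o(o)) with g = tanh."""
    n = len(theta)
    o = [0.0] * n
    for _ in range(iters):
        o = [math.tanh(theta[x] + sum(B[x][y] * f_o(o[y]) for y in range(n)))
             for x in range(n)]
    return o


def matmul(P, Q):
    n = len(P)
    return [[sum(P[i][k] * Q[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def neumann_sum(Bt, terms=200):
    """Truncated series 1 + Bt + Bt^2 + ... (converges for weak coupling)."""
    n = len(Bt)
    total = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    power = [row[:] for row in total]
    for _ in range(terms):
        power = matmul(power, Bt)
        total = [[total[i][j] + power[i][j] for j in range(n)] for i in range(n)]
    return total


n = 3
B = [[0.0 if x == y else 0.15 for y in range(n)] for x in range(n)]  # weak lateral coupling
theta = [0.3, -0.2, 0.5]                                             # hypothetical direct inputs

o = solve_output(theta, B)
g_prime = [1.0 - v * v for v in o]   # tanh'(u) = 1 - tanh(u)^2, evaluated at the fixed point
Bt = [[B[x][y] * f_o_prime(o[y]) * g_prime[y] for y in range(n)] for x in range(n)]
series = neumann_sum(Bt)
predicted = [[g_prime[x] * series[x][y] for y in range(n)] for x in range(n)]

# Central finite differences of the fixed point with respect to each theta(y).
h = 1e-5
max_diff = 0.0
for y in range(n):
    up, dn = theta[:], theta[:]
    up[y] += h
    dn[y] -= h
    o_up, o_dn = solve_output(up, B), solve_output(dn, B)
    for x in range(n):
        numeric = (o_up[x] - o_dn[x]) / (2.0 * h)
        max_diff = max(max_diff, abs(numeric - predicted[x][y]))
```

If the lateral coupling were strong enough that the series diverged, this check would not apply. That is the regime in which the uniqueness assumption A9 would itself need to be examined.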
4 Averaging
Given some statistical distribution of input patterns $i(\alpha,t)$, equation 3.12 is a stochastic differential equation. To transform it to a deterministic equation, we average it over input activity patterns. The result is an equation for the mean value $\langle S^D\rangle$, averaged over input patterns. The right-hand side of the equation consists of an infinite series of terms, corresponding to the various cumulants of the stochastic operators of equation 3.12 (Keller 1977; Miller 1989b). However, when $\lambda$ and $\gamma$ are sufficiently small that $S^D$ can be considered constant over a period in which all input activity patterns are sampled, only the first term is significant. We restrict attention to that term. After averaging, the first term on the right side of equation 3.12 yields zero, by equality of the two input projections. The lowest order term resulting from averaging of the last term is
$$\lambda A(x-\alpha)\sum_{y,\beta}\big\langle M(x,t)\,I(x,y,t)\,C^D(\alpha,\beta,t)\big\rangle\,S^D(y,\beta,t)$$
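where we retain the notation $S^D$ for $\langle S^D\rangle$. We now assume:
A10
We can approximate $\langle M(x)I(x,y)C^D(\alpha,\beta)\rangle$ by $\langle M(x)I(x,y)\rangle\langle C^D(\alpha,\beta)\rangle$.
Assumption A10 will be true if the sum of the two eyes' inputs is statistically independent of the difference between the two eyes' inputs. By equivalence of the two input projections the sum and difference are independent at the level of two-point interactions: $\langle S^S S^D\rangle = \langle S^1S^1\rangle - \langle S^2S^2\rangle = 0 = \langle S^S\rangle\langle S^D\rangle$. By assumption A7 of spatial homogeneity, $\langle M(x)I(x,y)\rangle$ can depend only on $x - y$, while $\langle C^D(\alpha,\beta)\rangle$ can depend only on $\alpha - \beta$. With these assumptions, then, the linearized version of this nonlinear model becomes
$$\frac{dS^D(x,\alpha,t)}{dt} = \lambda A(x-\alpha)\sum_{y,\beta} I(x-y)\,C^D(\alpha-\beta)\,S^D(y,\beta,t) - \gamma S^D(x,\alpha,t) \tag{4.1}$$
where
$$I(x-y) = \big\langle M(x,t)\,I(x,y,t)\big\rangle \tag{4.2}$$
and
$$C^D(\alpha-\beta) = \big\langle C^D(\alpha,\beta,t)\big\rangle \tag{4.3}$$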
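A linear development equation of this kind, with an arbor function, a cortical interaction function and an input correlation function, can be integrated directly on a small lattice. The sketch below does so on a one-dimensional ring with forward Euler steps. It is a toy illustration: the ring size, the step-shaped arbor, the Gaussian forms chosen for `I` and `C`, and all constants are assumptions made here and are not taken from the paper.

```python
import math

N = 8  # hypothetical number of positions in both the input and output rings


def ring_dist(a, b):
    d = abs(a - b) % N
    return min(d, N - d)


def gauss(d, sigma):
    return math.exp(-d * d / (2.0 * sigma * sigma))


def A(d):
    """Assumed arbor function: connections exist only within distance 2."""
    return 1.0 if d <= 2 else 0.0


def I(d):
    """Assumed cortical interaction: short-range positive, longer-range negative."""
    return gauss(d, 1.0) - 0.5 * gauss(d, 3.0)


def C(d):
    """Assumed correlation function of the input difference."""
    return gauss(d, 1.5)


def rhs(S, lam, gam):
    """Right-hand side: lam * A * sum_{y,b} I * C * S  -  gam * S."""
    out = [[0.0] * N for _ in range(N)]
    for x in range(N):
        for a in range(N):
            drive = 0.0
            for y in range(N):
                for b in range(N):
                    drive += I(ring_dist(x, y)) * C(ring_dist(a, b)) * S[y][b]
            out[x][a] = lam * A(ring_dist(x, a)) * drive - gam * S[x][a]
    return out


def euler_step(S, lam, gam, dt):
    dS = rhs(S, lam, gam)
    return [[S[x][a] + dt * dS[x][a] for a in range(N)] for x in range(N)]


# Small, arbitrary initial difference pattern S[x][alpha].
S0 = [[0.01 * math.sin(1.3 * x + 0.7 * a) for a in range(N)] for x in range(N)]
S1 = euler_step(S0, lam=0.05, gam=0.01, dt=0.1)
```

Because the right-hand side is linear in `S`, doubling the initial pattern doubles its rate of change, and the uniform state `S = 0` is a fixed point. Which patterns grow fastest depends on the particular `A`, `I` and `C` chosen.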
Note that the nonlinear functions referring to the output cell, $h_o$, $f_o$, and $g$, enter into equation 4.1 only in terms of their derivatives. This reflects the fact that the base level of output activity, $o^S$, makes no contribution to the development of the difference $S^D$ because the first term of equation 3.12 averages to 0. Only the alterations in output activity induced by $\theta^D$ contribute to the development of $S^D$. We have not yet achieved a linear equation for development. $I(x-y)$ depends on $S^S$ through the derivatives of $h_o$, $f_o$, and $g$. Because the equation for $S^S$ remains nonlinear, equation 4.1 is actually part of a coupled nonlinear system. Intuitively, the sum $S^S$ is primarily responsible for the activation of output cells when $S^D$ is small. $S^S$ therefore serves to "gate" the transmission of influence across the output layer: the cells at $x$ and at $y$ must both be activated within their dynamic range, so that small changes in their inputs cause changes in their responses or in their contribution to Hebbian plasticity, in order for $I(x-y)$ to be nonzero. To render the equation linear, we must assume
A11
The shape of $I(x-y)$ does not vary significantly during the early, linear development of $S^D$.
Changes in the amplitude of $I(x-y)$ will alter only the speed of development, not its outcome, and can be ignored. Assumption A11 can be loosely motivated by noting that $S^S$ is approximately spatially uniform, so that $B(x-y)$ should be the primary source of spatial structure in $I(x-y)$,² and that cortical development may act to keep cortical cells operating within their dynamic range.
5 Comparison to the Sum of an Excitatory and an Inhibitory Projection
²The correlation structure of the summed inputs can also contribute to $I(x-y)$, since cortical cells with separation $x-y$ must be coactivated for $I(x-y)$ to be nonzero. Arguments can be made that the relevant lengths in $I(x-y)$ appear smaller than an arbor diameter (e.g., see Miller 1990; Miller and Stryker 1990), and thus are on a scale over which cortical cells receive coactivated inputs regardless of input correlation structure.
An alternative proposal to that developed here is to study the sum of the strengths of two indistinguishable input projections, one excitatory and one inhibitory (Linsker 1986). This case is mathematically distinct from the sum of two equivalent excitatory projections, because the Hebb rule does not change sign for the inhibitory population relative to the excitatory population. That is, in response to correlated activity of the pre- and postsynaptic cells, inhibitory synapses become weaker, not stronger, by a Hebbian rule. To understand the significance of this distinction, let $S^2$ now represent an inhibitory projection, so that $S^2 \le 0$. Then the variable that is initially small, and in which we expand in order to linearize, is the synaptic sum $S^S$, rather than the difference $S^D$. Define $o^D$ analogously to the definition of $o^S$ in equation 3.6, with $\theta^D$ in place of $\theta^S$. Let $h^S(\alpha,t) \equiv h_i[i^1(\alpha,t)] + h_i[i^2(\alpha,t)]$. Then one finds in place of equation 3.12
$$\frac{dS^S(x,\alpha,t)}{dt} = \lambda A(x-\alpha)\,h_o[o^D(x,t)]\,h^S(\alpha,t) - 2\epsilon A(x-\alpha) + \lambda A(x-\alpha)\,M^S(x,t)\sum_{y,\beta} I^S(x,y,t)\,C^S(\alpha,\beta,t)\,S^S(y,\beta,t) - \gamma S^S(x,\alpha,t) \tag{5.1}$$
where $C^S = \tfrac{1}{2}h^S f^S$, and $M^S$ and $I^S$ are defined like $M$ and $I$ except that derivatives are taken at $\theta^D$ and $o^D$ rather than at $\theta^S$ and $o^S$. Unlike equation 3.12, the first term of equation 5.1 does not disappear after averaging. This means that the development of $S^S$ depends upon a Hebbian coupling between the summed input activities, and the output cell's activity in response to $S^D$ (the activity the cell would have if $S^S = 0$). Thus, direct Hebbian couplings to both $S^D$ and $S^S$ drive the initial development of $S^S$, rendering it difficult to describe the dynamics by a simple linear equation like equation 4.1. In Linsker (1986), two assumptions were made that together lead to the disappearance of this first term. First, the output functions $h_o$, $f_o$, and $g$ were taken to be linear. This causes the first term to be proportional to $C^D$. To present the second assumption, we define correlation functions $C^{11}$, $C^{12}$, $C^{21}$, $C^{22}$ among and between the two input projections by $C^{jk}(\alpha-\beta) = \langle h_i[i^j(\alpha,t)]\,f_i[i^k(\beta,t)]\rangle$. By equivalence of the two projections, $C^{11} = C^{22}$ and $C^{12} = C^{21}$. Then $C^D = C^{11} - C^{12}$. The second assumption was that correlations between the two projections are identical to those within the two projections; that is, $C^{12} = C^{11}$. This means that $C^D = 0$, and so the first term disappears. This second assumption more generally ensures that $S^D$ does not change in time, prior to synaptic saturation. Equation 5.1 also differs from equation 3.12 in implicitly containing two additional parameters that Linsker named $k_1$ and $k_2$. $k_1$ is the decay parameter $\epsilon$. The parameter $k_2$ arises as follows. One can reexpress the "correlation functions" $C^{jk}$ in terms of "covariance functions" $Q^{jk}$: $C^{jk} = Q^{jk} + k_2$, where
$$Q^{jk}(\alpha-\beta) = \Big\langle \big(h_i[i^j(\alpha,t)] - \langle h_i[i^j(\alpha,t)]\rangle\big)\big(f_i[i^k(\beta,t)] - \langle f_i[i^k(\beta,t)]\rangle\big)\Big\rangle$$
and
$$k_2 = \langle h_i[i^j(\alpha,t)]\rangle\,\langle f_i[i^k(\beta,t)]\rangle$$
$k_2$ is independent of the choice of $j$ and $k$. The $Q$s have the advantage that $\lim_{(\alpha-\beta)\to\infty} Q^{jk}(\alpha-\beta) = 0$; if $f_i$ and $h_i$ are linear, the $Q$s are true covariance functions. The correlation function relevant to the sum of an inhibitory and an excitatory projection is $C^S = C^{11} + C^{12} = Q^{11} + Q^{12} + 2k_2$.
In contrast, the correlation function relevant to the difference between two excitatory projections is $C^D = C^{11} - C^{12} = Q^{11} - Q^{12}$, which has no $k_2$ dependence. Thus, the parameters $k_1$ and $k_2$ do not arise in considering the difference between two excitatory input projections, because they are identical for each input projection and thus disappear from the equation for the difference; whereas these parameters do arise in considering the sum of an excitatory and an inhibitory projection. In MacKay and Miller (1990), it was shown that these parameters can significantly alter the dynamics, and play crucial roles in many of the results of Linsker (1986). In summary, the proposal to study equivalent excitatory and inhibitory projections does not robustly yield a linear equation in the presence of nonlinearities in the output functions $h_o$, $f_o$, and $g$. Even in the absence of such additional nonlinearities, it can lead to different dynamic outcomes than the proposal studied here. It also is biologically problematic. It would not apply straightforwardly to such feedforward projections as the retinogeniculate and geniculocortical projections in the mammalian visual system, which are exclusively excitatory. Where both inhibitory and excitatory populations do exist, the two are not likely to be equivalent. For example, inhibitory neurons are often interneurons that, when active, inhibit nearby excitatory neurons, potentially rendering the three correlation structures $C^{11}$, $C^{12}$, and $C^{22}$ quite distinct; connectivity of such interneurons is also distinct from that of nearby excitatory cells (Toyama et al. 1981; Singer 1977). Similarly, while there is extensive evidence that excitatory synapses onto excitatory cells may be modified in a Hebbian manner (Nicoll et al. 1988), current evidence suggests that there may be little modification of inhibitory synapses, or of excitatory synapses onto the aspinous inhibitory interneurons, under the same stimulus paradigms (Abraham et al. 1987; Griffith et al. 1986; Singer 1977).
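The role of the constant $k_2$ can be made concrete with a small numerical example. The sketch below estimates correlation and covariance quantities from a handful of made-up activity samples at a single pair of input sites and checks that the constant offset cancels in the difference but is doubled in the sum. The sample values, the nonlinearities `h_i` and `f_i`, and the helper names are all hypothetical. The second eye is built as a shifted copy of the first, so that the two projections have identical one-point statistics, as the equivalence assumption requires.

```python
def mean(values):
    return sum(values) / len(values)


def h_i(i):
    """Assumed presynaptic plasticity nonlinearity (hypothetical)."""
    return i * i


def f_i(i):
    """Assumed presynaptic activation nonlinearity (hypothetical)."""
    return i


# Made-up activity samples over successive input patterns.
eye1 = [0.1, 0.9, 0.4, 0.7, 0.2, 0.6]
eye2 = eye1[1:] + eye1[:1]          # same values, different pairing
eyes = {1: eye1, 2: eye2}


def corr(j, k):
    """Sample estimate of the correlation <h_i[i^j] f_i[i^k]>."""
    return mean([h_i(a) * f_i(b) for a, b in zip(eyes[j], eyes[k])])


def cov(j, k):
    """Sample estimate of the covariance of h_i[i^j] and f_i[i^k]."""
    mh = mean([h_i(a) for a in eyes[j]])
    mf = mean([f_i(b) for b in eyes[k]])
    return mean([(h_i(a) - mh) * (f_i(b) - mf) for a, b in zip(eyes[j], eyes[k])])


# Product of the means; the same for every (j, k) because the eyes are equivalent.
k2 = mean([h_i(a) for a in eye1]) * mean([f_i(b) for b in eye1])

C_D = corr(1, 1) - corr(1, 2)   # difference of two excitatory projections
C_S = corr(1, 1) + corr(1, 2)   # sum of an excitatory and an inhibitory projection
Q_D = cov(1, 1) - cov(1, 2)
Q_S = cov(1, 1) + cov(1, 2)
```

Here `C_D` equals `Q_D` up to rounding, while `C_S` exceeds `Q_S` by `2 * k2`, which mirrors the statement that $k_2$ enters only the sum equation.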
6 Connections to Previous Models
Equation 4.1 is that studied in Miller et al. (1986, 1989) and Miller (1989a). It is also formally equivalent to that studied in Linsker (1986) except for the absence of the two parameters $k_1$ and $k_2$.³ The current approach allows the analysis of other previous models. For example, in the model of Willshaw and von der Malsburg (1976), $g$ was taken to be a linear threshold function [$g(x) = x - \delta$ for $x > \delta$, where $\delta$ is a constant threshold; $g(x) = 0$ otherwise]; this can be approximated by a differentiable function. The functions $h_o$ and $f_o$ were taken to be the identity, while $h_i$ and $f_i$ were taken to be step functions: 1 if the input was active, 0 if it was not. A time-dependent activation rule was used, but input activations were always sustained until a steady state was reached so that this rule is equivalent to equation 3.3. These rules were applied only to a single input projection, but the present analysis allows examination of the case of two input projections. From equation 4.2, it can be seen that choosing $g$ to be a linear threshold function has two intuitively obvious effects: (1) on the average, patterns for which $\theta^S$ would fail to bring the output cell at $x$ above threshold do not cause any modification of $S^D$ onto that cell; (2) such patterns also make no average contribution to $I(y-x)$ for all $y$, that is, if the cell at $x$ is not above threshold it cannot influence plasticity on the cell at $y$. More generally, given an ensemble of input patterns and the initial distribution of $S^S$, the functions $I(x-y)$ and $C^D(\alpha-\beta)$ could be calculated explicitly from equations 4.2 and 4.3, respectively. Similarly, Hopfield (1984) proposed a neuronal activation rule in which $f_i$ and $f_o$ are taken to be sigmoidal functions and $g$ is the identity, while many current models (e.g., Rumelhart et al. 1986) take $f_i$ and $f_o$ to be the identity, but take $g$ to be sigmoidal. Again, such activation rules can be analyzed within the current framework.
³Also, lateral interactions in the output layer were not introduced until the final layer in Linsker (1986). They were then introduced perturbatively, so that $I$ was approximated by $1 + B$. $B$ was referred to as $f$ in that paper.
7 Conclusions
It is intuitively appealing to think that activity-dependent neural development may be described in terms of functions A, I , and C that describe, respectively, connectivity from input to output layer (”arbors”), intralaminar connectivity within the output layer, and correlations in activity within the input layer. I have shown that formulation of linear equations in terms of such functions can be sensible for modeling aspects of early neural development in the presence of nonlinearities in the rules governing cortical activation and Hebbian plasticity. The functions I and C can be expressed in terms of the ensemble of input activities and the functions describing cortical activation and plasticity. This gives a more general relevance to results obtained elsewhere characterizing the outcome of development under equation 4.1 in terms of these functions (Miller et al. 1989; Miller 1989a; MacKay and Miller 1990). The current formulation is of course extremely simplified. Notable simplifications include the lack of plasticity in intralaminar connections in the output layer, the instantaneous nature of the equations, the assumption of spatial homogeneity, and, more generally, the lack of any attempt at biophysical realism. The derivation requires several additional assumptions whose validities are difficult to evaluate. The current effort provides a unified framework for analyzing a large class of previous models. It is encouraging that the resulting linear model is sufficient to explain many features of cortical development (Miller et al. 1989; Miller 1989a); it will be of interest, as more complex models are formulated,
to see the degree to which they force changes in the basic framework analyzed here.
Acknowledgments
I thank J. B. Keller for suggesting to me long ago that the ocular dominance problem should be approached by studying the early development of an ocular dominance pattern and linearizing about the nearly uniform initial condition. I thank M. P. Stryker for supporting this work, which was performed in his laboratory. I am supported by an N.E.I. Fellowship and by a Human Frontiers Science Program Grant to M. P. Stryker (T. Tsumoto, Coordinator). I thank M. P. Stryker, D. J. C. MacKay, and especially the action editor for helpful comments.
References
Abraham, W. C., Gustafsson, B., and Wigstrom, H. 1987. Long-term potentiation involves enhanced synaptic excitation relative to synaptic inhibition in guinea-pig hippocampus. J. Physiol. (London) 394, 367-380.
Creutzfeldt, O., and Ito, M. 1968. Functional synaptic organization of primary visual cortex neurones in the cat. Exp. Brain Res. 6, 324-352.
Griffith, W. H., Brown, T. H., and Johnston, D. 1986. Voltage-clamp analysis of synaptic inhibition during long-term potentiation in hippocampus. J. Neurophysiol. 55, 767-775.
Hadeler, K. P., and Kuhn, D. 1987. Stationary states of the Hartline-Ratliff model. Biol. Cybern. 56, 411-417.
Hopfield, J. J. 1984. Neurons with graded response have collective computational properties like those of two-state neurons. Proc. Natl. Acad. Sci. U.S.A. 81, 3088-3092.
Keller, J. B. 1977. Effective behavior of heterogeneous media. In Statistical Mechanics and Statistical Methods in Theory and Application, U. Landman, ed., pp. 631-644. Plenum Press, New York.
Linsker, R. 1986. From basic network principles to neural architecture (series). Proc. Natl. Acad. Sci. U.S.A. 83, 7508-7512, 8390-8394, 8779-8783.
MacKay, D. J. C., and Miller, K. D. 1990. Analysis of Linsker's simulations of Hebbian rules. Neural Comp. 2, 169-182.
Miller, K. D. 1989a. Orientation-selective cells can emerge from a Hebbian mechanism through interactions between ON- and OFF-center inputs. Soc. Neurosci. Abst. 15, 794.
Miller, K. D. 1989b. Correlation-based mechanisms in visual cortex: Theoretical and experimental studies. Ph.D. Thesis, Stanford University Medical School (University Microfilms, Ann Arbor).
Miller, K. D. 1990. Correlation-based mechanisms of neural development. In Neuroscience and Connectionist Theory, M. A. Gluck and D. E. Rumelhart, eds., pp. 267-353. Lawrence Erlbaum, Hillsdale, NJ.
Miller, K. D., Keller, J. B., and Stryker, M. P. 1986. Models for the formation of ocular dominance columns solved by linear stability analysis. Soc. Neurosci. Abst. 12, 1373.
Miller, K. D., Keller, J. B., and Stryker, M. P. 1989. Ocular dominance column development: Analysis and simulation. Science 245, 605-615.
Miller, K. D., and Stryker, M. P. 1990. Ocular dominance column formation: Mechanisms and models. In Connectionist Modeling and Brain Function: The Developing Interface, S. J. Hanson and C. R. Olson, eds., pp. 255-350. MIT Press/Bradford Books, Cambridge, MA.
Nicoll, R. A., Kauer, J. A., and Malenka, R. C. 1988. The current excitement in long-term potentiation. Neuron 1, 97-103.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. 1986. Learning representations by back-propagating errors. Nature 323, 533-536.
Singer, W. 1977. Effects of monocular deprivation on excitatory and inhibitory pathways in cat striate cortex. Exp. Brain Res. 30, 25-41.
Toyama, K., Kimura, M., and Tanaka, K. 1981. Organization of cat visual cortex as investigated by cross-correlation techniques. J. Neurophysiol. 46, 202-213.
Willshaw, D. J., and von der Malsburg, C. 1976. How patterned neural connections can be set up by self-organization. Proc. R. Soc. London Ser. B 194, 431-445.
Received 23 January 90; accepted 9 June 90.
Communicated by John Pearson
Spontaneous Development of Modularity in Simple Cortical Models
Alex Chernjavsky
Molecular and Cellular Physiology, Beckman Center, Stanford University, Stanford, CA 94305-5426 USA
John Moody*
Yale Computer Science, P.O. Box 2258 Yale Station, New Haven, CT 06520 USA
The existence of modular structures in the organization of nervous systems (e.g., cortical columns, patches of neostriatum, and olfactory glomeruli) is well known. However, the detailed dynamic mechanisms by which such structures develop remain a mystery. We propose a mechanism for the formation of modular structures that utilizes a combination of intrinsic network dynamics and Hebbian learning. Specifically, we show that under certain conditions, layered networks can support spontaneous localized activity patterns, which we call collective excitations, even in the absence of localized or spatially correlated afferent stimulation. These collective excitations can then induce the formation of modular structures in both the afferent and lateral connections via a Hebbian learning mechanism. The networks we consider are spatially homogeneous before learning, but the spontaneous emergence of localized collective excitations and the consequent development of modules in the connection patterns break translational symmetry. The essential conditions required to support collective excitations include internal units with sufficiently high gains and certain patterns of lateral connectivity. Our proposed mechanism is likely to play a role in understanding more complex (and more biologically realistic) systems.
*Please address correspondence to John Moody.
Neural Computation 2, 334-354 (1990) © 1990 Massachusetts Institute of Technology
1 Modularity in Nervous Systems
Modular organization exists throughout the nervous system on many different spatial scales. On the very small scale, synapses may be clustered into functional groups on dendrites. On the very large scale, the brain as a whole is composed of many anatomically and functionally distinct regions. At intermediate scales, those of networks and maps, the
cortex exhibits columnar structures (see Mountcastle 1957, 1978). Many modality-specific variations of intermediate scale modular organization are known. Examples in neocortex include orientation selective columns, ocular dominance columns, color sensitive blobs, somatosensory barrels, and the frontal eye fields of association cortex. Examples in other brain regions include the patches of neostriatum and the olfactory glomeruli. Other modular structures have been hypothesized, such as cell assemblies (see Braitenberg 1978), colonies in motor cortex, minicolumns, and neuronal groups. These modular structures can be divided into two distinct classes, functional modules and anatomical modules. Functional modules are structures whose presence is inferred strictly on the basis of physiology. Their existence is due most likely to modular organization in the patterns of synaptic efficacies; they do not exhibit corresponding patterns in the distribution of afferent or lateral connections or of cell bodies. (Patterns of synaptic efficacy are not anatomically detectable with present technology.) Functional modules include the orientation selective columns, the frontal eye fields of association cortex, and the color sensitive blobs (convincing anatomical correlates of the cytochrome oxidase activity patterns have not yet been found). Cell assemblies, the colonies of motor cortex, and neuronal groups (should these structures exist) are also candidates for functional modules. Anatomical modules are structures that, in addition to having a functional role, are also detectable anatomically by a clear organization of the afferent connections, the dendritic arbors, or the distribution of cell bodies and neuropil. Anatomical modules include the ocular dominance columns, somatosensory barrels, patches of neostriatum, and olfactory glomeruli. A complete biophysical picture of the development of modular structures is still unavailable.
However, two general statements about the developmental process can be made. First, the development of modules typically occurs by the differentiation of an initially homogeneous structure. This fact has been demonstrated convincingly for certain anatomical modular structures such as the somatosensory barrels (see Rice et al. 1985 and the review by Kaas et al. 1983), the ocular dominance columns (see Stryker 1986 for a review) and the patches of neostriatum (Goldman-Rakic 1981). We conjecture that functional modules also develop from an initially homogeneous structure. Second, it is well established that the form of afferent electrical activity is crucial for the development of certain modular structures in afferent connections. Removal or modification of afferent sensory stimulation causes abnormal development in several well-known systems. Examples include the somatosensory barrels (see Kaas et al. 1983) and the ocular dominance columns (see Stryker 1986). These findings, along with others
(see Kalil 1989), support the conjecture that an activity-dependent Hebb-like mechanism is operative in the developmental process. It should be noted that much of the existing evidence for the development and form of modular organization concerns only the afferent sensory projection patterns in sensory areas. The development and form of modular structures in the patterns of lateral intrinsic connections of cortex (both within and between modules) have received less attention from experimenters, but are of great interest. Previous attempts to model modular development (e.g., Pearson et al. 1987; Miller et al. 1989) have focused on the idea that spatially correlated sensory stimuli drive the formation of modular structures in sensory areas. While spatially correlated afferent stimulation will certainly encourage modular development, we believe that additional factors intrinsic to the network architecture and dynamics must be present to ensure that stable modules of a characteristic size will form. The importance of intrinsic factors in this development is emphasized when one observes two important facts. First, connections from thalamic (sensory) afferents account for only about 0.1% of all synapses of neocortex (Braitenberg 1978) and on the order of 10% of the synapses in layer IV. Local intrinsic connections and cortico-cortical connections account for the remaining 90% (layer IV) to 99.9% (overall) of neocortical synapses. Thus, the role of afferent sensory activity has most likely been overemphasized in previous models of activity-dependent modular development. Second, columnar structures throughout the neocortex all have a roughly uniform size of a few hundred microns. This uniformity occurs in spite of the fact that the various sensory modalities have naturally different length scales of correlated activity in their afferent stimulation patterns.
Thus, while correlated afferent stimulation probably influences the development of modules, the effects of internal network dynamics must contribute crucially to the observed developmental process. Specifically, the length scale on which modules form must be determined by factors intrinsic to the structure of cortex. These observations provide the motivation for our operating hypothesis below.
2 Operating Hypothesis and Modeling Approach
Our hypothesis in this work is that localized activity patterns in an initially homogeneous layer of cells induce the development of modular structure within the layer via an activity-dependent Hebb-like mechanism. We further hypothesize that the emergence of localized activity patterns on a specific spatial scale within a layer may be due to the properties of the intrinsic network dynamics alone and does not necessarily depend on the system receiving localized patterns of afferent activity.
Finally, we hypothesize that a single mechanism can drive the formation of modules in both afferent and lateral connections. Our investigation therefore has two parts. First, we show that localized patterns of activity on a preferred spatial scale, which we call collective excitations, spontaneously emerge in homogeneous networks with appropriate lateral connectivity and cellular response properties when driven with arbitrary stimulus (see Section 3 and Moody 1990). Second, we show that these collective excitations induce the formation of modular structures in both the afferent and lateral connectivity patterns when coupled to a Hebbian learning mechanism (see Section 4). [In Sections 5 and 6, we provide a discussion of our results and a comparison to related work.] The emergence of collective excitations at a preferred spatial scale in a homogeneous network breaks translational symmetry and is an example of spontaneous symmetry breaking. The Hebbian learning freezes the modular structure into the connection patterns. The time scale of collective excitations is short, while the Hebbian learning process occurs over a longer time scale. The spontaneous symmetry breaking mechanism is similar to that which drives pattern formation in reaction-diffusion systems (Turing 1952; Meinhardt 1982). Reaction-diffusion models have been applied to pattern formation in both biological and physical systems. One of the best known applications is to the development of zebra stripes and leopard spots (see Murray 1988). In the context of network dynamics, a model exhibiting spontaneous symmetry breaking has been proposed by Cowan (1982) to explain geometric visual hallucination patterns. Previous work by Pearson et al. (1987) demonstrated empirically that modularity emerged in simulations of an idealized but rather complex model of somatosensory cortex. The Pearson work was purely empirical and did not attempt to analyze theoretically why the modules developed. 
It provided an impetus, however, for our developing the theoretical results that we present here and in Moody (1990). As mentioned above, a major difference between our work and Pearson’s is that we do not assume spatially correlated afferent stimulation. Our work is thus intended to provide a possible theoretical foundation for the development of modularity as a direct result of network dynamics. Our proposed mechanism can model the formation of both functional and anatomical modules, although the interpretation of the simulations is different in the two cases (see section 4). We have limited our attention to simple models that we can analyze mathematically to identify the essential requirements for the formation of modules. To convince ourselves that both collective excitations and the consequent development of modules are somewhat universal, we have considered several different network models. All models exhibit collective excitations. We believe that more biologically realistic (and therefore more complicated) models will very likely exhibit similar behaviors.
The presentation here is an expanded version of that given in Chernjavsky and Moody (1990).
3 Network Dynamics: Collective Excitations
The analysis of network dynamics presented in this section is adapted from Moody (1990). Due to space limitations, we present here a detailed analysis of only the simplest models that exhibit collective excitations. All network models that we consider possess a single layer of receptor cells that provides input to a single internal layer of laterally connected cells. Two general classes of models are considered (see Fig. 1): additive models and shunting inhibition models. The additive models contain a single population of internal cells that makes both lateral excitatory and inhibitory connections. Both connection types are additive. The shunting inhibition models have two populations of cells in the internal layer: excitatory cells that make additive synaptic axonal contact with other cells and inhibitory cells that shunt the activities of excitatory cells. The additive models are further subdivided into models with linear internal units and models with nonlinear (particularly sigmoidal) internal units. The shunting inhibition models have linear excitatory units and sigmoidal inhibitory units. We have considered two variants of the shunting models, those with and without lateral excitatory connections. For simplicity and tractability, we have limited the use of nonlinear response functions to at most one cell population in all models. More elaborate network models could make greater use of nonlinearity, a greater variety of cell types (e.g., disinhibitory cells), or use more ornate connectivity patterns. However, such additional structure can only add richness to the network behavior and is not likely to remove the collective excitation phenomenon. 3.1 Dynamics for the Linear Additive Model. To elucidate the fundamental requirements for the spontaneous emergence of collective excitations, we now focus on the minimal model that exhibits the phenomenon, the linear additive model. This model is exactly solvable. 
As we will see, collective excitations will emerge provided that the appropriate lateral connectivity patterns are present and that the gains of the internal units are sufficiently high. These basic requirements will carry over to the nonlinear additive and shunting models. One kind of lateral connectivity pattern which supports the emergence of collective excitations is local excitation coupled with longer range inhibition. This kind of pattern is analogous to the local autocatalysis and longer range suppression found in reaction-diffusion models (Turing 1952; Meinhardt 1982).
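Why local excitation with longer range inhibition singles out a preferred spatial scale can be illustrated with a short spectral calculation. The sketch below builds a difference-of-Gaussians lateral kernel on a ring and finds the wavenumber at which its Fourier transform peaks. The kernel shape, its amplitudes and widths, and the ring size are hypothetical choices made for illustration, not the connectivity used in the paper. The interpretation also rests on an added assumption: for linear dynamics whose lateral term is a convolution with a translation-invariant kernel, Fourier modes are eigenmodes, so the mode at the spectral peak is the one most strongly amplified as the gain is raised.

```python
import math

N = 64  # hypothetical number of internal units on a ring


def lateral_weight(d, a_e=1.0, s_e=2.0, a_i=0.5, s_i=6.0):
    """Assumed kernel: narrow excitation minus broader, weaker inhibition."""
    excite = a_e * math.exp(-d * d / (2.0 * s_e * s_e))
    inhibit = a_i * math.exp(-d * d / (2.0 * s_i * s_i))
    return excite - inhibit


def ring_offset(d):
    d %= N
    return min(d, N - d)


kernel = [lateral_weight(ring_offset(d)) for d in range(N)]

# The kernel is real and symmetric, so its discrete Fourier transform is a
# real cosine sum; wavenumbers 0..N/2 cover all distinct spatial frequencies.
spectrum = [sum(kernel[d] * math.cos(2.0 * math.pi * k * d / N) for d in range(N))
            for k in range(N // 2 + 1)]

peak_k = max(range(len(spectrum)), key=lambda k: spectrum[k])
preferred_wavelength = N / peak_k if peak_k else float("inf")
```

With these particular parameters the total inhibition outweighs the total excitation, so the uniform mode (`k = 0`) is suppressed and the spectrum peaks at a nonzero wavenumber, which sets a characteristic width for localized activity. A purely excitatory kernel would instead peak at `k = 0` and favor no finite scale.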
Spontaneous Development of Modularity
[Figure 1. Panel (A): Receptor Units, Internal Units. Panel (B): Receptor Units, Excitatory Units, Inhibitory Units.]
Figure 1: Network models. (A) Additive model. (B) Shunting inhibition model. Only local segments of an extended network are shown. Both afferent and lateral connections in the extended networks are localized. Artwork after Pearson et al. (1987).
The network relaxation equations for the linear additive model are
[equations illegible in source]
The reader may prefer to consider $T[\sin(\,)]$ to be a square wave function.
Neural Computation 2, 405-408 (1990) © 1990 Massachusetts Institute of Technology
Mark J. Brady
3 Selection of Weights
It must be shown that a suitable $a(\,)$ can be chosen. This is equivalent to showing that we can choose a suitable weight vector $w$. In this model, it will be shown that only one condition can disqualify $w$ as a valid weight vector. That condition is as follows: if there exist two input vectors $v_i$ and $v_j$ such that $w \cdot v_i = w \cdot v_j$ yet $b_i \neq b_j$, then $w$ is not an acceptable weight vector. Such a $w$ is unacceptable because $w \cdot v_i = w \cdot v_j$ implies $a(v_i) = a(v_j)$, which implies $p(v_i) = p(v_j)$ or $b_i = b_j$. This contradicts $b_i \neq b_j$. To find an acceptable $w$, let us ensure that whenever $b_i \neq b_j$, $w \cdot v_i \neq w \cdot v_j$. Notice that

$$w \cdot v_i = w \cdot v_j$$

is the same as

$$w \cdot v_i - w \cdot v_j = 0$$

or

$$w \cdot (v_i - v_j) = 0$$

An $s_k = (v_i - v_j)$ can be defined for each case where $b_i \neq b_j$. $w$ must then be adjusted to some $w'$, so that $w' \cdot s_k \neq 0$. This can be done by setting

$$w' = w + \epsilon s_k$$

For $s_l$ satisfying $w \cdot s_l \neq 0$, $w$ should not be disturbed so as to result in the condition

$$(w + \epsilon s_k) \cdot s_l = 0 \qquad (3.1)$$

because if $w$ is acceptable with respect to $s_l$ we desire that it should remain so. Since there is at most one $\epsilon$ that satisfies equation 3.1 and there are finitely many $s_l$ for which $w$ must not be incorrectly disturbed, a suitable $\epsilon$ can be found. The algorithm proceeds by adjusting $w$ as described above for each $s_k$ where $w \cdot s_k = 0$.

4 Selection of $\lambda$
The function $a(\,)$ projects each input vector onto $w$. The length of this projection is in $\mathbb{R}$. $\lambda$ must be set so that $a(v_i)$ lies under a portion of the sine function that is greater than zero whenever $b_i = 1$, and above a portion of the sine function that is less than zero whenever $b_i = 0$.
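The weight adjustment of Section 3 and the sign condition just stated can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the author's implementation: the projection is taken to be the plain dot product $a(v) = w \cdot v$, the unit is taken to fire when $\sin(2\pi a(v)/\lambda) > 0$, the perturbation size is found by halving a trial value until it avoids the finitely many forbidden values, and the helper names `separate_projections` and `fits` are hypothetical. Exact rationals are used so that the tests $w \cdot s = 0$ are exact.

```python
from fractions import Fraction
from itertools import combinations
import math


def dot(x, y):
    """Plain dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(x, y))


def separate_projections(w, inputs, labels):
    """Perturb w until inputs with different labels have different projections.

    Hypothetical helper: w' = w + eps * s_k, with eps chosen so that no
    difference vector s_l that was already fine ends up with w' . s_l == 0.
    """
    w = [Fraction(x) for x in w]
    # One difference vector s_k = v_i - v_j per pair with b_i != b_j.
    diffs = [
        [Fraction(a) - Fraction(b) for a, b in zip(inputs[i], inputs[j])]
        for i, j in combinations(range(len(inputs)), 2)
        if labels[i] != labels[j]
    ]
    for s_k in diffs:
        if not any(s_k):
            raise ValueError("identical inputs carry different labels")
        if dot(w, s_k) != 0:
            continue  # already acceptable with respect to s_k
        # Each s_l with w . s_l != 0 rules out at most one value of eps.
        forbidden = set()
        for s_l in diffs:
            p, q = dot(w, s_l), dot(s_k, s_l)
            if p != 0 and q != 0:
                forbidden.add(-p / q)
        eps = Fraction(1)
        while eps in forbidden:  # terminates: the forbidden set is finite
            eps /= 2
        w = [wi + eps * si for wi, si in zip(w, s_k)]
    return w


def fits(w, inputs, labels, lam):
    """True if sin(2*pi*(w.v)/lam) is positive exactly for inputs labelled 1."""
    for v, b in zip(inputs, labels):
        s = math.sin(2 * math.pi * float(dot(w, v)) / lam)
        if s == 0 or (s > 0) != (b == 1):
            return False
    return True


inputs = [(1, 2), (2, 1), (3, 3)]
labels = [1, 0, 1]
w0 = [1, 1]  # the first two inputs project to the same value under w0
w1 = separate_projections(w0, inputs, labels)
print(w1, fits(w1, inputs, labels, lam=20 / 7))
```

In this toy case no wavelength can work for `w0`, because two inputs with different labels share a projection; after the adjustment a suitable wavelength exists.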
Learning Algorithm for Networks
Ideally, one would like a maximum of the sine curve to occur at $a(v_i)$ when $b_i = 1$ and a minimum to occur when $b_i = 0$. In other words, we would like to have

$$(2\pi/\lambda) \cdot a(v_i) = \pi/2 + n \cdot 2\pi$$

for some integer $n$ when $b_i = 1$ and

$$(2\pi/\lambda) \cdot a(v_i) = 3\pi/2 + n \cdot 2\pi$$

for some integer $n$ when $b_i = 0$. Starting with some arbitrary wavelength $\lambda_0$, one can adjust the wavelength for each projection $a(v_i)$. The $a(v_i)$ can first be reordered with $i > j$ implying $a(v_i) > a(v_j)$. $\lambda_i$ will be defined to be the wavelength after adjustment for $a(v_i)$. In general $\sin[a(v_i)] \neq 1$ or $-1$ as desired, for arbitrary $\lambda$. However, there are values $r$ such that $\sin[a(v_i) + r] = 1$ or $-1$ as desired. Define

$$r_i = \text{the smallest such } r$$

The number of wavelengths between $a(v_i) + r_i$ and zero will determine, in part, how much $\lambda_{i-1}$ will need to be adjusted to form $\lambda_i$. Define

$$\omega_i = [\text{the number of wavelengths between } a(v_i) + r_i \text{ and zero}] = [a(v_i) + r_i]/\lambda_{i-1}$$

Let $\Delta\lambda_{i-1}$ be the desired adjustment to $\lambda_{i-1}$ in forming $\lambda_i$:

$$\lambda_i = \lambda_{i-1} + \Delta\lambda_{i-1}$$

From the definitions given so far one can deduce that

$$\Delta\lambda_{i-1} = -r_i/\omega_i = -r_i \lambda_{i-1}/[a(v_i) + r_i]$$

After the wavelength is adjusted for $a(v_i)$, subsequent adjustments will disturb the wavelength that is ideal for $a(v_i)$. Training pair $i$ can afford to have the wavelength disturbed by at most $\lambda_i^2/4a(v_i)$, the change that shifts the phase at $a(v_i)$ by a quarter period. Therefore, the condition

$$\lambda_i^2/4a(v_i) > |\Delta\lambda_i + \Delta\lambda_{i+1} + \cdots + \Delta\lambda_{n-1}| = \sum_{j=i}^{n-1} r_{j+1}\lambda_j/[a(v_{j+1}) + r_{j+1}] \qquad (4.1)$$

must be satisfied. Since $r_{j+1} < \lambda_j$ and the $\lambda_j$ are decreasing, (4.1) will hold if $a(v_i)$
0. The algorithm is the following.³

1. Let $M_1 = M_0(\varepsilon\delta/[16 M_0(\varepsilon, \delta/4, n^2 + 1)], \delta/8, n + 1) = \tilde{O}(n^3/\varepsilon^2\delta)$. Call examples until either (a) we have $M_1$ positive examples or (b) we have called $M = \max(2M_1/\varepsilon, [\text{coefficient illegible}] \ln \delta^{-1})$ examples (whichever comes first). In case (b), halt and output hypothesis $g(x) = 0$ (i.e., classify all examples as negative).

² We restrict to hyperplanes through the origin.
³ For ease of notation, define $\tilde{O}(x)$ to mean $O(x) \times$ (terms of logarithmic size).
Eric B. Baum
2. Find a half space $h$ bounded by a hyperplane through the origin such that all the positive examples are in $h$. This can be done by Karmarkar's algorithm.

3. Let $S = \{[x^\mu, f(x^\mu)]\}$ be the set of the first $M_0(\varepsilon, \delta/4, n^2 + 1)$ examples we found which are in $h$. Find a $w \in \mathbb{R}^{n^2}$ s.t. $\sum_{ij} w_{ij} x_i x_j > 0$ for any positive example $x \in S$ and $\sum_{ij} w_{ij} x_i x_j < 0$ for any negative example in $S$. $w$ can be found, for example, by Karmarkar's algorithm.

4. Output hypothesis $g$: $x$ is positive if and only if $x \in h$ and $\sum_{ij} w_{ij} x_i x_j > 0$.

5. Halt.
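Steps 2 to 4 can be illustrated with a small Python sketch. It is a toy under explicit assumptions rather than the algorithm as analyzed: a perceptron loop stands in for Karmarkar's algorithm (it finds some separator when one exists but carries no polynomial-time guarantee), the sample sizes $M_1$ and $M_0$ of step 1 are ignored, and the names `perceptron`, `quadratic_features`, and `learn` are hypothetical.

```python
def dot(x, y):
    """Plain dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(x, y))


def perceptron(samples, max_epochs=1000):
    """Return w with sign * (w . z) > 0 for every (z, sign) in samples.

    Stand-in for the linear programming step; it only terminates
    successfully if such a w exists.
    """
    w = [0.0] * len(samples[0][0])
    for _ in range(max_epochs):
        mistakes = 0
        for z, sign in samples:
            if sign * dot(w, z) <= 0:
                w = [wi + sign * zi for wi, zi in zip(w, z)]
                mistakes += 1
        if mistakes == 0:
            return w
    raise RuntimeError("no separator found within max_epochs")


def quadratic_features(x):
    """All products x_i * x_j, so a linear rule on them is quadratic in x."""
    return [xi * xj for xi in x for xj in x]


def learn(examples):
    """examples: list of (x, label) pairs, label True for positive.

    Assumes at least one positive example (otherwise step 1b applies).
    """
    positives = [x for x, label in examples if label]
    # Step 2: half space h through the origin containing every positive example.
    w_h = perceptron([(x, 1) for x in positives])
    # Step 3: quadratic separator for the examples that fall inside h.
    in_h = [(x, label) for x, label in examples if dot(w_h, x) > 0]
    w_q = perceptron(
        [(quadratic_features(x), 1 if label else -1) for x, label in in_h]
    )

    # Step 4: positive iff x is in h and the quadratic form is positive.
    def g(x):
        return dot(w_h, x) > 0 and dot(w_q, quadratic_features(x)) > 0

    return g


# Toy target: the open first quadrant, an intersection of two half spaces.
examples = [
    ((1, 2), True), ((2, 1), True), ((1, 1), True),
    ((-1, 3), False), ((3, -1), False),
    ((-1, -2), False), ((-2, -1), False),
]
g = learn(examples)
print([g(x) for x, _ in examples])
```

The point of the construction shows up in the last two examples: they lie in the reflected region, where the quadratic form alone would be positive, and they are rejected because they fall outside the half space found in step 2.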
Figure 3: The input space is divided by target hyperplanes $w_1$ and $w_2$ and the hyperplane $\partial h$, which bounds half space $h$, that we draw in step 2 of our algorithm. Region $r = \{x : w_1 \cdot x \geq 0, w_2 \cdot x \geq 0\}$ is labeled, as is $\hat{r} = \{x : -x \in r\}$.
Polynomial Time Algorithm
Notice that the hypothesis function $g$ output by this net is equivalent to a feedforward threshold net with two hidden units. One of these is a linear threshold with output $\theta(\sum_i w_{hi} x_i)$, where $w_h$ is the normal to the boundary hyperplane of $h$. The other is a quadratic threshold with output $\theta(\sum_{ij} w_{ij} x_i x_j)$. Here $\theta$ is the Heaviside threshold function and $x_i$ is the $i$th component of $x$.

We now prove this algorithm is correct, provided $D$ is inversion symmetric, that is, for all $x$, $D(x) = D(-x)$.

Theorem 3. The class $F$ of intersections of two half spaces is i-learnable.
Proof. It is easy to see⁴ that with confidence $1 - \delta$, if step (1b) occurs, then at least a fraction $1 - \varepsilon$ of all examples are negative, and the hypothesis $g(x) = 0$ is $\varepsilon$-accurate.⁵ Likewise, if step (1b) does not occur, we have confidence $1 - \delta/4$ that at least a fraction $\varepsilon/4$ of examples are positive.⁶ If step (1b) does not occur, we find an open half space $h$ containing all the positive examples. Write $\bar{h}$ for the complement of $h$. By Theorem 1, with confidence $1 - \delta/8$, we conclude that the measure $D(\bar{h} \cap r) < \varepsilon\delta/16M_0(\varepsilon, \delta/4, n^2 + 1)$. Now if $x \in \hat{r}$, then $-x \in r$, and vice versa; and if $x \in h$, then $-x \in \bar{h}$. Thus if $x \in h \cap \hat{r}$ then $-x \in \bar{h} \cap r$. But $D(x) = D(-x)$ by hypothesis. Thus the measure $D(h \cap \hat{r}) \leq D(\bar{h} \cap r)$. Now we bound the conditional probability $\mathrm{Prob}(x \in \hat{r} \mid x \in h)$ that a random example in $h$ is also in $\hat{r}$:

$$\mathrm{Prob}(x \in \hat{r} \mid x \in h) = D(h \cap \hat{r})/D(h) \leq D(\bar{h} \cap r)/D(h)$$

Recall we saw above that with confidence $1 - 3\delta/8$, $D(h) > \varepsilon/4$ and $D(\bar{h} \cap r) < \varepsilon\delta/16M_0(\varepsilon, \delta/4, n^2 + 1)$. Thus we have

$$\mathrm{Prob}(x \in \hat{r} \mid x \in h) < \delta/4M_0(\varepsilon, \delta/4, n^2 + 1)$$

Now we take $M_0(\varepsilon, \delta/4, n^2 + 1)$ random examples in $h$. Since each of these $M_0$ examples has probability less than $\delta/4M_0$ of being in $\hat{r}$, we have confidence $1 - \delta/4$ that none of these examples is in $\hat{r}$. Thus we can in fact find a set of $w_{ij}$ as in the proof of Theorem 2, and by Theorem 1 (with confidence $1 - \delta/4$), this $w$ correctly classifies a fraction $1 - \varepsilon$ of examples. Now our hypothesis function is: $x$ is positive if and only if $x \in h$ and $\sum_{ij} w_{ij} x_i x_j > 0$. With confidence $1 - \delta/8$ this correctly classifies all but a fraction much less than $\varepsilon$ of points in $\bar{h}$, and with confidence $1 - 7\delta/8$ it correctly classifies all but a fraction $\varepsilon$ of points in $h$. Q.E.D.

⁴ To see this define $LE(p, m, r)$ as the probability of having at most $r$ successes in $m$ independent Bernoulli trials with probability of success $p$ and recall (Angluin and Valiant 1979), for $0 \leq \beta \leq 1$, that $LE[p, m, (1 - \beta)mp] \leq e^{-\beta^2 mp/2}$. Applying this formula with $m = M$, $p = \varepsilon$, and $\beta = 1 - M_1/M\varepsilon$ yields the claim.
⁵ Thus if (1b) occurs we have confidence $1 - \delta$ we are $\varepsilon$-accurate. The rest of the proof shows that if (1b) does not occur, we have confidence $1 - \delta$ we are $\varepsilon$-accurate. Thus we have confidence overall $1 - \delta$ our algorithm is $\varepsilon$-accurate.
⁶ To see this define $GE(p, m, r)$ as the probability of having at least $r$ successes in $m$ Bernoulli trials each with probability of success $p$ and recall (Angluin and Valiant 1979), for $0 \leq \beta \leq 1$, that $GE[p, m, (1 + \beta)mp] \leq e^{-\beta^2 mp/3}$. Applying this formula with $m = M$, $p = \varepsilon/4$, and $\beta = 1$ yields the claim.
4 Some Remarks on Robustness

The algorithm we gave in Section 2 required, for proof of convergence, that the distribution be inversion symmetric. It also used a large number $\tilde{O}(n^3/\varepsilon^2\delta)$ of examples, in spite of the fact that only $O(n/\varepsilon)$ examples are known to be necessary for learning the class $F$ and only $O(n^2/\varepsilon)$ examples are necessary for training hypothesis nets with one hidden linear threshold and one hidden quadratic threshold unit. Both of these restrictions arose so that we could be sure of obtaining a large set of examples that is not in $\hat{r}$, and that are therefore known to be quadratically separable.

We proposed using Karmarkar's algorithm for the linear programming steps. This allowed us to guarantee convergence in polynomial time. However, if we had an algorithm that was able to find a near optimal linear separator for a set of examples that is only approximately linearly separable, this might be much more effective in practice.⁷ For example, if in step 3 the set of examples we had was not exactly linearly separable, but instead contained by mistake either a few examples from $\hat{r}$ or a few noisy examples, we might still find an $\varepsilon$-accurate classifier by using a robust method for searching for a near optimal linear separator. This would, for example, allow us to use fewer examples or to tolerate some variation from inversion symmetry.

Note that our method will (provably) work for a somewhat broader range of cases than strictly inversion symmetric distributions. For example, if we can write $D = D_1 + D_2$, where $D_1(x) = D_1(-x) > 0$ everywhere and $D_2(x) \geq D_2(-x)$ for all $x \in r$, our proof goes through essentially without modification. ($D_2$ need not be positive definite.)
⁷ We will report elsewhere on a new, apparently very effective heuristic for finding near optimal linear separators.

5 Learning with Membership Queries

The only use we made of the inversion symmetry of $D$ in the proof of Theorem 3 was to obtain a large set of examples that we had confidence was not in $\hat{r}$. The problem of isolating a set of examples not in $\hat{r}$ becomes trivial if we allow membership queries, since an example $y$ is in $\hat{r}$ if and only if $-y$ is a positive example. We may thus easily modify the algorithm using queries by finding a set $S_{\hat{r}}$ containing examples in $\hat{r}$ and a set $S_{\neg\hat{r}}$ containing examples not in $\hat{r}$ (where membership in one or the other of these is easily established by query). We then find a half space $\bar{h}$ containing all examples in $S_{\hat{r}}$ and correctly classify all examples in $S_{\neg\hat{r}}$ by the method of Theorem 2. A straightforward analysis establishes (1) how many examples we need for sufficient confidence, relying on Theorem 1, and (2) that we can with confidence acquire these examples rapidly. The algorithm is as follows.

1. Call examples, and for all negative examples $y$, query whether $-y$ is negative or positive. Accumulate in a set $S_{\hat{r}}$ all examples found such that $y$ is negative but $-y$ is positive. Accumulate in a set $S_{\neg\hat{r}}$ all examples not in $S_{\hat{r}}$. Continue until either (a) we have $M_{\hat{r}} = M_0(\varepsilon/2, \delta/2, n + 1)$ examples in $S_{\hat{r}}$ and $M_{\neg\hat{r}} = M_0(\varepsilon/2, \delta/2, n^2 + 1)$ examples not in $S_{\hat{r}}$; or (b) one of the following happens. (b1) If in our first $M_{cut} = \max[4M_{\hat{r}}/\varepsilon, 16\varepsilon^{-1}\ln(2\delta^{-1}), 2M_{\neg\hat{r}}/\varepsilon]$ examples we have not obtained $M_{\hat{r}}$ examples in $S_{\hat{r}}$, proceed to step 5. (b2) If instead we do not find $M_{\neg\hat{r}}$ examples in $S_{\neg\hat{r}}$ in these first $M_{cut}$ calls, proceed to step 6.

2. Find a half space $\bar{h}$ bounded by a hyperplane through the origin such that all examples in $S_{\hat{r}}$ are in $\bar{h}$ and all positive examples are in the complementary half space $h$. This half space can be found, for example, using Karmarkar's algorithm.

3. Find a $w \in \mathbb{R}^{n^2}$ s.t. $\sum_{ij} w_{ij} x_i x_j > 0$ for any positive example $x \in S_{\neg\hat{r}}$ and $\sum_{ij} w_{ij} x_i x_j < 0$ for any negative example in $S_{\neg\hat{r}}$. $w$ can be found, for example, by Karmarkar's algorithm.

4. Output hypothesis $g$: $x$ is positive if and only if $x \in h$ and $\sum_{ij} w_{ij} x_i x_j > 0$; and halt.

5. We conclude with confidence $1 - \delta/2$ that fewer than a fraction $\varepsilon/2$ of examples lie in $\hat{r}$. Thus we simply follow step 3; output hypothesis $g$: $x$ is positive if and only if $\sum_{ij} w_{ij} x_i x_j > 0$; and halt.

6. We conclude with confidence $1 - \delta/2$ that fewer than a fraction $\varepsilon/2$ of examples are positive and thus simply output hypothesis $g(x) = 0$ (i.e., classify all examples as negative) and halt.
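The query step of the algorithm above (step 1) is simple enough to sketch directly. The following Python fragment is a minimal illustration, assuming a single callable `is_positive` that serves both as the example label and as the membership oracle; that callable, the function name `partition_by_queries`, and the toy target are all hypothetical, and the stopping rules (a), (b1), and (b2) are left out.

```python
def partition_by_queries(samples, is_positive):
    """Split samples into (in_r_hat, not_in_r_hat) using membership queries.

    A sample y is placed in r_hat exactly when y is negative and -y is
    positive; the query on -y is only issued for negative y.
    """
    in_r_hat, not_in_r_hat = [], []
    for y in samples:
        minus_y = tuple(-c for c in y)
        if not is_positive(y) and is_positive(minus_y):
            in_r_hat.append(y)
        else:
            not_in_r_hat.append(y)
    return in_r_hat, not_in_r_hat


def target(x):
    """Toy target concept: the open first quadrant."""
    return x[0] > 0 and x[1] > 0


samples = [(1, 1), (-1, -1), (-1, 1), (2, -3), (-2, -5)]
in_r_hat, rest = partition_by_queries(samples, target)
print(in_r_hat, rest)
```

Because the second set contains no point of the reflected region by construction, the quadratic separation in step 3 is guaranteed rather than merely likely.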
Theorem 4. The class $F$ of intersections of two half spaces is learnable from examples and membership queries.

Proof. It is easy to see with confidence $1 - \delta/2$, as in the proof of Theorem 3, that if 1(b1) occurs, then the probability of finding an example in $\hat{r}$ is less than $\varepsilon/2$. Thus we are allowed to neglect all examples in $\hat{r}$, provided that we correctly classify a fraction $1 - \varepsilon/2$ of all other examples. We then go to step 5, and find a quadratic classifier for the examples we have collected that are not in $\hat{r}$, of which there are more than $M_0(\varepsilon/2, \delta/2, n^2 + 1)$. By Theorem 1, as in the proof of Theorem 2, this gives us with confidence $1 - \delta/2$ a classifier that is $\varepsilon/2$ accurate. Thus overall if step (b1) is realized, we have confidence at least $1 - \delta$ of producing an $\varepsilon$-accurate hypothesis. Likewise, if (b2) occurs, then with confidence $1 - \delta/2$ at most a fraction $\varepsilon/2$ of examples are not in $\hat{r}$, and thus at most a fraction $\varepsilon/2$ of examples are positive, and we are justified in hypothesizing all examples are negative, which we do in step 6.
Assume now we reach step 2. The half space found in step 2 contains, with confidence $1 - \delta/2$, all but a possible $\varepsilon/2$ of the probability for examples in $\hat{r}$, by Theorem 1. The quadratic separator found in step 3 correctly classifies, by Theorem 1, with confidence $1 - \delta/2$, all but a fraction $\varepsilon/2$ of the measure not in $\hat{r}$. Note that there is no possibility that the set of collected examples not in $\hat{r}$ will fail to be quadratically separable, since it has no members in $\hat{r}$. This is in contrast to the situation in the proof of Theorem 3, where we had only statistical assurance of quadratic separability. Putting this all together, we correctly classify with confidence $1 - \delta/2$ all but a fraction $\varepsilon/2$ of the measure in $\hat{r}$, and with equal confidence all but the same fraction of measure not in $\hat{r}$. It is easy to see that, if we use some polynomial time algorithm such as Karmarkar's for the linear programming steps, this algorithm converges in polynomial time. Q.E.D.
6 Concluding Remarks
The task of learning an intersection of two half spaces, or of learning functions described by feedforward nets with two hidden units, is hard and interesting because a Credit Assignment problem apparently must be solved. We appear to have avoided this problem. We have solved the task provided the boundary planes go through the origin and provided either membership queries are allowed or one restricts to inversion symmetric distributions. This result suggests limited optimism that fast algorithms can be found for learning interesting classes of functions such as feedforward nets with a layer of hidden units. Hopefully the methods used here can be extended to interesting open questions such as: Can one learn an intersection of half spaces in the full distribution independent model, without membership queries?⁸ Can one find robust learning algorithms for larger nets, if one is willing to assume some restrictions on the distribution?⁹ We report elsewhere (Baum 1990b) on different algorithms that use queries to learn some larger nets as well as intersections of $k$ half spaces, in time polynomial in $n$ and $k$.

It is perhaps interesting that our algorithm produces as output function a net with two different types of hidden units: one linear threshold and the other quadratic threshold. It has frequently been remarked that a shortcoming of neural net models is that they typically involve only one type of neuron, whereas biological brains use many different types of neurons. In the current case we observe a synergy in which the linear neuron is used to pick out only one of the two parabolic lobes that are naturally associated with a quadratic neuron. For this reason we are able readily to approximate a region defined as the intersection of two half spaces. It seems reasonable to hope that neural nets using mixtures of linear threshold and more complex neurons can be useful in other contexts.

⁸ Extension of our results to half spaces bounded by inhomogeneous planes is a subcase, since the inversion symmetry we use is defined relative to a point of intersection of the planes.
⁹ Extension to intersections of more than two half spaces is an interesting subcase.
Acknowledgments

An earlier version of this paper appeared in the ACM sponsored "Proceedings of the Third Annual Workshop on Computational Learning Theory," August 6-8, 1990, Rochester, NY, Association for Computing Machinery, Inc.
References

Angluin, D. 1988. Queries and concept learning. Machine Learn. 2, 319-342.
Angluin, D., and Valiant, L. G. 1979. Fast probabilistic algorithms for Hamiltonian circuits and matchings. J. Comput. Syst. Sci. 18, 155-193.
Baum, E. B. 1989. A proposal for more powerful learning algorithms. Neural Comp. 1, 201-207.
Baum, E. B. 1990a. On learning a union of half spaces. J. Complex. 6, 67-101.
Baum, E. B. 1990b. Neural net algorithms that learn in polynomial time from examples and queries. IEEE Transactions on Neural Networks, in press.
Blum, A. 1989. On the computational complexity of training simple neural networks. MIT Tech. Rep. MIT/LCS/TR-445.
Blum, A., and Rivest, R. L. 1988. Training a 3-node neural network is NP-complete. In Advances in Neural Information Processing Systems I, D. S. Touretzky, ed., pp. 494-501. Morgan Kaufmann, San Mateo, CA.
Blumer, A., Ehrenfeucht, A., Haussler, D., and Warmuth, M. 1987. Learnability and the Vapnik-Chervonenkis Dimension. U.C.S.C. Tech. Rep. UCSC-CRL-8720, and J. ACM 36(4), 1989, pp. 929-965.
Karmarkar, N. 1984. A new polynomial time algorithm for linear programming. Combinatorica 4, 373-395.
Khachian, L. G. 1979. A polynomial time algorithm for linear programming. Dokl. Akad. Nauk. USSR 244(5), 1093-1096. Translated in Soviet Math. Dokl. 20, 191-194.
Le Cun, Y. 1985. A learning procedure for asymmetric threshold networks. Proc. Cognitiva 85, 599-604.
Minsky, M., and Papert, S. 1969. Perceptrons, an Introduction to Computational Geometry. MIT Press, Cambridge, MA.
Ridgeway, W. C. III 1962. An adaptive logic system with generalizing properties. Tech. Rep. 1556-1, Solid State Electronics Lab, Stanford University.
Rosenblatt, F. 1962. Principles of Neurodynamics. Spartan Books, New York.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. 1986. Learning internal representations by error propagation. In Parallel Distributed Processing, D. E. Rumelhart and J. L. McClelland, eds. MIT Press, Cambridge, MA.
Valiant, L. G. 1984. A theory of the learnable. Commun. ACM 27(11), 1134-1142.
Valiant, L. G., and Warmuth, M. 1989. Unpublished manuscript.
Wenocur, R. S., and Dudley, R. M. 1981. Some special Vapnik-Chervonenkis classes. Discrete Math. 33, 313-318.
Received 16 April 90; accepted 3 July 90.
Communicated by Scott Kirkpatrick and Bernardo Huberman

Phase Transitions in Connectionist Models Having Rapidly Varying Connection Strengths

James A. Reggia
Mark Edwards
Department of Computer Science, Institute for Advanced Computer Studies, University of Maryland, College Park, MD 20742 USA
A phase transition in a connectionist model refers to a qualitative change in the model’s behavior as parameters determining the spread of activation (gain, decay rate, etc.) pass through certain critical values. As connectionist methods have been increasingly adopted to model various problems in neuroscience, artificial intelligence, and cognitive science, there has been an increased need to understand and predict these phase transitions to assure meaningful model behavior. This paper extends previous results on phase transitions to encompass a class of connectionist models having rapidly varying connection strengths (”fast weights”). Phase transitions are predicted theoretically and then verified through a series of computer simulations. These results broaden the range of connectionist models for which phase transitions are identified and lay the foundation for future studies comparing models with rapidly varying and slowly varying connection strengths.
Neural Computation 2, 523-535 (1990) © 1990 Massachusetts Institute of Technology

1 Introduction

It has recently been demonstrated that connectionist models (neural models) with large networks can exhibit "phase transitions" analogous to those occurring in physical systems (e.g., Chover 1988; Huberman and Hogg 1987; Kryukov and Kirillov 1989; Kurten 1987; Shrager et al. 1987). In other words, as parameters governing the spread of activation (decay rate, gain, etc.) pass through certain predictable critical values, the behavior of a connectionist model can change qualitatively. These results are important because they focus attention on this phenomenon in interpreting model results and because they provide specific guidelines for selecting meaningful parameters in designing connectionist models in applications. For example, it is often desirable for network activation to remain bounded in amplitude within a bounded portion of the
network and to exert an influence for a finite amount of time. The issue thus arises in large models of how to characterize qualitatively the spatiotemporal patterns of network activation as a function of network parameters.

Previous results on phase transitions in connectionist models are limited in their applicability to networks whose weights are assumed to be essentially fixed during information processing ("static weights"). Although these previous results are often applicable to models where weights change very slowly ("slow weights," e.g., during learning), they cannot be applied directly in situations where connection strengths vary at a rate comparable to the rate at which node activation levels vary. This latter situation arises, for example, with competitive activation methods, a class of activation methods in which nodes actively compete for sources of network activation (Reggia 1985).¹ Competitive activation mechanisms have a number of useful properties, including the fact that they allow one to minimize the number of inhibitory (negatively weighted) links in connectionist networks. They have been used successfully in several recent applications of interest in AI and cognitive science, such as print-to-sound conversion (Reggia et al. 1988), abductive diagnostic hypothesis formation (Peng and Reggia 1989; Wald et al. 1989), and various constraint optimization tasks (Bourret et al. 1989; Peng et al. 1990; Whitfield et al. 1989). They currently are under investigation for potential neurobiological applications.

This paper analyzes the phase transitions that occur in a class of connectionist models with rapidly varying connection strengths ("fast weights"). This class of models includes but is far more general than competitive activation mechanisms. Phase transitions are identified and shown not only to be a function of the balance of excitation and inhibition/decay in a model, but also to depend in a predictable way on the specific form of the activation rule used (differential vs. difference equation). Computer simulations are described that demonstrate the theoretical predictions about phase transitions as well as other results.

2 General Formulation
A connectionist model consists of a set of nodes $N$ representing various potentially active states with weighted links between them. Let $a_i(t)$ be the activation level of node $i \in N$ at time $t$, and let the connection strength to node $i$ from node $j$ be given by $c_{ij}(t)$. The "output" to node $i$ from node $j$ at any time $t$ is given by $c_{ij}(t) \cdot a_j(t)$. This can be contrasted with the output signal $w_{ij} \cdot a_j(t)$ that has generally been used in previous connectionist models, where $w_{ij}$ is essentially a fixed weight (except for slow changes occurring during learning). Unlike previous connectionist models, the connection strength values $c_{ij}(t)$ used here may be very rapidly varying functions of time. In general, $c_{ij}(t) \neq c_{ji}(t)$. We assume that node fanout is substantial, that is, that each node connects to several other nodes.

¹ Competitive activation mechanisms usually have "resting weights" on connections that are fixed or slowly varying, but the actual effective connection weights/strengths of relevance are rapidly changing in response to node activation levels.

In many connectionist models, it is convenient to distinguish different types of interactions that can occur between nodes, and we do so here. Consider an arbitrary node $j \in N$. Then the set of all nodes $N$ can be divided into four disjoint subsets with respect to the output connections of node $j$:

1. $S_j = \{j\}$, that is, node $j$ receives a "self-connection." The connection strength $c_{jj}(t)$ will be restricted to being a constant $c_{jj}(t) = c_s$, which may be either positive or negative.

2. $P_j = \{k \in N \mid c_{kj}(t) > 0 \text{ for all } t\}$, that is, nodes receiving positive or excitatory connections from node $j$. The strengths of excitatory output connections from $j$ to nodes in $P_j$ may vary rapidly but are subject to the constraint

$$\sum_{k \in P_j} c_{kj}(t) = c_p \qquad (2.1)$$

where $c_p > 0$ is a constant.

3. $N_j = \{k \in N \mid c_{kj}(t) < 0 \text{ for all } t\}$, that is, nodes receiving negative or inhibitory connections from node $j$. The strengths of inhibitory output connections from $j$ to nodes in $N_j$ may vary but are subject to the constraint

$$\sum_{k \in N_j} c_{kj}(t) = c_n \qquad (2.2)$$

where $c_n < 0$ is a constant.

4. $O_j = \{k \in N \mid c_{kj} = 0 \text{ for all } t\}$, that is, all other nodes in $N$ that do not receive connections from node $j$.

The parameters $c_s$, $c_p$, and $c_n$ are network-wide constants that represent the gain on self, excitatory, and inhibitory connections, respectively. Let $e_i(t)$ represent the external input to node $i$ at time $t$, let $r_i$ be a constant bias at node $i$, and let the dynamic behavior of the network be characterized by the discrete-time recursion relation for activation of node $i$

$$a_i(t + \delta) = a_i(t) + \delta \Big[ e_i(t) + r_i + \sum_{j \in N} c_{ij}(t)\, a_j(t) \Big] \qquad (2.3)$$

where $0 < \delta \leq 1$ represents the fineness of time quantization. For sufficiently small values of $\delta$, equation 2.3 numerically approximates the first-order differential equation

$$\frac{d a_i}{dt} = e_i(t) + r_i + \sum_{j \in N} c_{ij}(t)\, a_j(t) \qquad (2.4)$$

for reasonably well-behaved $c_{ij}$ functions (Euler's method). On the other hand, if $\delta = 1$, for example, then equation 2.3 may in general behave qualitatively differently from equation 2.4. In this special situation, if we also restrict $c_{ij}(t)$ to be nonnegative and constant for all time, then equation 2.3 becomes the linear difference equation described in Huberman and Hogg (1987).² Thus, equation 2.3 directly generalizes the formulation in Huberman and Hogg (1987) to handle rapidly varying connection strengths, to explicitly distinguish excitatory and inhibitory connection strengths, and to encompass models where activation levels are represented as first-order differential equations of the form given in equation 2.4.

3 Phase Transitions
The phase transitions of connectionist models based on equation 2.3 can be characterized in terms of the parameters $c_s$, $c_p$, and $c_n$. Define the total network activation at time $t$ to be $A(t) = \sum_{i \in N} a_i(t)$, and let $c = c_s + c_p + c_n$. A value $c > 0$ indicates a model in which excitatory influences dominate in the sense that total excitatory gain exceeds total inhibitory gain. A value $c < 0$ indicates that inhibitory influences dominate. As in Huberman and Hogg (1987), we consider the case of constant external input values to individual nodes so that the total external input $E(t)$ is a constant $E$. Define $R = \sum_{i \in N} r_i$.

Theorem 1. For networks governed by equation 2.3 with constant external input $E$, total network activation is given by

$$A(t + \delta) = (1 + \delta c) A(t) + \delta (E + R) \qquad (3.1)$$

² To see this let $c_s = -\gamma$, $c_p = \alpha$, $c_n = 0$, $r_i = 0$, $\delta = 1$, and $c_{ij}(t) = \alpha R_{ij}$, where $\gamma$ and $\alpha$ are the "decay" and "gain" parameters and $R_{ij}$ is a constant connection strength in Huberman and Hogg (1987).
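Proof. By straightforward calculation from equation 2.3,

$$\begin{aligned}
A(t + \delta) &= \sum_{i \in N} a_i(t + \delta) \\
&= \sum_{i \in N} a_i(t) + \delta \Big[ \sum_{i \in N} e_i + \sum_{i \in N} r_i + \sum_{j \in N} a_j(t) \sum_{i \in N} c_{ij}(t) \Big] \\
&= A(t) + \delta \Big[ E + R + \sum_{j \in N} a_j(t)\,(c_s + c_p + c_n) \Big] \quad \text{by equations 2.1 and 2.2} \\
&= A(t) + \delta [E + R + cA(t)] \\
&= (1 + \delta c) A(t) + \delta (E + R) \qquad \square
\end{aligned}$$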
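Theorem 1 can be checked numerically with a short Python sketch. It rests on assumptions that should be read as such: the update rule is taken to be $a_i(t+\delta) = a_i(t) + \delta[e_i + r_i + \sum_j c_{ij}(t) a_j(t)]$, the connection strengths are redrawn at random at every step subject only to the column constraints of equations 2.1 and 2.2, and the network size, gains, and helper names (`make_groups`, `strengths`, `step`) are illustrative choices, not values from the paper.

```python
import random


def make_groups(n, rng):
    """For each node j, fix which other nodes it excites and which it inhibits."""
    groups = []
    for j in range(n):
        others = [i for i in range(n) if i != j]
        rng.shuffle(others)
        half = len(others) // 2
        groups.append((others[:half], others[half:]))
    return groups


def strengths(groups, c_s, c_p, c_n, rng):
    """Draw fresh strengths c[i][j] (to node i from node j).

    Each column j has self term c_s, excitatory entries summing to c_p,
    and inhibitory entries summing to c_n.
    """
    n = len(groups)
    c = [[0.0] * n for _ in range(n)]
    for j, (exc, inh) in enumerate(groups):
        c[j][j] = c_s
        for group, gain in ((exc, c_p), (inh, c_n)):
            raw = [rng.random() + 0.1 for _ in group]
            total = sum(raw)
            for i, x in zip(group, raw):
                c[i][j] = gain * x / total
    return c


def step(a, c, e, r, delta):
    """One update of the assumed recursion for every node."""
    n = len(a)
    return [
        a[i] + delta * (e[i] + r[i] + sum(c[i][j] * a[j] for j in range(n)))
        for i in range(n)
    ]


rng = random.Random(0)
n, delta = 6, 0.1
c_s, c_p, c_n = -1.0, 0.7, -0.2
c_total = c_s + c_p + c_n
e = [1.0] + [0.0] * (n - 1)   # constant external input to one node
r = [0.05] * n                # constant bias at every node
a = [rng.random() for _ in range(n)]
groups = make_groups(n, rng)

max_gap = 0.0
for _ in range(50):
    predicted = (1 + delta * c_total) * sum(a) + delta * (sum(e) + sum(r))
    a = step(a, strengths(groups, c_s, c_p, c_n, rng), e, r, delta)
    max_gap = max(max_gap, abs(sum(a) - predicted))
print(max_gap)
```

The total follows the scalar recursion to within rounding error even though the individual strengths change at every step, which is the content of the theorem: only the column sums matter for the total.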
Corollary 1. Connectionist models based on equation 2.3 have total network activation given by

$$A(t) = \frac{E + R}{c} \left\{ \left[ 1 + \frac{c}{E + R} A(0) \right] (1 + \delta c)^{t/\delta} - 1 \right\} \qquad (3.2)$$

where $A(0)$ is the initial total network activation. For very small values of $\delta$ such that spread of network activation is effectively governed by the differential equation 2.4, we have

$$A(t) = \frac{E + R}{c} \left\{ \left[ 1 + \frac{c}{E + R} A(0) \right] e^{ct} - 1 \right\} \qquad (3.3)$$

in the limit as $\delta \to 0$.
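For readers who want the intermediate step, the closed form follows from iterating the one-step relation $A(t+\delta) = (1+\delta c)A(t) + \delta(E+R)$ of Theorem 1. The derivation below is a sketch that assumes $c \neq 0$ and writes $q = 1 + \delta c$ and $n = t/\delta$ for brevity (both abbreviations are introduced here, not in the paper).

```latex
\begin{aligned}
A(n\delta) &= q^{n} A(0) + \delta (E+R) \sum_{k=0}^{n-1} q^{k}
            = q^{n} A(0) + \delta (E+R)\, \frac{q^{n} - 1}{q - 1} \\
           &= q^{n} A(0) + \frac{E+R}{c}\,\bigl(q^{n} - 1\bigr)
            \qquad \text{since } q - 1 = \delta c, \\
\lim_{\delta \to 0} (1 + \delta c)^{t/\delta} &= e^{ct}
            \quad\Longrightarrow\quad
            A(t) \to e^{ct} A(0) + \frac{E+R}{c}\,\bigl(e^{ct} - 1\bigr).
\end{aligned}
```

Factoring $(E+R)/c$ out of the last expression in each case gives the bracketed forms with the term $1 + cA(0)/(E+R)$.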
Corollary 2. Connectionist models based on equation 2.3 have total network activation that asymptotically approaches a fixed point $A^* = -(E + R)/c$ whenever $-2/\delta < c < 0$.

Proof. By equation 3.2, as $t$ increases the behavior of $A(t)$ is determined by $(1 + \delta c)^{t/\delta}$, an infinite sequence that converges as $t \to \infty$ when $|1 + \delta c| < 1$, or $-2/\delta < c < 0$. Letting $A(t + \delta) = A(t) = A^*$ in equation 3.1 gives the fixed point. $\square$

This last result indicates that when connectionist models using equation 2.3 converge, their total network activation approaches a fixed point whose value is independent of $\delta$, the fineness of time quantization. This does not, of course, imply that individual node values reach equilibrium nor that they even remain finite. For example, in networks with both excitatory and inhibitory connections it is possible that two nodes' activation levels could grow without bound, one toward $+\infty$ and one toward $-\infty$, in perfect balance so that the total network activation remains balanced.

The second corollary above indicates that systems based on equation 2.3 have two phase transitions given by $c = 0$ and $c = -2/\delta$. The first phase transition is independent of $\delta$ and indicates parameter values $c = 0$ where total excitatory and inhibitory gains are equal. This corresponds to a phase transition described in the fixed weight model of Huberman and Hogg (1987).³ However, the interpretation here is somewhat different because of the separation of excitatory ($c_p$) and inhibitory ($c_n$) influences, and the designation of $c_s$ as gain on a self-connection rather than decay. The corollary implies that inhibition must dominate ($c < 0$) for a connectionist model based on equation 2.3 to converge. If $c > 0$, that is, if excitation dominates, then the model will not converge on a fixed point: the "event horizon" (Huberman and Hogg 1987) will in general grow indefinitely in time and space.

The second phase transition, $c = -2/\delta$, indicates a limit on how much inhibition can dominate and still result in a convergent model. This phase transition was not encountered in Huberman and Hogg (1987) because of a priori assumptions concerning legal ranges of values for parameters (e.g., a "decay rate" $c_s$ where $-1 < c_s < 0$). These assumptions are not made here nor in many connectionist models [e.g., $c_s > 0$, i.e., self-excitation of nodes (Kohonen 1984), or $c_s < -1$, which makes no sense if $|c_s|$ is considered to be a "decay rate" but which makes perfect sense as gain on a self-connection (Reggia et al. 1988)].
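The two critical values can be seen directly by iterating the scalar recursion for the total activation, $A \leftarrow (1 + \delta c)A + \delta(E + R)$. The Python sketch below is illustrative only: the step size $\delta = 0.5$ (so the lower critical value is $c = -4$), the drive $E + R = 1$, the stopping thresholds, and the function name `run_total` are assumptions chosen for the demonstration.

```python
def run_total(c, delta, drive=1.0, a0=0.0, tol=1e-3, limit=1e3, max_steps=100000):
    """Iterate A <- (1 + delta*c) * A + delta*drive and report the outcome.

    Returns ("converged", A) once successive totals differ by less than tol,
    ("diverged", A) once |A| exceeds limit, and ("undecided", A) otherwise.
    """
    a = a0
    for _ in range(max_steps):
        nxt = (1 + delta * c) * a + delta * drive
        if abs(nxt) > limit:
            return "diverged", nxt
        if abs(nxt - a) < tol:
            return "converged", nxt
        a = nxt
    return "undecided", a


delta = 0.5  # convergence window is -2/delta < c < 0, i.e. -4 < c < 0
for c in (-5.0, -3.5, -0.8, 0.5):
    status, total = run_total(c, delta)
    print(c, status, round(total, 3))
```

Values of $c$ inside the window settle near $-(E + R)/c$, monotonically for $-1/\delta < c < 0$ and with a damped oscillation for $-2/\delta < c < -1/\delta$; values on either side of the window grow without bound.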
This second phase transition recedes in importance as $\delta$ decreases in size, so that in the limit where equation 2.3 represents the first-order differential equation 2.4 this second phase transition no longer exists. Finally, we assumed that the parameters $c_s$, $c_p$, and $c_n$ are identical for each node in the above derivation as a matter of convenience. This is not essential: the same results are obtained with the relaxed condition that $c$, the sum of these three parameters, is the same for all nodes.

³ With $c_s = -\gamma$, $c_p = \alpha$, and $c_n = 0$ we have $\alpha/\gamma = 1$. We are essentially ignoring the connectivity parameter $\mu$ in Huberman and Hogg (1987) as their results concerning $\mu$ carry over unchanged. Most connectionist models are concerned with values $\mu > 1$ and we have assumed earlier that this holds here.

4 Simulations

Two sets of simulations verify the predicted phase transitions and fixed points and give information about the activation patterns that occur. Networks in these simulations always have $|N| = 100$ nodes. In each simulation all nodes have the same value $r_i = r$ and a constant external input of 1.0 is applied to an arbitrary node ($E = 1.0$). Simulations terminate either when $|A(t) - A(t - \delta)| < 0.001$ [in which case the simulation is said to converge and $A(t)$ at convergence is taken as the numerical approximation of the predicted value $A^*$] or when for any arbitrary node $i$, $|a_i(t)| > 1000$ (in which case the simulation is said to diverge). All simulations were implemented using MIRRORS/II (D'Autrechy et al. 1988), a general purpose connectionist modeling system, using double precision arithmetic on a VAX 8600.

The first set of 24 simulations verifies the $c = 0$ phase transition in networks with no internode inhibitory connections ($c_n = 0$). This result is of special interest to us because of its relevance to other ongoing research involving competitive activation mechanisms (Reggia et al. 1990). Each node is randomly connected to 10 other nodes. A resting weight $w_{ij}$ between 0.0 and 1.0 is randomly assigned to each connection according to a uniform distribution (resting weight $w_{ij}$ should not be confused with connection strength $c_{ij}$). The dynamically varying connection strength $c_{ij}$ to node $i$ from node $j$ is determined by

$$c_{ij}(t) = c_p \, \frac{w_{ij}\, a_i(t)}{\sum_{k \in P_j} w_{kj}\, a_k(t)} \qquad (4.1)$$
which can be seen to satisfy equation 2.1. Equation 4.1 implements a competitive activation mechanism because each node $j$ divides its total output among its immediate neighbors in proportion to neighbor node activation levels (see Reggia 1985; Reggia et al. 1988, 1990 for further explanation). Simulations have $\delta = 0.1$, $r = 0.0$, and initial node activations of $1 \times 10^{-9}$ (to avoid a divide-by-zero fault in equation 4.1). The value $c$ is incrementally varied between simulations from $-1.0$ to $-0.01$ and from 0.01 to 1.0 by changing $c_s$ and $c_p$. Figure 1 shows the total network activation at time of simulation termination as a function of $c$. The 12 simulations with $c < 0$ all converged on a fixed point with the numerically generated value $A(t)$ within 1% of the theoretically predicted value $A^* = -(E + R)/c$. For example, with $c = -0.8$ the predicted $A^* = 1.25$ was approximated by $A(t) = 1.24$ at convergence ($|\Delta A(t)| < 0.001$). The 12 simulations with $c > 0$ all diverged, that is, total network activation level grew without apparent bounds. Figure 2 shows the number of iterations $n$ until convergence ($c < 0$) or determination of divergence ($c > 0$) plotted against $c$. The phase transition at $c = 0$ is again evident. The set of 24 simulations described above was run three times with the same results; each time different random connections and randomly generated weights on those connections were used. The second set of simulations introduces negative time-varying connection strengths and also examines the lower phase transition $c = -2/\delta$ as $\delta$ varies. Each node in the network is connected to all other nodes, and resting weights are again randomly assigned according to a uniform
Figure 1: Total network activation $A(t)$ at time of termination of simulations as a function of $c$. Below the predicted phase transition at $c = 0$ all simulations converge on a value of $A(t)$ close to the predicted $A^*$ value (these nonzero values are too small to be seen precisely on the vertical scale used). Above the predicted phase transition all simulations diverge. The phase transition is quite crisp: at $c = -0.01$ convergence to 99.0 is seen ($A^* = 100.0$ predicted); at $c = 0.01$, divergence is seen. The curve shown represents 24 simulations. Repeating all of these simulations two more times starting with different random connections and resting weights produces virtually identical results.
distribution. Now, however, resting weights lie between $-1.0$ and 1.0 so, on average, half of the internode connections are negative/inhibitory and half are positive/excitatory. The dynamically varying connection strength on excitatory connections is again determined by equation 4.1; that on inhibitory connections is determined by equation 4.2, which satisfies equation 2.2 (note that $c_n < 0$). A value $r = 0.1$ is used and each node's initial activation level is set at $a_i(0) = -r/c$. Three variations of the second set of simulations are done, where the only difference between the three models tested is the value of $\delta$. The
Figure 2: Phase transition at $c = 0$ in a network with rapidly varying connection strengths ($c_n = 0$, $\delta = 0.1$, $r = 0$). Plotted are number of iterations $n$ until either convergence ($c < 0$) or divergence ($c > 0$) as a function of $c$. The curve shown represents 24 simulations.
three values of $\delta$ used are 0.400, 0.500, and 0.666. For each value of $\delta$, the phase transition at $c = 0$ is again verified in a fashion similar to that described above for the first set of simulations. In addition, the lower phase transition $c = -2/\delta$ is verified in an analogous fashion (see Fig. 3). The lower phase transition is observed to move as a function of $\delta$ exactly as predicted, with phase transitions at $c = -5$, $-4$, and $-3$ when $\delta = 0.400$, 0.500, and 0.666, respectively. In the second set of simulations, with both positive and negative connection strengths, the activation levels of individual nodes exhibit quite interesting behavior. As $c$ gradually decreases between each simulation from zero towards the lower phase transition at $-2/\delta$, at first not only $A(t)$ but also the maximum and minimum individual node activation levels steadily approach a fixed point. However, as $c$ gets closer to the lower transition point, in some simulations total network activation begins to oscillate. $A(t)$ still spirals in on the predicted $A^*$ value, but the maximum and minimum activation levels of individual nodes grow arbitrarily large. Thus, although the predicted phase transitions and fixed points for $A(t)$ are always confirmed, individual node activations might still diverge in the region near $-2/\delta$.
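The behavior described above, where $A(t)$ settles while individual nodes run off in opposite directions, can be reproduced in a two-node toy system. This is only an illustrative sketch: the linear update a_i(t + δ) = a_i(t) + δ[Σ_j c_ij a_j(t) + e_i], the fixed connection strengths, and the parameter values (self-gain 1.0 and mutual inhibition -2.0, so that c = -1) are assumptions chosen for the example and are not the networks simulated in the paper.

```python
def simulate_two_nodes(c_self=1.0, c_inhib=-2.0, delta=0.1,
                       e=(1.0, 0.0), steps=60):
    """Two mutually inhibiting nodes under an ASSUMED linear update rule.

    Each column of the strength matrix sums to c = c_self + c_inhib, so the
    total a1 + a2 obeys A <- A + delta * (c*A + E) with E = e[0] + e[1].
    """
    a1, a2 = 0.0, 0.0
    for _ in range(steps):
        da1 = c_self * a1 + c_inhib * a2 + e[0]
        da2 = c_inhib * a1 + c_self * a2 + e[1]
        # Update both nodes simultaneously from the old values.
        a1, a2 = a1 + delta * da1, a2 + delta * da2
    return a1, a2


a1, a2 = simulate_two_nodes()
total = a1 + a2  # approaches -E/c = 1.0 even though a1 and a2 diverge
print(f"a1 = {a1:.3e}, a2 = {a2:.3e}, total = {total:.4f}")
```

In this toy case the sum is multiplied by 1 + δc = 0.9 per step and converges, while the difference a1 - a2 is multiplied by 1 + δ(c_self - c_inhib) = 1.3 and grows without bound.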
Figure 3: Variation of the lower phase transition $c = -2/\delta$ as $\delta$ varies. Shown here is total number of iterations $n$ until convergence/divergence as a function of $c$. Values of $\delta$ used are 0.400 (solid line; predicted phase transition at $c = -5.0$), 0.500 (dotted line; predicted phase transition at $c = -4.0$), and 0.666 (mixed dots and dashes; predicted phase transition at $c = -3.0$).

5 Discussion
With the growing use of large connectionist models, awareness of phase transitions is becoming increasingly important. Accordingly, this paper has characterized through both analysis and computer simulations the phase transitions in connectionist models using a class of activation rules described by equation 2.3. In the special case where $\delta = 1$ and weights are appropriate nonnegative constants, equation 2.3 reduces to the model considered in Huberman and Hogg (1987). In this special case the results obtained here are consistent with those in Huberman and Hogg (1987) and Shrager et al. (1987), although an additional phase transition, not reported previously, exists when a priori assumptions about parameter values are relaxed. In addition, equation 2.3 encompasses a wide range of additional models with constant connection strengths in which spread of activation is determined by differential equations (equation 2.4), and which have a wider range of parameters over which the model produces useful behavior (i.e., as $\delta$ decreases the phase transition at $c = -2/\delta$ recedes in importance).
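The competitive allocation described in Section 4, in which each node j divides its total output among its neighbors in proportion to their activation levels, can be sketched as follows. The sketch assumes the allocation takes the form c_ij = c_p w_ij a_i / Σ_k w_kj a_k and that the constraint it satisfies is that the strengths leaving each node sum to c_p. This form, the function name, and the example values are assumptions made for illustration only.

```python
def competitive_strengths(w, a, c_p):
    """ASSUMED competitive rule: c[i][j] = c_p * w[i][j] * a[i] / sum_k w[k][j] * a[k].

    w[i][j] is the resting weight on the connection to node i from node j
    (0.0 where no connection exists); a[i] is the activation of node i.
    """
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    for j in range(n):
        total = sum(w[k][j] * a[k] for k in range(n))
        if total == 0.0:
            # The paper starts activations at 1e-9 to avoid this case.
            raise ZeroDivisionError(f"node {j} has no active neighbors")
        for i in range(n):
            c[i][j] = c_p * w[i][j] * a[i] / total
    return c


# Hypothetical three-node example with no self-connections.
w = [[0.0, 0.2, 0.5],
     [0.4, 0.0, 0.5],
     [0.6, 0.8, 0.0]]
a = [1.0, 3.0, 1.0]
c = competitive_strengths(w, a, c_p=1.0)

# Each column sums to c_p: node j's output is fully divided among its neighbors.
column_sums = [sum(c[i][j] for i in range(3)) for j in range(3)]
print(column_sums)
```

Here node 0 sends more of its output to node 1 than to node 2, even though the resting weight to node 2 is larger, because node 1 is three times as active.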
By allowing rapidly varying connection strengths, equation 2.3 is also directly applicable to connectionist models where inhibitory effects are produced not by static inhibitory connections but by dynamic competition for activation. Many associative networks in AI and cognitive science theories do not have inhibitory connections, so implementing these theories as connectionist models raises the difficult issue of where to put inhibitory connections and what weights to assign to them (see Reggia 1985; Reggia et al. 1988, 1990 for a discussion). A competitive activation mechanism resolves this issue by eliminating inhibitory connections between nodes ($c_n = 0$). According to the results in this paper, such models should require strong self-inhibitory connections to function effectively ($c_s < -c_p$). This conclusion is interesting because in some previous applications of competitive activation mechanisms it was observed that strong self-inhibition ($c_s \leq -2$ when $c_p = 1$) was necessary to produce maximally circumscribed network activation (Reggia et al. 1988; Peng et al. 1990). This empirical observation was not understood at the time. Although these connectionist models used activation rules comparable to but somewhat more complex than equation 2.3, it seems reasonable to explain the more diffuse activation patterns seen with $c_s \geq -1$ as occurring because the model was operating near a phase transition ($c = 0$). As noted earlier, it is important to recognize that the convergence of $A(t)$ on a fixed point when $-2/\delta < c < 0$ does not guarantee that each individual node's activation approaches a fixed point nor even that it is bounded. In this context it is interesting to note that as $c$ approached the lower phase transition $-2/\delta$ in networks with inhibitory connections (second set of simulations), individual node activation levels sometimes grew without apparent bounds.
The value $A(t)$ still approximated the predicted $A^*$ value because positive and negative node activations balanced one another. This raises the question of whether an additional phase transition between $-2/\delta$ and 0 exists, below which individual node activations might not be bounded. Such a phase transition might be derived, for example, if one could determine the convergence range for $\sum_i a_i^2$ rather than for $A = \sum_i a_i$, as was done in this paper. Such a determination is a difficult and open task. As a practical consideration, the problem of balanced divergent node activations can be minimized by using a value of $r > 0$, as in the second set of simulations. This has the effect of shifting node activation levels in a positive direction, thereby avoiding the balancing of positive and negative node activation levels. It should also be noted that the problem of unbounded growth of individual node activation in networks with fixed weights has been addressed elsewhere (Hogg and Huberman 1987, pp. 296-297). Introducing nonlinearities (e.g., hard bounds on node activation levels) can lead to saturation of node activation levels and moving wavefronts that propagate throughout a network. Finally, it should be noted that a variety of recent work with biologically oriented random/stochastic neural network models has also identified phase transitions (e.g., Chover 1988; Kryukov and Kirillov 1989; Kurten 1987). This related work has focused on parameters not considered here, such as the probability of neuron firing or neuron threshold. This related work as well as that of Huberman and Hogg (1987) and Shrager et al. (1987) and the results presented in this paper are collectively providing a better understanding of phase transitions in connectionist models across a broad spectrum of models and applications.
Acknowledgments
Supported in part by NSF award IRI-8451430 and in part by NIH award NS29414.
References

Bourret, P., Goodall, S., and Samuelides, M. 1989. Optimal scheduling by competitive activation: Application to the satellite antennas scheduling problem. Proc. Int. Joint Conf. Neural Networks, IEEE I, 565-572.
Chover, J. 1988. Phase transitions in neural networks. In Neural Information Processing Systems, D. Anderson, ed., pp. 192-200. American Institute of Physics, New York, NY.
D'Autrechy, C. L., Reggia, J., Sutton, G., and Goodall, S. 1988. A general purpose simulation environment for developing connectionist models. Simulation 51, 5-19.
Hogg, T., and Huberman, B. 1987. Artificial intelligence and large scale computation: A physics perspective. Phys. Rep. 156(5), 227-310.
Huberman, B., and Hogg, T. 1987. Phase transitions in artificial intelligence systems. Artificial Intelligence 33, 155-171.
Kohonen, T. 1984. Self-Organization and Associative Memory, Ch. 5. Springer-Verlag, Berlin.
Kryukov, V., and Kirillov, A. 1989. Phase transitions and metastability in neural nets. Proc. Int. Joint Conf. Neural Networks, IEEE I, 761-766.
Kurten, K. 1987. Phase transitions in quasirandom neural networks. Proc. Int. Joint Conf. Neural Networks, IEEE II, 197-204.
Peng, Y., and Reggia, J. 1989. A connectionist model for diagnostic problem solving. IEEE Trans. Syst. Man Cybernet. 19, 285-298.
Peng, Y., Reggia, J., and Li, T. 1990. A connectionist solution for vertex cover problems. Submitted.
Reggia, J. 1985. Virtual lateral inhibition in parallel activation models of associative memory. Proc. Ninth Int. Joint Conf. Artificial Intelligence, Los Angeles, CA, 244-248.
Reggia, J., Marsland, P., and Berndt, R. 1988. Competitive dynamics in a dual-route connectionist model of print-to-sound transformation. Complex Syst. 2, 509-547.
Reggia, J., Peng, Y., and Bourret, P. 1990. Recent applications of competitive activation mechanisms. In Neural Networks: Advances and Applications, E. Gelenbe, ed. North-Holland, Amsterdam, in press.
Shrager, J., Hogg, T., and Huberman, B. 1987. Observation of phase transitions in spreading activation networks. Science 236, 1092-1094.
Wald, J., Farach, M., Tagamets, M., and Reggia, J. 1989. Generating plausible diagnostic hypotheses with self-processing causal networks. J. Exp. Theor. Artificial Intelligence 1, 91-112.
Whitfield, K., Goodall, S., and Reggia, J. 1989. A connectionist model for dynamic control. Telematics Informatics 6, 375-390.
Received 22 May 90; accepted 10 August 90.
Communicated by Gunther Palm
An M-ary Neural Network Model
R. Bijjani
P. Das
Department of Electrical, Computer, and Systems Engineering, Rensselaer Polytechnic Institute, Troy, NY 12180-3590 USA
An M-ary neural network model is described, and is shown to have a higher error correction capability than the bipolar Hopfield neural net. The M-ary model is then applied to an image recognition problem.
Neural Computation 2, 536-551 (1990) © 1990 Massachusetts Institute of Technology

1 Introduction

Living organisms are very efficient in performing certain tasks, such as pattern recognition and adaptive learning. The growing demand for machines capable of performing similar tasks has accelerated the pace of brain research. The last four decades have yielded many theories and mathematical models loosely describing the operation of neurons, the basic computational elements and building blocks of the brain. These models came to be collectively known as neural network models. All neural nets share the common characteristic of processing information in a largely collective manner, as opposed to the predominantly serial manner of conventional computers. This collective nature arises from the complex structure of massively interconnected nerve cells or neurons. The brain stores information by modifying the strengths of the interconnections, or synaptic weights, between the neurons. The neurons themselves preserve no information; their role is limited to acting as simple nonlinear decision-making elements. Memory or recognition is the information retrieval process in which the brain categorizes incoming information by matching it with its stored data. For successful recognition, the incoming information must present a cue or association to the desired stored image, hence the term associative memory.

Neural nets can be classified according to their desired functions (associative memory, classifiers, etc.), the characteristics of their processing elements, the network layout or topology, and the learning rules (Athale 1987; Lippman 1987). This paper describes an M-ary extension of the Hopfield model (Hopfield 1984). The statistical properties of the new model are calculated from the point of view of error probability and error correction capability. The M-ary model's performance is then compared to that of the bipolar Hopfield model and is found to possess a higher error correction capability. Yet the Hopfield model is determined to possess a faster convergence rate for the case of moderately (less than about 40%) corrupted signals. Lastly, an application in image processing using the M-ary model is presented. The experimental outcomes are found to be in agreement with the theory presented.

2 The M-ary Model
The M-ary neural network is an associative memory system capable of discerning a highly distorted version of a stored vector from other retained information. The network consists of $N$ neurons, where each neuron can exist in one of $M$ distinct states. The fundamental algorithm for an M-ary system is as follows: a set of $p$ M-ary vectors, each of length $N$, are to be stored for future recollection. The vector components are represented by $\xi_i^\mu$, which signifies the $i$th component of the $\mu$th vector. The network is assumed to be globally interconnected, that is, all neurons are pairwise connected by synaptic weights. The vectors are stored collectively by modifying the values of the synaptic matrix $J$, whose elements $J_{ij}$ represent the weights of the interconnections between the elements or nodes $i$ and $j$. $J_{ij}$ is computed as the outer product of the stored vectors (equation 2.1), where $i, j = 1, 2, \ldots, N$. For satisfactory performance, we require that the components of the stored vectors be independent and identically distributed, that is, the vector components are permitted to take any of the equiprobable values $a_k$ drawn from an alphabet of size $M$. In other words:

$P\{\xi_i^\mu = a_k\} = \frac{1}{M}$ (2.2)

where $i = 1, 2, \ldots, N$; $\mu = 1, 2, \ldots, p$; and $k = 1, 2, \ldots, M$.

Figure 1: Multilevel threshold operation.

A content addressable memory system recovers a stored vector $\xi^\nu$, which it most closely associates with the input vector. We shall examine two likely occurrences. The first is when the system is initially presented at time $t = t_0$ with an input vector $S(t_0)$ contaminated with additive white gaussian noise, where the $j$th element is

$S_j(t_0) = \xi_j^\nu + n_j(t_0)$

the noise component $n_j(t_0)$ being a zero-mean gaussian random variable with a power spectral density $N_0/2$. The second occurrence is when the input vector $S(t_0)$ differs from the desired vector $\xi^\nu$ in $d$ elements and is identical to $\xi^\nu$ in the remaining $(N - d)$ elements. Let the output after some time $t \geq t_0$ be $S(t + 1)$, with the index $(t + 1)$ representing the time at one neural cycle time (iteration) following the discrete time $t$. The retrieval process does not require any synchronization. The individual neurons observe the following routine in a rather random and asynchronous manner:

$S_i(t + 1) = f[h_i(t + 1)]$ (2.3)

where $h_i(t + 1)$ is the net input to neuron $i$, and where the activation function $f(h)$ shown in Figure 1 represents a multilevel thresholding operation defined as
f(h)= ak
if and only if
w-1
+ Q'R 2